Optimizing DirectX on Multi-Core Architectures


This slide set covers best practices in designing threaded rendering in PC games. Examples of current PC titles will be used throughout the talk to highlight the various points.

Transcript

  • 1. Game Developers Conference 2008: Optimizing DirectX on Multi-core Architectures. Leigh Davies, Senior Application Engineer, Intel, February 2008 [email_address]
    • Contributions from:
      • David Potages, GRIN*
      • Jeff Andrews, Intel®
      • Rita Turkowski, Intel®
      • Kev Gee, Microsoft*
    *Other names and brands may be claimed as the property of others
  • 2. Legal Disclaimer
    • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
    • Intel may make changes to specifications and product descriptions at any time, without notice.
    • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
    • Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
    • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
    • Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
    • *Other names and brands may be claimed as the property of others.
    • Copyright © 2008 Intel Corporation.
  • 3. Agenda
    • Graphics and the CPU
    • Profiling Graphics and Drivers
    • Threading the render thread
    • Case Study GRIN *
    • Summary
  • 4. Graphics is CPU Intensive.
    • D3D Runtime and Driver account for 25-40% of CPU cycles per frame
    [Chart: CPU cycles per frame for World in Conflict*, Bionic Commando*, Crysis* CPU Benchmark and Crysis* GPU Benchmark. Legend: Application, D3D Runtime, Driver, Other.]
    **Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
  • 5. Designing the Rendering Pipeline.
    • Analyze the whole program
      • Your Application
      • Direct API usage and overheads
      • Video card driver
    • Have Defined Performance Goals
      • Use key game play targeted scenarios for perf analysis
      • Build benchmarks / test levels
    [Diagram: Application, Direct3D* Runtime, Command Buffer, Software Driver, Video Card. Figure labels: Render Functions; World in Conflict*.]
    • Cycle counts per DX9 API call**
      • SetVertexShader: 3000-12100
      • SetPixelShaderConstant: 1500-9000
      • SetTexture: 2500-3100
      • DrawPrimitive: 1050-1150
      • ZFUNC: 510-700
    **Timings taken from msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
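To put the cycle counts above in perspective, the following is a minimal sketch of a per-frame budget calculation. The 2.67 GHz clock comes from the test system on slide 4 and the 1150-cycle figure is the upper bound quoted for DrawPrimitive; the 60 fps target and the 2000 draw calls per frame are hypothetical values chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cycles available per frame on one core: clock rate divided by target frame rate.
constexpr std::uint64_t cycles_per_frame(double clock_hz, double target_fps) {
    return static_cast<std::uint64_t>(clock_hz / target_fps);
}

// Upper-bound CPU cost of issuing `calls` API calls at `cycles_per_call` each.
constexpr std::uint64_t api_cost(std::uint64_t calls, std::uint64_t cycles_per_call) {
    return calls * cycles_per_call;
}

// Fraction of the frame budget consumed by a given cost.
constexpr double budget_fraction(std::uint64_t cost, std::uint64_t budget) {
    return static_cast<double>(cost) / static_cast<double>(budget);
}
```

Under these assumptions, 2000 DrawPrimitive calls alone would use roughly 5% of one core's frame budget, before any state changes are counted.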
  • 6. Balancing Future Workloads
    [Figure: Intel® roadmap, alternating Tick (compaction/derivative) and Tock (new microarchitecture). 65nm, 2 years: Intel Core™ Duo · Pentium-D (Tick), then Intel Core™ Microarchitecture: Intel Core™2 Duo, DC Intel Xeon® 5100 (Tock). 45nm, 2 years: PENRYN (Tick), then NEHALEM (Tock).]
    • Scalable & Configurable Cache, Interconnects & Memory Controllers
    • Scalable Performance: 1 to 8 Threads & 1 to 4 Cores
  • 7.
    • Be realistic: rendering costs CPU time
    • The rendering thread is a potential bottleneck for N-core scaling
    • Rendering costs are likely to increase as you add more physics, effects or even AI objects
    • Runtime and driver costs are significantly higher on the PC than on the consoles
    • Use performance analysis results to focus development efforts
    • Analyze regularly and catch regressions early
    Time is Money. Optimise the graphics thread. Offload as much as possible.
  • 8. Agenda
    • Graphics and the CPU
    • Profiling Graphics and Drivers
    • Threading the render thread
    • Case Study GRIN
    • Summary
  • 9. Overview of Graphics Driver Models
    • Windows* XP Display Driver Model (XPDM) - DX9
      • The Kernel mode driver controls threading
    • Windows Vista * Display Driver Model WDDM - DX9
      • The D3D9 runtime manages creation of threads
      • One is created specifically for the User Mode Driver (UMD)
    • Windows Vista Display Driver Model WDDM - DX10
      • The Driver is responsible for creating threads
      • Currently released drivers don’t thread
      • Could change in the near future
    Graphics driver can have a major impact on performance and multi-core scaling.
  • 10. Profiling Tools
    • Need to use a variety of tools:
      • Use a repeatable workload
    • CPU Tools:
      • VTune™ Performance Analyzer
      • Intel® Thread Profiler
      • PIX for Windows*
      • AMD CodeAnalyst™
    • GPU Tools:
      • PIX for Windows with vendor plugins
      • NVIDIA* PerfHUD
      • ATI* PerfStudio
  • 11. Profiling Graphics with VTune™ Analyzer
    • Select Counter Monitor for a quick overview;
    • Not necessary to launch the app
    • Disable display of counter data unless running windowed
    • Profile across a selection of configurations
      • Identify different bottlenecks based on h/w limitations
      • “Works great on my machine” isn’t good enough
  • 12. VTune™ Performance Analyzer - Sampling
    • Calibration isn’t needed for games
    • Delay sampling allows alt-tab or bypass loading
    • Tracking core usage needs to be added
    • Privileged time shows time inside Kernel
  • 13. VTune ™ Analyzer Views
    • Processor Usage
    • Memory Usage
    • Context Switching
    • CPU Frequency
    VTune ™ Analyzer allows you to add your own counters.
  • 14. Sampling - Display Model XPDM
    [Diagram: Application, D3D Runtime, Win32k & Dxg, Display Driver, Miniport Driver and Videoport, arranged across User Mode, Kernel Mode and Session Space.]
  • 15. Sampling - Display Model WDDM
    [Diagram: Application Process (Application, D3D Runtime, User Mode Driver) and DWM Process (DWM), with Win32k, CDD, Dxgkrnl and Kernel Driver, arranged across User Mode, Kernel Mode and Session Space.]
  • 16. Associating Symbols in VTune ™ Analyzer
    • Configure->Options->Directories->Symbol Repository
    • View Symbol Repository->Delete unassociated modules
    • In Tuning Browser select "Results" -> "Module Associations..."
    • Edit symbol associations
  • 17. Symbol Information for DX10Core.dll
    • Symbols taken while profiling the SoftParticle sample from the SDK
  • 18. PIX for Windows CPU GPU
    • Gathering GPU events requires Windows Vista
    • Cross over between PIX and VTune ™ Counters
    • Easy to see CPU/GPU headroom
  • 19. Intel ® PIX Plug-in: Beta Available Now
    • Provides access to Intel ® Counters in PIX
    • Rollout now to support IIG Profiling
    • Metrics (#, Metric Name: Description)
      • 1, Frame Time: Instantaneous frame time in milliseconds.
      • 2, Frames per Second: Instantaneous frame rate normalized to seconds (inverted frame time).
      • 3, Driver Time: The amount of time spent in the display driver, normalized to milliseconds.
      • 4, Driver Time Stalled: The amount of time spent in the display driver either busy stalled or in a sleep state, normalized to milliseconds.
      • 5, Graphics Memory Used - MB: The amount of graphics memory currently utilized, normalized to MB.
      • 6, Graphics Memory Used - bytes: The amount of graphics memory currently utilized, normalized to bytes.
      • 7, Texture Memory Used: The amount of texture memory currently utilized, normalized to MB.
      • 8, GPU Busy: The percent utilization of the front end of the GPU. This metric describes the incoming command stream and does NOT describe the utilization of the array of execution units (cores).
      • 9, Cores Busy: The percentage of time that any core in the array is either actively executing instructions or stalled.
      • 10, Cores Active: The percentage of time that the core array is actively executing instructions.
      • 11, Vertex Count: The number of vertices that entered the pipeline.
      • 12, Triangle Count: The number of triangles that flowed through the pipeline prior to any clipping or culling.
      • 13, Texel Count: The number of texels that were fetched by the pipeline.
      • 14, Pixels Drawn: The number of pixels that were actually written to the render target.
      • 15, Mathbox Utilization: The aggregated percentage of time that the mathbox was actively executing instructions.
      • 16, Texture Unit(s) Utilization: The aggregated percentage of time that the texture units were actively processing texels.
  • 20. Agenda
    • Graphics and the CPU
    • Profiling Graphics and Drivers
    • Threading the render thread
    • Case Study GRIN
    • Summary
  • 21. Starting Points
    • Common Issues:
      • Naive Ports to Windows from console models
      • Excessive context switching/synchronization overhead
      • Work starvation due to thread sync dependencies
    • General Rules
      • Use only 1 heavy weight thread per Core on Windows
      • Manage Job distribution
      • The OS scheduler knows best
      • Consider memory bandwidth
    • Multi-core and D3D Usage
      • Avoid Use of the D3DCREATE_MULTITHREADED flag
      • You CAN manage synch costs better
      • Design around a single threaded D3D Device Access model
      • Lock resources from main thread, manually protect access
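As an illustration of the "1 heavy weight thread per core" rule above, here is a minimal sketch of how a worker count might be chosen. The function names and the choice to reserve two cores (for example for the main and render threads) are hypothetical, not taken from the slides.

```cpp
#include <cassert>
#include <thread>

// Number of heavyweight worker threads to create, given the logical
// processor count and how many cores are already claimed (for example by
// the main thread and the render thread). Always returns at least 1.
inline unsigned worker_thread_count(unsigned hardware_threads, unsigned reserved) {
    if (hardware_threads <= reserved) return 1;
    return hardware_threads - reserved;
}

// std::thread::hardware_concurrency() may return 0 when the count is
// unknown, which the helper above treats as "create a single worker".
inline unsigned default_worker_thread_count() {
    return worker_thread_count(std::thread::hardware_concurrency(), 2);
}
```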
  • 22. Making the Drivers Work for You!
    • Pack your DrawPrimitive2 calls together
    • Frequently creating & destroying shaders, VB, IB, and surfaces will impact performance
    • Avoid allocating too many system memory resources
    • DrawPrimitiveUP or DrawIndexedPrimitiveUP
    [Diagram: App, D3D Runtime and D3D Driver, shown in two configurations.]
    • Potential 20%+ speed gain.
    • Can be disabled by application behaviour.
    • Producer & Consumer threads dispatch commands to GPU
  • 23. Making the Drivers Work for You!
    • Avoid any calls that return GPU state information; they require a CPU thread synchronization
    • Driver queries are OK (calls are asynchronous)
    • Do not lock threads to a specific CPU!
    • Group all resource updates (texture and vertex) together once per frame. The beginning or end of the frame is fine; just don’t scatter them among drawing calls
    • Minimize use of any locks/unlocks
    • System Memory Vertex Buffers
      • D3DUSAGE_DYNAMIC, use with D3DUSAGE_WRITEONLY
      • Lock with D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE
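The choice between D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD is commonly made by treating a dynamic buffer as an append-only region that is discarded when full. The sketch below models only that decision logic in portable C++; it makes no Direct3D calls, and `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical names standing in for the real flags and buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset to write at
    LockMode mode;       // flag the caller would pass when locking
};

// Chooses where to append data in a dynamic buffer and which lock flag applies.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : capacity_(capacity) {}

    // Precondition: bytes <= capacity.
    LockRequest reserve(std::size_t bytes) {
        if (next_ + bytes > capacity_) {
            // Out of room: restart at offset 0 and request fresh memory, so
            // the GPU can keep reading the old contents without a stall.
            next_ = bytes;
            return {0, LockMode::Discard};
        }
        // Room left: append after data the GPU may still be reading.
        LockRequest request{next_, LockMode::NoOverwrite};
        next_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};
```

A caller would pass the returned offset and the matching flag to the buffer's lock call, then draw from that offset.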
  • 24. Threading Issues
    • Race Conditions between threads.
      • Object Updates
      • Creation/deletion of objects
    • False sharing of data between threads.
    • Accessing hardware resources.
    [Diagram: timeline. Main Thread (Frame n): Move Object X, Delete Object Y. Render Thread (Frame n-1): Render Object X, Render Object Y.]
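False sharing occurs when two threads write to different variables that happen to share a cache line. A minimal sketch of one common remedy, padding per-thread data to a cache line, follows. The 64-byte line size is an assumption that holds for many x86 processors, and the struct names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines, common on x86 processors.
constexpr std::size_t kCacheLineSize = 64;

// Aligning the struct to a cache line also rounds its size up to one line,
// so two adjacent counters can never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
    long value = 0;
};

// One counter per thread; each thread writes only its own member.
struct ThreadStats {
    PaddedCounter main_thread;
    PaddedCounter render_thread;
};
```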
  • 25. Threading Options
    [Diagram: Front-end Logic and Back-end Render separated by an end-of-frame (EOF) sync, compared with Front-end Logic feeding Back-end Render through a command queue (pipeline with a consumer thread).]
    • Avoiding the Issues
      • Use a lightweight update queue (lock-free?)
      • Make duplicate objects / double-buffer them
      • Reference count objects
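Reference counting addresses the "delete on one thread, render on the other" race: a queued command keeps its object alive until the command has been consumed. The following is a minimal single-threaded sketch of that ownership rule using `std::shared_ptr`; `Mesh`, `RenderCommand` and `make_draw_command` are hypothetical names, and the returned vertex count stands in for a real draw call.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Hypothetical renderable object.
struct Mesh {
    int vertex_count = 0;
};

using RenderCommand = std::function<int()>;

// The command captures a shared_ptr, so the mesh stays alive until the
// command itself is destroyed, even if the main thread drops its reference.
inline RenderCommand make_draw_command(std::shared_ptr<const Mesh> mesh) {
    return [mesh] { return mesh->vertex_count; };  // stand-in for a draw call
}
```

In a threaded setup the queue itself still needs synchronization; this sketch only shows object lifetime.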
  • 26. Buffering Dynamic Data
    • Partially buffered locks consume more video memory.
    • Fully Buffered consume more system memory and have an associated CPU cost for memory copying.
    [Diagram: fully buffered locks versus partially buffered locks. While the main thread (Frame n) modifies Vertex Buffer 0, the render thread (Frame n-1) renders the object from Vertex Buffer 1. In the next frame the main thread (Frame n+1) modifies Vertex Buffer 1 while the render thread (Frame n) renders from Vertex Buffer 0.]
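The alternating use of two buffers can be sketched as a small double-buffer wrapper: the main thread writes one copy while the render thread reads the other, and the roles swap at the end of the frame. This is an illustrative sketch only; `DoubleBuffered` is a hypothetical name, and it assumes the swap is called at a point where both threads are synchronized.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <typename T>
class DoubleBuffered {
public:
    // Main thread: the copy being modified for frame n.
    T& write() { return buffers_[write_index_]; }

    // Render thread: the copy completed in frame n-1.
    const T& read() const { return buffers_[1 - write_index_]; }

    // Call only at the end-of-frame sync point, when neither thread is
    // touching the buffers.
    void swap_at_end_of_frame() { write_index_ = 1 - write_index_; }

private:
    std::array<T, 2> buffers_{};
    std::size_t write_index_ = 0;
};
```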
  • 27. Sub Threading Options
    [Diagram: Front-end Logic and Back-end Render separated by EOF, each handing jobs to a job queue.]
    • Job Queue offloads
      • Software Visibility Culling
      • Particle generation
      • Character Skinning
      • Procedural updates
    • Reduces path size through both front and back ends
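A job queue of the kind listed above can be sketched with an atomic counter that hands out job indices to a small pool of threads. This is a simplified fork-join illustration, not the engine's implementation; `run_jobs` is a hypothetical name, and each job is assumed to touch only its own slice of the data.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs job(0) .. job(job_count - 1) across worker_count threads, counting
// the calling thread as one of them, and returns when all jobs are done.
inline void run_jobs(std::size_t job_count, unsigned worker_count,
                     const std::function<void(std::size_t)>& job) {
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        // Each fetch_add claims the next unprocessed job index.
        for (std::size_t i = next.fetch_add(1); i < job_count; i = next.fetch_add(1)) {
            job(i);
        }
    };
    std::vector<std::thread> threads;
    for (unsigned w = 1; w < worker_count; ++w) threads.emplace_back(worker);
    worker();  // the calling thread participates too
    for (auto& t : threads) t.join();
}
```

For example, a particle update could be split into one job per block of particles, with each job writing only to its own block.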
  • 28. Threading the DX API
    • Similar to DX9 threading in the runtime
      • Potentially repeating the same work
    • Potential to move simple API code out of main thread, i.e. state management
    • DX10 has lower runtime costs
    [Diagram: DX9 Render System calls D3D9Wrapper, D3DDevice9 Wrapper and D3DVertexBuffer9 Wrapper, which forward to D3D9, D3DDevice9 and D3DVertexBuffer9, then to the Graphics Driver and Graphics Device.]
    • Theoretical increase*: DX9 16%, DX10 39%
    • DX9, before offloading
      • Main Thread: 46.46 (15.82% in DX9)
      • NVIDIA driver: 23.02
      • Physics: 10.91
      • Other threads: 19.35
    • DX9, with DX API thread
      • Main Thread: 39.08
      • DX API Thread: 7.38
      • NVIDIA driver: 23.02
      • Physics: 10.91
      • Other threads: 19.35
    • DX10, before offloading
      • Main Thread: 63.84 (28.39% in DX10+Driver)
      • Physics: 13.95
      • Other threads: 21.88
    • DX10, with DX API thread
      • Main Thread: 45.72
      • DX API Thread: 18.12
      • Physics: 13.95
      • Other threads: 21.88
    *Theoretical increase based on amount of API work offloaded; does not include threading overhead. Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
  • 29. Agenda
    • Graphics and the CPU
    • Profiling Graphics and Drivers
    • Threading the render thread
    • Case Study GRIN
    • Summary
  • 30. Case study: Grin’s engine*. David Potages, Senior Engine Architect, GRIN, February 2008 [email_address]
    *Performance figures discussed in this case study refer to a pre-release version of the game. They are subject to change before release and are for illustration only.
  • 31. Quick Engine Overview
    • 3rd generation of threaded engine
    • 2nd generation of threaded renderer
    • Used in several games
  • 32. Quick Engine Overview
    • Not game specific: game code in Lua scripts
    • Allows hot-reload, no link time, custom debugger
    • But single threaded, a lot of memory allocations
    • Deferred rendering
      • DX9 – DX10 being implemented
    • Libraries:
      • PhysX ™
      • OpenAL
      • Bink*
    All the technology choices have great impact on the possible parallelization!
  • 33. Why multi-threading?
    • Poor CPU usage
      • Can go down to 30%
    • A lot of time spent in D3D/driver
      • 35-45%*
    • But a lot of the application time is dedicated to rendering
      • Up to 37%*
      • Grand total of 53%* of frame with D3D/driver
    [Chart legend: Application, D3D Runtime, Driver, Other]
    *Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
  • 34. Why multi-threading the renderer?
    • Simplified pipeline (ST version)
    • Rendering is an easy target for multithreading: low system dependencies, 53% of frame time
    • But easier said than done!
    • Some systems or the drivers they use can take advantage of multi-cores
    • Rendering has low dependencies with other systems, but big data dependencies
    [Diagram: simplified single-threaded pipeline. Stages: World update, Script update, Sound, Network, Culling, Particles batch optimizations, Rendering. Libraries: Lua*, PhysX™, OpenAL*.]
  • 35. Implementation Details
    • Main thread
      • Entity/World updates, Animations, Input, Network, Lua, SoundSystem, Physics (main)
    • Renderer thread
      • Culling (including software occlusion queries)
      • Particle effects batch optimizations
      • RenderDevice (D3D)
      • Win32 messaging
    • Other
      • File streaming
      • PhysX ™ threads
      • Driver threads
  • 36. Implementation Details
    • Messages sent to the renderer
      • Non blocking:
        • render_scene
        • render_frame
        • update_window
        • Etc
      • Blocking:
        • flush_pipe
    • flush_pipe forces the renderer to execute all the queued jobs => synchronization point
      • Used between frames on main thread
      • Can be used to ensure that data (eg Textures) is ready
    [Diagram: timeline of Front-end Logic and Back-end Render, showing Flush, Sync and Idle periods.]
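The split between non-blocking messages and a blocking flush_pipe can be sketched with a consumer thread and a promise used as a marker in the queue. This is an illustration under assumptions, not GRIN's implementation: `RenderPipe` and `send` are hypothetical names, and messages are modelled as plain callables.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class RenderPipe {
public:
    RenderPipe() : thread_([this] { run(); }) {}

    ~RenderPipe() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_one();
        thread_.join();  // remaining messages are drained before exit
    }

    // Non-blocking: queue a message for the renderer and return immediately.
    void send(std::function<void()> message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(message));
        }
        wake_.notify_one();
    }

    // Blocking: returns once every message queued before it has executed.
    void flush_pipe() {
        auto reached = std::make_shared<std::promise<void>>();
        std::future<void> done = reached->get_future();
        send([reached] { reached->set_value(); });
        done.wait();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> message;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return quit_ || !queue_.empty(); });
                if (queue_.empty()) return;  // quit requested and queue drained
                message = std::move(queue_.front());
                queue_.pop();
            }
            message();  // run outside the lock so send() is never blocked by it
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> queue_;
    bool quit_ = false;
    std::thread thread_;  // declared last so the other members exist before run() starts
};
```

Because messages execute in order, the flush marker completing implies all earlier messages have completed, which is what makes it usable as a synchronization point.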
  • 37. Implementation Details
    • State needs to be mirrored
    • State changes are queued, and applied during the freeze
    • The proper state is returned depending on the calling thread
    This avoids contention when data is accessed in the renderer, but mirror only what is required
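A minimal sketch of mirrored state follows, assuming one copy per thread, a queue of pending changes, and a sync step that runs only while both threads are stopped. `MirroredState` and its member names are hypothetical; the slides do not show the engine's actual interface.

```cpp
#include <cassert>
#include <thread>
#include <vector>

template <typename T>
class MirroredState {
public:
    explicit MirroredState(T initial) : main_copy_(initial), render_copy_(initial) {}

    // Records which thread is the renderer so get() can tell callers apart.
    // Call before the two threads start using the state concurrently.
    void bind_render_thread(std::thread::id id) { render_thread_ = id; }

    // Main thread: update its own copy now and queue the change for the renderer.
    void set(T value) {
        main_copy_ = value;
        pending_.push_back(value);
    }

    // Returns the copy that belongs to the calling thread.
    T get() const {
        return std::this_thread::get_id() == render_thread_ ? render_copy_ : main_copy_;
    }

    // Call only while both threads are stopped at the sync point.
    void apply_pending() {
        for (const T& value : pending_) render_copy_ = value;
        pending_.clear();
    }

private:
    T main_copy_;
    T render_copy_;
    std::vector<T> pending_;
    std::thread::id render_thread_;
};
```

The renderer keeps seeing the old value until the sync point, so it never reads a value that the main thread is in the middle of changing.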
  • 38. Results
    • Better CPU usage 40-60%*
    • Better threads workload
    *Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2Gig memory.
  • 39. Results: Rendering Performance
    • Better FPS
      • 4C MT is 1.88x faster than 1C *
      • 4C MT is 1.20x faster than 4C ST *
    • Analysis
      • Remember that the drivers are partially threaded: we save up to 17%, plus the share of D3D/driver time that is not threaded
      • Close to 1.20x
      • If D3D/driver were completely threaded, the new frame time would be 1 - 0.17 = 83% of the old frame time, and the scale-up would be fps_new / fps_old = time_old / time_new = time_old / (time_old × 0.83) = 1.20
      • Maximum scale-up vs. 1C is 2.12x
      • Context switches, cache misses and contention slow us down.
      • Render-thread bound
    *Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
    • Effect on a low physics/gameplay workload
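The scale-up arithmetic in the analysis can be checked with a one-line helper: if a fraction of the frame time is removed from the critical path, the frame rate scales by 1 / (1 - fraction). The function name is hypothetical. The 17% input is from the slide; using the 53% figure from the earlier "Why multi-threading?" slide as the input for the maximum is an assumption, though it gives about 2.13x, in line with the 2.12x quoted.

```cpp
#include <cassert>

// Frame-rate scale-up when `saved_fraction` of the frame time is removed
// from the critical path: fps_new / fps_old = 1 / (1 - saved_fraction).
constexpr double scale_up(double saved_fraction) {
    return 1.0 / (1.0 - saved_fraction);
}
```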
  • 40. Improvements
    • Threading some parts of the render thread
        • E.g.: culling (~9-25%* of the render thread)
    • Reducing contentions
        • Mainly memory
    • Batch more
        • E.g.: Effects
    • Triple buffering?
    *Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
  • 41. Scalability
    • We can push for instance more physics/effects, while we are render-thread bound, or more AI
    • But hard to find the right balance between CPU and GPU workload!
    • Example: falling cars aka pushing more physics
  • 42. Scalability
    • ~256 cars falling and bouncing
    • 4C MT is 1.42x* faster than 4C ST, and 3.23x* faster than 1C
    • PhysX ™ helped us a lot to propagate the workload, but occupies the other cores quite heavily, thus preventing D3D/drivers to take advantage of them.
    • Rendering overhead was not that big with the additional units since they batch well.
    *Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2Gig memory, Windows Vista™ Ultimate.
  • 43. Issues
    • A proper benchmark system is required
      • A fly-through benchmark is not enough!
      • The CPU & GPU workloads vary a lot on different maps
    • Easy to forget data that needs to be mirrored
    • Lock-free algorithms are nice, but should be used with care
    • Memory contention + cache misses + false sharing
    • Behaviour of drivers varies quite a lot
  • 44. Agenda
    • Graphics and the CPU
    • Profiling Graphics and Drivers
    • Threading the render thread
    • Case Study GRIN
    • Summary
  • 45. Summary/Conclusion
    • The graphics pipeline is still very CPU intensive
    • Future CPUs will have an increasing number of logical processors
    • It is worth threading your renderer as much as possible if you want to be able to push more things in your game
    • It is hard to balance the workloads, though; you need to profile the whole system
    • Making the most of the graphics driver is essential
  • 46. References:
    • Accurately Profiling Direct3D API Calls.
      • msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
    • Debugging Tools and Symbols: Getting Started
      • www.microsoft.com/whdc/devtools/debugging/debugstart.mspx
    • Threading the OGRE3D Render System
      • www.intel.com/cd/ids/developer/asmo-na/eng/dc/games/331359.htm