Threading Successes 03 Gamebryo


  • Floodgate is a cross-platform stream processing engine that enables developers to exploit the data-processing power of multi-processor platforms.

1. Emergent Game Technologies Gamebryo Element Engine: Thread for Performance

2. Goals for Cross-Platform Threading
   • Play well with others
   • Take advantage of platform-specific performance features
   • For engines/middleware, be adaptable to the needs of customers

3. Write Once, Use Everywhere
   • Underlying multi-threaded primitives are replicated on all platforms
      ◦ Define cross-platform wrappers for these
   • Processing models can be applied on different architectures
      ◦ Define cross-platform systems for these
   • A typical developer writes once, yet the code performs well on all platforms

4. Emergent's Gamebryo Element
   • A foundation for easing cross-platform and multi-core development
      ◦ Modular, customizable
      ◦ Suite of content pipeline tools
      ◦ Supports PC, Xbox, PS3 and Wii
   • Booth #5716, North Hall

5. Cross-Platform Threading Requires Common Primitives
   • Threads
      ◦ Something that executes code
      ◦ Sub-issues: local storage, priorities
   • Data locks / critical sections
      ◦ Manage contention for a resource
   • Atomic operations
      ◦ An operation that is guaranteed to complete without interruption from another thread

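The deck does not show Gamebryo's own wrappers for these primitives, but all three map directly onto the C++ standard library on modern toolchains. A minimal sketch, using only `std::thread`, `std::mutex`, and `std::atomic` (all names here are illustrative, not Gamebryo API):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Shared counter updated with an atomic operation: the
// read-modify-write completes without interruption from other threads.
std::atomic<int> g_atomicCount{0};

// Shared counter protected by a lock (critical section) instead.
int g_lockedCount = 0;
std::mutex g_lock;

void IncrementBoth(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        g_atomicCount.fetch_add(1);                 // atomic operation
        std::lock_guard<std::mutex> guard(g_lock);  // critical section
        ++g_lockedCount;
    }
}

// Launch threadCount threads, join them, and return the atomic total.
int RunThreads(int threadCount, int iterations)
{
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t)
        threads.emplace_back(IncrementBoth, iterations);
    for (auto& th : threads)
        th.join();
    return g_atomicCount.load();
}
```

Both counters end up correct; the atomic version avoids the lock entirely for this single-operation case, which is exactly why engines expose atomics as a separate primitive.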
6. Choosing a Processing Model
   • Architectural features drive the choice
      ◦ Cache coherence
      ◦ Prefetch on Xbox
      ◦ SPUs on PS3
      ◦ Many processing units
      ◦ General-purpose GPU
   • Stream processing fits these properties
      ◦ Provide infrastructure to compute this way
      ◦ Shift engine work to this model

7. Stream Processing (Formal)
   Wikipedia: Given a set of input and output data (streams), the principle
   essentially defines a series of compute-intensive operations (kernel
   functions) to be applied for each element in the stream.
   [Diagram: Input 1 and Input 2 flow through Kernel 1 and Kernel 2 to Output.]

8. Generalized Stream Processing
   • Improvements for general-purpose computing
      ◦ Partition streams into chunks
      ◦ Kernels have access to the entire chunk
      ◦ Parameters for kernels (fixed inputs)
   • Advantages
      ◦ Reduces the need for strict data locality
      ◦ Enables loops and non-SIMD processing
      ◦ Maps better onto hardware

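The chunked model above can be sketched in a few lines of C++. The kernel receives a whole chunk rather than one element, so it is free to loop and use non-SIMD logic; each chunk is independent, so chunks could run on different cores or SPUs. The names (`ChunkKernel`, `ProcessInChunks`) are illustrative, not Floodgate API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A kernel operates on an entire chunk (block) of the stream.
using ChunkKernel = void (*)(const float* in, float* out, std::size_t count);

// Example kernel: double every element in its chunk.
void Times2Chunk(const float* in, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * 2.0f;
}

// Partition the input stream into fixed-size chunks and run the
// kernel on each one; the final chunk may be shorter.
std::vector<float> ProcessInChunks(const std::vector<float>& input,
                                   std::size_t chunkSize,
                                   ChunkKernel kernel)
{
    std::vector<float> output(input.size());
    for (std::size_t start = 0; start < input.size(); start += chunkSize)
    {
        std::size_t count = std::min(chunkSize, input.size() - start);
        kernel(input.data() + start, output.data() + start, count);
    }
    return output;
}
```

In a real implementation the chunk size would be tuned to the platform (cache line multiples, DMA transfer size, SPU local store), which is exactly the tuning Floodgate hides.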
9. Morphing+Skinning Example
   [Diagram: Morph Target 1 Vertices, Morph Target 2 Vertices, and Morph
   Weights feed the Morph Kernel (MK), producing Skin Vertices; these,
   together with Bone Matrices and Blend Weights, feed the Skinning
   Kernel (SK), which outputs Vertex Locations.]

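The diagram names the Morph Kernel's inputs but not its arithmetic. A standard formulation is a per-vertex linear blend of the two targets by the morph weights; the function and parameter names below (`MorphKernel`, `w1`, `w2`) are hypothetical, a sketch of that standard blend rather than Floodgate's actual kernel:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical morph kernel: blend two morph-target vertex streams
// with fixed weights (assumed w1 + w2 = 1), producing the skin
// vertices that the skinning kernel consumes next.
std::vector<float> MorphKernel(const std::vector<float>& target1,
                               const std::vector<float>& target2,
                               float w1, float w2)
{
    std::vector<float> out(target1.size());
    for (std::size_t i = 0; i < target1.size(); ++i)
        out[i] = w1 * target1[i] + w2 * target2[i];
    return out;
}
```

Note that the weights are "fixed inputs" in the slide's terms: the same two scalars are parameters to every kernel instance, while the vertex streams are the partitioned data.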
10. Morphing+Skinning Example
    [Diagram: the vertex streams are partitioned in two. MT 1 V and
    MT 2 V Parts 1 and 2, with the Morph Weights (MW) as a fixed input,
    feed MK Instances 1 and 2, producing Skin V Parts 1 and 2; these,
    with Matrices and Weights as fixed inputs, feed SK Instances 1 and
    2, producing Verts Parts 1 and 2.]

11. Floodgate
   • Cross-platform stream processing library
   • Optimized per-platform implementation
   • Documented API for customer use
   • The engine uses the same API for built-in functionality
      ◦ Skinning, morphing, particles, instance culling, ...

12. Floodgate Basics
   • Stream: a buffer of varying or fixed data
      ◦ A pointer, length, stride, locking
   • Kernel: an operation to perform on streams of data
      ◦ Code implementing an "Execute" function
   • Task: wraps a kernel and its I/O streams
   • Workflow: a collection of Tasks processed as a unit

13. Kernel Example: Times2

    // Include kernel definition macros
    #include <NiSPKernelMacros.h>

    // Declare the Times2Kernel
    NiSPDeclareKernel(Times2Kernel)

14. Kernel Example: Times2

    #include "Times2Kernel.h"

    NiSPBeginKernelImpl(Times2Kernel)
    {
        // Get the input stream
        float* pInput = kWorkload.GetInput<float>(0);

        // Get the output stream
        float* pOutput = kWorkload.GetOutput<float>(0);

        // Process data
        NiUInt32 uiBlockCount = kWorkload.GetBlockCount();
        for (NiUInt32 ui = 0; ui < uiBlockCount; ui++)
        {
            pOutput[ui] = pInput[ui] * 2;
        }
    }
    NiSPEndKernelImpl(Times2Kernel)

15. Life of a Workflow
   1. Obtain a Workflow from Floodgate
   2. Add Task(s) to the Workflow
   3. Set the Kernel
   4. Add input Streams
   5. Add output Streams
   6. Submit the Workflow
   ... do something else ...
   7. Wait or poll when results are needed

16. Example Workflow

    // Set up input and output streams from existing buffers
    NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS);
    NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS);

    // Get a Workflow and set up a new task for it
    NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow();
    NiSPTask* pTask = pWorkflow->AddNewTask();

    // Set the kernel and streams
    pTask->SetKernel(&Times2Kernel);
    pTask->AddInput(&inputStream);
    pTask->AddOutput(&outputStream);

    // Submit the workflow for execution
    NiStreamProcessor::Get()->Submit(pWorkflow);

    // Do other operations...

    // Wait for the workflow to complete
    NiStreamProcessor::Get()->Wait(pWorkflow);

17. Floodgate Internals
   • Partitioning streams for Tasks
   • Task dependency analysis
   • Platform-specific Workflow preparation
   • Platform-specific execution
   • Platform-specific synchronization

18. Overview of Workflow Analysis
   • Task dependencies are defined by streams
   • Sort tasks into stages of execution
      ◦ Tasks that use results from other tasks run in later stages
      ◦ Stage N+1 tasks depend on the output of Stage N tasks
   • Tasks in a given stage can run concurrently
   • Once a stage has completed, the next stage can run

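The staging rule above can be sketched as a single pass over the tasks: each task lands one stage after the latest producer of any stream it reads. This assumes tasks are visited with producers before consumers; `Task` and `AssignStages` are illustrative names, not Floodgate API:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Each task names the streams it reads and writes; a task that reads
// a stream written by another task must run in a later stage.
struct Task
{
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

// Returns the stage index for each task:
// stage(t) = 1 + max stage over producers of t's inputs,
// or 0 if t reads only external (unproduced) data.
std::vector<int> AssignStages(const std::vector<Task>& tasks)
{
    std::map<std::string, int> producerStage;  // stream -> writer's stage
    std::vector<int> stages(tasks.size(), 0);
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        int stage = 0;
        for (const auto& in : tasks[i].inputs)
        {
            auto it = producerStage.find(in);
            if (it != producerStage.end())
                stage = std::max(stage, it->second + 1);
        }
        stages[i] = stage;
        for (const auto& out : tasks[i].outputs)
            producerStage[out] = stage;
    }
    return stages;
}
```

All tasks that share a stage index have no data dependencies on each other and can be dispatched concurrently; the scheduler waits for a stage to drain before starting the next.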
19. Analysis: Workflow with many Tasks
    [Diagram: Tasks 1-7 connected by Streams A-I plus a Sync point;
    later tasks (e.g. Task 4 reading Streams B and D, Task 6 reading
    Streams G and F) consume streams produced by earlier tasks.]

20. Analysis: Dependency Graph
    [Diagram: the same tasks sorted into Stages 0-3; each task is placed
    one stage after the tasks whose output streams it consumes, ending
    with a Sync task.]

21. Performance Notes
   • Data is broken into blocks -> locality
      ◦ Good cache performance
      ◦ Block size can be optimized for prefetch or DMA transfers
      ◦ Fits in limited local storage (PS3)
   • Easily adapts to the number of cores
      ◦ Can manage interplay with other systems
   • Kernels encapsulate processing
      ◦ Good target for platform-specific optimization
      ◦ Clean solution without #if blocks

22. Usability Notes
   • Automatically manages data dependencies and simplifies synchronization
   • Hides nasty platform-specific details
      ◦ Prefetch, DMA transfers, processor detection, ...
   • Learn one API, use it across platforms
      ◦ Productivity gains
         ▪ Helps us produce quality documentation and samples
      ◦ Eases debugging

23. Exploiting Floodgate in the Engine
   • Find tasks that operate on a single object
      ◦ Skinning, morphing, particle systems, ...
   • Move these to Floodgate: Mesh Modifiers
      ◦ Launch at some point during execution
         ▪ After updating animation and bounds
         ▪ After determining visibility
         ▪ After physics finishes ...
      ◦ Finish them when needed
         ▪ Culling
         ▪ Render
         ▪ etc.

24. Same Applications, New Performance
   • The big win is out-of-the-box performance
      ◦ The same results could be achieved by hand, but at the cost of
        much developer time
      ◦ Hides details of the different platforms (esp. PS3)

                          Before   After
      Skinning Objects    42fps    62fps
      Morphing Objects    12fps    38fps

25. Example CPU Utilization, Morphing
    [Charts: CPU utilization before and after]

26. Thread Profiling, Morphing: Before
   • Some parallelization through a hand-coded parallel update
      ◦ Note the high overhead and roughly 85% serial execution

27. Thread Profiling, Morphing: After
   • Automatic parallelism in the engine
      ◦ 4 threads for Floodgate (4 CPUs)
      ◦ Roughly 50% of the old serial time replaced with 4x parallelism

28. New Issues
   • Within the engine, resource usage peaks at certain times
      ◦ e.g. between visibility culling and rendering
      ◦ Application-level work might fill in the empty spaces
         ▪ Physics, global illumination, ...
   • What about single-processor machines?
   • What about variable-sized output?
      ◦ Instance culling, for example

29. Ongoing Improvements
   • Improved workflow scheduling
      ◦ Mechanisms to enhance application control
   • Optimizing when tasks change
      ◦ Stream lengths change
      ◦ Inputs/outputs are changed
   • More platform-specific improvements
   • Off-loading more engine work

30. Using Floodgate in a Game
   • Identify stream processing opportunities
      ◦ Places where lots of data is processed with local access patterns
      ◦ Places where work can be prepared early but results are not
        needed until later
   • Re-factor to use Floodgate
      ◦ Depending on the task, this can take as little as a few hours
      ◦ The hard part is enforcing locality

31. Future-Proofed?
   • Both CPUs and GPUs can function as stream processors
   • Easily extends to more processing units
   • Potential snags are in application changes

32. Questions?
   • Ask Stephen!
   • Visit Emergent's booth at the show
      ◦ Booth 5716, North Hall, opposite Intel on the central aisle