Threading Successes 06 Allegorithmic


Published in: Technology, Art & Photos

    1. 1. Allegorithmic Substance Threaded Middleware
    2. 2. Procedural textures on multi-core
        • Other than framerate and features, what else can you do with extra CPU power?
        • We’ll look at Allegorithmic’s middleware, Substance.
    3. 3. Procedural textures are valuable for modern games
        • Modern games have a LOT of textures.
        • They want shorter loading times (faster starts, teleportations or zooms).
        • They need to reduce texture size on disc, for download, and/or in RAM.
        • They can benefit from more flexible and reusable assets.
    4. 4. Introducing Substance
        • In Q2 2007 Allegorithmic started a complete reengineering of ProFX2, both authoring tool and engine, under the name Substance.
        • Unit tests were written very early to ensure that Substance could target streaming.
        • Cross-platform: PC, PS3, Xbox, etc.
        • Expected linear multi-thread scalability.
    5. 5. What is Substance?
        • Substance is a middleware product composed of two elements.
        • Substance Authoring Tool lets you
            • create procedural textures
            • create texture packages of a few kilobytes!
            • A cooker compiles generic data into binaries optimized for a specific platform or user.
        • Substance Engine
            • generates bitmap textures on the fly.
    6. 6. Lower FPS?
        • More textures, not lower FPS
            • Substance consumes idle cycles, not frames
        • Graphics bitrates follow Moore's law
            • Higher poly count -> bigger worlds
            • Higher filter rate -> larger textures
            • Desired texture volume grows faster than RAM
        • Streaming is a necessity
            • But HDD net bitrate does not keep up. Bottleneck!
        • Modern gameplay entails sudden bitrate bursts
            • HDD seeks make this worse and cause stalls.
    7. 7. No, a stable and high FPS.
        • Even when masked, a stall is actually an FPS drop
        • Substance works in Random Access Memory
        • The gamer zooms or teleports:
            • Give 4 cores and a GPU to Substance
            • Sacrifice 1 or 2 frames
            • Substance generates and caches 1-2M new texels.
            • The stall does not hinder gameplay.
        • Substance diminishes stalls
        • Substance helps to maintain a high FPS.
    8. 8. Performance issue: streaming in games
        • DVD or HDD net bitrate is 2 or 6 MB/s
        • Our aim: add a stable 4 MB/s without the GPU
        • This requires billions of intermediate pixels/s.
        • Can CPUs compete with GPUs?
        • Opportunity: cores are still under-exploited in most game engines.
        • Texture processing is privileged in the new multi-core architectures.
    9. 9. The architecture was designed with these issues in mind:
        • Homogeneous CPU and GPU versions
        • Streaming (~1-10 CPU cycles per pixel)
        • SIMD & MT for the multi-core generations
        • No cache or threading pollution
        • Fine-grained jobs and lockless sync.
        • Low memory footprint
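To relate the "~1-10 CPU cycles per pixel" budget above to the "billions of intermediate pixels/s" mentioned on the previous slide, here is a small back-of-the-envelope sketch. The 3 GHz clock is an assumption for illustration and does not come from the slides; the function name is hypothetical, and the result is only an upper bound that ignores memory stalls and synchronization.

```python
def pixels_per_second(clock_hz: float, cycles_per_pixel: float, cores: int = 1) -> float:
    """Upper bound on pixel throughput, ignoring memory stalls and sync overhead."""
    return clock_hz * cores / cycles_per_pixel


CLOCK_HZ = 3.0e9  # ASSUMPTION: a 3 GHz core; this figure is not from the slides

if __name__ == "__main__":
    for cycles in (1, 10):  # the two ends of the slide's 1-10 cycles/pixel range
        rate = pixels_per_second(CLOCK_HZ, cycles)
        print(f"{cycles:>2} cycles/pixel: {rate / 1e9:.1f} G pixels/s per core")
```

Under that assumed clock, the budget corresponds to somewhere between a few hundred million and a few billion pixels per second per core, which is the order of magnitude the deck is aiming for.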
    10. 10. The theoretical benefit was calculated
        • New architectures come with enhanced SIMD. Expected x10 compared to standard C++
        • Tricks and algorithmic changes could give another x10 on some filters, like DXT
        • We were confident that our image processes could be threaded well, partly because we generate textures asynchronously
        • Hence the CPU version of ProFX2 could be accelerated by a factor of x25-x100
    11. 11. This is the approach taken to address the issue:
        • Simple inner-loop tests showed that optimized SSE2-4 code could indeed give a x10 boost
        • Find a data layout coherent with micro-parallelism (SIMD and pipeline), low-level threading, cache and memory handling.
        • OpenMP is then used to test strategies before designing a specific MT HAL
    12. 12. Here’s the code that was developed to make this possible:
        • A SIMD HAL is ready for PC, Xbox, PS3.
        • OpenMP easily gives an 85% MT linearity.
        • Our MT HAL is converging towards a model of lockless synchronization, with 95% expected.
        • The cooker precomputes data that will help synchronization and MT efficiency.
        • Our API exposes asynchronous commands. Perfect for sharing cores with a game loop!
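The idea of asynchronous commands sharing cores with a game loop can be sketched as follows. This is not the Substance API: `AsyncTextureEngine`, `request`, `generate_texture` and `run_frames` are hypothetical names, and the "texture" is a dummy byte pattern. The sketch only shows the pattern of submitting a request, continuing to run frames, and picking up the result once it is ready.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def generate_texture(name: str, size: int) -> bytes:
    """Stand-in for a texture generation job: returns size*size dummy texels."""
    return bytes((x ^ y) & 0xFF for y in range(size) for x in range(size))


class AsyncTextureEngine:
    """Hypothetical wrapper: requests return immediately with a Future."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def request(self, name: str, size: int) -> Future:
        return self._pool.submit(generate_texture, name, size)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def run_frames(engine: AsyncTextureEngine, max_frames: int = 10_000) -> dict:
    """Toy game loop: polls pending requests once per frame, never blocks on them."""
    pending = {"rock": engine.request("rock", 64)}
    ready = {}
    for _ in range(max_frames):
        # Game update and rendering would happen here.
        for name in [n for n, f in pending.items() if f.done()]:
            ready[name] = pending.pop(name).result()
        if not pending:
            break
        time.sleep(0.001)  # stand-in for the rest of the frame
    return ready


if __name__ == "__main__":
    engine = AsyncTextureEngine()
    textures = run_frames(engine)
    engine.shutdown()
    print({name: len(data) for name, data in textures.items()})
```

The loop only checks `done()` and never waits on a request, so a slow generation job delays the texture, not the frame.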
    13. 13. The compositing graph, node based image processing
        • Authoring Tool: non-linear editing
        • Engine: efficient high-level structure
        • Graph (DAG) contains 3 types of nodes:
            • Sources: procedural noise, bitmaps, SVGs
            • Filters: blend, HSL, TRS, warp, blur, etc.
            • Outputs: coherent diffuse & normal maps, etc.
        • Main advantages:
            • Libraries, capsules: instantiation of subgraphs
            • Complex variants: fast to create and compute
            • Dynamic custom branches (e.g. aging textures)
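A compositing graph of sources, filters and outputs can be sketched as a small DAG evaluated in dependency order. This is a minimal illustration, not the engine's data structure: the node names, the toy one-dimensional "images" and the `evaluate` helper are all assumptions, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for whatever ordering the cooker precomputes.

```python
from graphlib import TopologicalSorter

N = 8  # toy "image": a flat list of N float pixels

# Each node maps to (function, list of input node names).
graph = {
    # Sources
    "noise": (lambda: [((i * 37) % 11) / 10.0 for i in range(N)], []),
    "gradient": (lambda: [i / (N - 1) for i in range(N)], []),
    # Filters
    "blend": (lambda a, b: [(x + y) / 2 for x, y in zip(a, b)], ["noise", "gradient"]),
    "invert": (lambda a: [1.0 - x for x in a], ["blend"]),
    # Outputs (two maps derived from the same blend, so they stay coherent)
    "diffuse": (lambda a: a, ["blend"]),
    "height": (lambda a: a, ["invert"]),
}


def evaluate(graph: dict, outputs: list) -> dict:
    """Evaluate every node after its inputs and return the requested outputs."""
    deps = {name: inputs for name, (_, inputs) in graph.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        fn, inputs = graph[name]
        results[name] = fn(*(results[i] for i in inputs))
    return {name: results[name] for name in outputs}


if __name__ == "__main__":
    maps = evaluate(graph, ["diffuse", "height"])
    print(maps["diffuse"])
    print(maps["height"])
```

Because both outputs hang off the same `blend` node, changing a source changes both maps consistently, which is the sense in which a graph yields coherent variants.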
    14. 14. The compositing graph, node based image processing
    15. 15. Threading strategies
        • High-level threading:
            • Task decomposition: 1 node (filter) per thread
            • Graph splitting ensures task independence
        • Low-level threading:
            • Data decomposition: 1 strip of blocks per thread
            • Dispatcher ensures non-conflicting areas
            • Pixel-to-pixel filters are concatenated.
            • Streamed R/W, no L2 cache pollution
            • Temporary blocks in private L1 double buffers
            • Intermediate images never allocated
            • Lockless reactive sync and cache friendly
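The low-level strategy (one strip per thread, with pixel-to-pixel filters concatenated so no intermediate image is allocated) can be sketched as below. This is only an illustration of the decomposition: the helper names are hypothetical, rows stand in for "strips of blocks", and CPython threads will not show the speedup a native SIMD implementation would, because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(*pixel_filters):
    """Concatenate pixel-to-pixel filters into one function applied per pixel."""
    def fused(pixel):
        for f in pixel_filters:
            pixel = f(pixel)
        return pixel
    return fused


def split_rows(height: int, strips: int) -> list:
    """Split [0, height) into contiguous, non-overlapping (start, end) row ranges."""
    base, extra = divmod(height, strips)
    bounds, start = [], 0
    for i in range(strips):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def process_in_strips(image: list, pixel_filter, strips: int = 4) -> list:
    """Apply pixel_filter to every pixel, one strip of rows per worker thread."""
    out = [None] * len(image)

    def work(bounds):
        start, end = bounds
        for y in range(start, end):  # each worker writes only its own rows
            out[y] = [pixel_filter(p) for p in image[y]]

    with ThreadPoolExecutor(max_workers=strips) as pool:
        list(pool.map(work, split_rows(len(image), strips)))
    return out


if __name__ == "__main__":
    image = [[x + y for x in range(5)] for y in range(10)]
    brighten = lambda p: p + 10
    clamp = lambda p: min(p, 20)
    print(process_in_strips(image, fuse(brighten, clamp), strips=3))
```

Since strips never overlap, workers need no lock on the output, and the fused filter touches each pixel once instead of writing a brightened image and then reading it back to clamp it.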
    16. 16. Threading sub graphs (1/11) by nodes (high level)
    17. 17. Threading sub graphs (2/11) by nodes, caching
    18. 18. Threading sub graphs (3/11) by nodes
    19. 19. Threading sub graphs (4/11) by strips (low level)
    20. 20. Threading sub graphs (5/11) remove from cache
    21. 21. Threading sub graphs (6/11) by strips
    22. 22. Threading sub graphs (7/11) remove from cache
    23. 23. Threading sub graphs (8/11) by strips
    24. 24. Threading sub graphs (9/11) remove from cache
    25. 25. Threading sub graphs (10/11) by strips
    26. 26. Threading sub graphs (11/11) update cache, and finished
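The slides in this sequence are image-only, so the exact caching policy is not recoverable from the transcript. As one plausible reading of "remove from cache", the sketch below drops an intermediate result as soon as its last consumer has run. The reference-counting rule, the function name and the toy graph are all assumptions, not the engine's documented behavior.

```python
from graphlib import TopologicalSorter


def evaluate_with_release(graph: dict, outputs: list):
    """Evaluate a DAG of name -> (fn, inputs); free intermediates after last use.

    Returns (requested outputs, peak number of results held at once).
    """
    deps = {name: inputs for name, (_, inputs) in graph.items()}
    consumers_left = {name: 0 for name in graph}
    for inputs in deps.values():
        for dep in inputs:
            consumers_left[dep] += 1

    cache, peak = {}, 0
    for name in TopologicalSorter(deps).static_order():
        fn, inputs = graph[name]
        cache[name] = fn(*(cache[i] for i in inputs))
        peak = max(peak, len(cache))
        for dep in inputs:
            consumers_left[dep] -= 1
            # ASSUMED policy: evict once nothing downstream still needs it.
            if consumers_left[dep] == 0 and dep not in outputs:
                del cache[dep]
    return {name: cache[name] for name in outputs}, peak


if __name__ == "__main__":
    chain = {
        "a": (lambda: 1, []),
        "b": (lambda x: x + 1, ["a"]),
        "c": (lambda x: x * 2, ["b"]),
        "d": (lambda x: x - 3, ["c"]),
    }
    print(evaluate_with_release(chain, ["d"]))
```

On a four-node chain only two results are ever alive at the same time, which is the kind of low memory footprint the earlier architecture slide asks for.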
    27. 27. Expect more streaming bandwidth
        • Substance generates 4 MB/s of compressed textures
        • Combine this with classical streaming
        • 50+ MB/s loading with 4 cores and 1 GPU
    28. 28. Here’s how close we got to the theoretical best performance:
        • DXT compression at 2G pixels/s (the same as what high-end GPUs can do in 2007).
        • 8-bit SVG (cooked) rendering at 20G/s. 8G/s anti-aliasing with 4 sub-samples.
        • In most cases 4 cores give a x3.8 boost
        • Some filters are more problematic, but solutions have been designed in detail and will be implemented between Q2 and Q4 2008.
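The x3.8 figure on 4 cores can be compared with the "85%" and "95%" linearity figures quoted earlier in the deck, if "MT linearity" is read as parallel efficiency, that is, speedup divided by core count. That reading is an assumption, as are the function names in this short sketch.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores


def expected_speedup(efficiency: float, cores: int) -> float:
    """Speedup implied by a given efficiency on a given number of cores."""
    return efficiency * cores


if __name__ == "__main__":
    print(f"x3.8 on 4 cores: {parallel_efficiency(3.8, 4):.0%} efficiency")
    print(f"85% on 4 cores:  x{expected_speedup(0.85, 4):.1f}")
```

Under that reading, the reported x3.8 on 4 cores corresponds to the 95% target, while the 85% OpenMP figure would correspond to roughly x3.4.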
    29. 29. Here’s the new performance profile:
        • Substance and ProFX2 figures are for one core.
        • 4 cores: 3.8 times more fillrate.
        • ProFX2: SVG GPU
        • Substance: SVG CPU
        • SVG AA: 2G pixels/s per core
    30. 30. This is future-proofed
        • The cooker precomputes whatever helps to linearise computations.
        • Scalable code: SSE4 was added in one day thanks to the SIMD HAL
        • Scalable threading: our two strategies scale
        • A few functions dispatch virtual CPU "shaders"
        • 64-core ready ↔ code a new dispatcher?
        • Multiplatform design.
    31. 31. What’s next?
    32. 32. Procedural diffuse map
    33. 33. Coherent procedural normal map
    34. 34. Complex procedural environment map
    35. 35. This scene is made entirely of procedural textures
    36. 36. Future sources of bandwidth
        • SIMD code can be better pipelined in ASM.
        • Our cooker can optimize a lot of things.
        • The authoring tool will have a real-time profiler
        • Artists gaining experience with Substance will also optimize their packages better.
        • Artist feedback will also help us to improve the expressiveness of each filter
        • ~30-50 filters per texture: the main performance divisor.
    37. 37. Here’s how you can best take advantage of procedural textures
        • Anticipate texture generation requests.
        • Predict visibility (HOM, PVS).
        • Create mipmaps. Access levels JIT.
        • Cache the useful texels.
        • Adapt texture resolution to workload.
        • Use texture variants, fewer tiling textures or details. Show a higher texel/pixel ratio.
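Two of the tips above, accessing mip levels just in time and caching the useful texels, can be sketched together. This is an illustrative sketch under assumptions: textures are square with power-of-two sizes, the level is chosen so that there is at least one texel per screen pixel, and the cache uses a least-recently-used policy. `mip_level` and `TexelCache` are hypothetical names, not part of any Substance API.

```python
import math
from collections import OrderedDict


def mip_level(texture_size: int, pixels_on_screen: float) -> int:
    """Coarsest mip level that still provides at least one texel per screen pixel."""
    max_level = texture_size.bit_length() - 1  # assumes a power-of-two size
    if pixels_on_screen <= 0:
        return max_level  # not visible: the smallest level is enough
    level = math.floor(math.log2(texture_size / pixels_on_screen))
    return min(max(level, 0), max_level)


class TexelCache:
    """LRU cache of generated (texture, level) pairs; generation happens on a miss."""

    def __init__(self, capacity: int, generate) -> None:
        self._capacity = capacity
        self._generate = generate
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, name: str, level: int):
        key = (name, level)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._generate(name, level)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return value


if __name__ == "__main__":
    cache = TexelCache(capacity=2, generate=lambda name, level: f"{name}@mip{level}")
    level = mip_level(1024, 300)
    print(level, cache.get("rock", level), cache.misses)
```

A texture that covers 300 pixels on screen only needs the 512-texel level of a 1024-texel texture, so generating and caching that level costs a quarter of the texels of the full-resolution map.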
    38. 38. What do you think?
        • Have you tried something like this?
        • Have you rejected trying something like this?