RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions | SIGGRAPH 2019 Technical Sessions

This talk focuses on the newest release, RenderMan* 22.5, and its adoption at Pixar Animation Studios* for rendering future movies. With native support for Intel® Advanced Vector Extensions, Intel® Advanced Vector Extensions 2, and Intel® Advanced Vector Extensions 512, it includes enhanced library features, debugging support, and an extensive test framework.

  1. From RenderMan® 22.0 to next-gen RenderMan XPU and beyond: Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions (Intel® AVX-512). Presenters: Steena Monteiro (Intel) and Max Liani (Pixar Animation Studios). Contributors: Alex M. Wells (Intel), Steena Monteiro (Intel), Louis Feng (Intel), Max Liani (Pixar Animation Studios), Stephen Friedman (Pixar Animation Studios), Larry Gritz (Sony Pictures Imageworks).
  2. Legal Disclaimers and Optimization Notices • This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. • Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. • Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. • Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. • Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. • Intel, Xeon and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. • *Other names and brands may be claimed as the property of others. • © Intel Corporation.
  3. Shading in Physically Based Rendering (image credit: Sony Pictures Imageworks)
  4. Shading Network • Multiple reusable shading nodes • Connect nodes to define complex materials • Production shading networks can grow very large, to hundreds or thousands of nodes
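As a rough illustration of the node-network idea, the sketch below wires a few hypothetical nodes into a tiny network and evaluates the output node by pulling values from its upstream connections. The `Node` type, `make_node`, `constant` and the node names are invented for this example; real OSL networks are assembled through OSL's own API and compiled as a whole, not interpreted node by node like this.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shading node: produces one float from the values of its
// upstream (input) nodes. Real OSL nodes have typed, named parameters.
struct Node {
    std::string name;
    std::vector<std::shared_ptr<Node>> inputs;           // upstream connections
    std::function<float(const std::vector<float>&)> op;  // what the node computes

    // Pull-based evaluation: evaluate every input first, then this node.
    // (No caching, so a node shared by two consumers is evaluated twice.)
    float evaluate() const {
        std::vector<float> args;
        args.reserve(inputs.size());
        for (const auto& in : inputs) args.push_back(in->evaluate());
        return op(args);
    }
};

using NodePtr = std::shared_ptr<Node>;

// Helper: build a node from a name, its upstream nodes, and its operation.
inline NodePtr make_node(std::string name, std::vector<NodePtr> inputs,
                         std::function<float(const std::vector<float>&)> op) {
    return std::make_shared<Node>(Node{std::move(name), std::move(inputs), std::move(op)});
}

// Helper: a source node with no inputs that outputs a constant.
inline NodePtr constant(std::string name, float value) {
    return make_node(std::move(name), {}, [value](const std::vector<float>&) { return value; });
}
```

For example, a "multiply" node fed by constants 0.25 and 2.0 evaluates to 0.5, and an "add" node that reuses the 0.25 constant as a second input evaluates to 0.75, showing how one node can feed several others.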
  5. C++ Shader Limitations • Lack of context at compile time: the input parameters, the geometry being shaded, the mode of shading and the surrounding shading network are all unknown • Branchy run-time testing required • Lack of portability • Requires “performance ninjas” (image credit: Ninja Working at Desk from Vector.me, by Hector Gomez)
  6. Open Shading Language • Developed by Sony Pictures Imageworks* • C-like DSL (domain-specific language) for programmable shading • API to connect shaders into networks • Open source: http://github.com/imageworks/OpenShadingLanguage • Sci-Tech Award* in 2017 (logo owned by the Academy of Motion Picture Arts and Sciences)
  7. Poster images © Sony Pictures*, Paramount*, Warner Brothers*, Disney*, Fox*, Universal*
  8. Example OSL Shader:

          shader marble (color Cin = .5, float freq = 1.0, output color Cout = 0)
          {
              float sum = 0;
              float freqVal = freq;
              point Pshad = transform("object", P);
              for (int i = 0; i < 6; i++) {
                  sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
                  freqVal = 2 * freqVal;
              }
              Cout = Cin * sum;
          }

      Annotations: shader globals such as P are inputs set by the renderer; calls such as transform(), noise() and abs() are library calls.
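To make the loop in the marble shader concrete, here is a C++ transcription of its six-octave sum. OSL's built-in `noise()` is not available in plain C++, so the noise function is passed in as a parameter; the `Vec3` struct, `scale` and `marble_sum` names are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

inline Vec3 scale(const Vec3& v, float s) { return {v.x * s, v.y * s, v.z * s}; }

// Mirrors the OSL loop: sum += 1/freq * |0.5 - noise(4 * freq * P)|, with the
// frequency doubling on each of the six octaves. 'noise' stands in for OSL's
// noise() and is expected to return values in [0, 1].
inline float marble_sum(const Vec3& Pshad,
                        const std::function<float(const Vec3&)>& noise,
                        float freq = 1.0f) {  // matches the shader's default 'freq'
    float sum = 0.0f;
    float freqVal = freq;
    for (int i = 0; i < 6; ++i) {
        sum += 1.0f / freqVal * std::fabs(0.5f - noise(scale(Pshad, 4.0f * freqVal)));
        freqVal *= 2.0f;
    }
    return sum;  // the shader then computes Cout = Cin * sum
}
```

With a noise function that always returns 0, each octave contributes 0.5 / freq, so the sum is 0.5 × (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 0.984375; a noise that always returns 0.5 gives 0.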
  9. Motivation for SIMD Open Shading Language: In its native form, OSL is unable to leverage Intel® Advanced Vector Extensions 512 (Intel® AVX-512) on Intel® Xeon® processors. Intel has been leading the re-architecture of OSL since 2016. Image © Disney/Pixar
  10. OSL Framework: A shader written in OSL is compiled by oslc, the offline compiler, into intermediate OSO (instructions + operands). The renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*) handles scene management, ray tracing/path tracing and light integration, and uses the OSL runtime to build the shading network, execute it per point and query outputs; the runtime reaches back into the renderer through callbacks. At render time, the network is optimized and just-in-time (JIT) compiled with LLVM* into optimized x86-64 code that uses precompiled library functions.
  11. SIMD OSL’s Batched Interface: In the existing interface, the renderer submits a single point with execute(ShaderGlobals, …) and queries results with symbol_address(…). The new “batched” interface submits a batch of points with execute_batch(ShaderGlobalsBatch, …) and queries a batch of results through Wide<T>(symbol_address). A ShaderGlobalsBatch holds uniform data (context pointers, ray type, …) and a queue of varying data (surface position, incident ray, surface normal, …).
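The sketch below shows the general shape of a batch that separates uniform data (shared by every point) from varying data stored structure-of-arrays, one slot per SIMD lane. The field names, the width of 16 and the `BatchGlobals`/`WideVec3` types are illustrative assumptions, not OSL's actual `ShaderGlobalsBatch` layout.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 16;  // assumed batch width: 16 floats fill one 512-bit register

// Varying 3-component data in SOA form: all x's, then all y's, then all z's.
struct WideVec3 {
    std::array<float, kWidth> x{}, y{}, z{};
};

// Hypothetical batch of shading points.
struct BatchGlobals {
    // Uniform: identical for every point in the batch.
    void* context = nullptr;
    int raytype = 0;
    // Varying: one value per lane.
    WideVec3 P;  // surface position
    WideVec3 I;  // incident ray direction
    WideVec3 N;  // surface normal
    int size = 0;  // number of points queued so far

    // Append one point's position (I and N omitted for brevity);
    // returns false when the batch is already full.
    bool push(float px, float py, float pz) {
        if (size == kWidth) return false;
        P.x[size] = px;
        P.y[size] = py;
        P.z[size] = pz;
        ++size;
        return true;
    }
};
```

Because each component is contiguous across lanes, a vectorized shader can load `P.x` for all 16 points with one wide load instead of gathering from 16 separate structs.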
  12. SIMD OSL Framework: The renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*), which handles scene management, ray tracing/path tracing and light integration, calls the SIMD OSL runtime to execute the shading network and query outputs, with callbacks into the renderer. Render-time optimization uses LLVM* wide JIT (just-in-time) compilation to generate code optimized for Intel® AVX-512, AVX2 or AVX, backed by precompiled library functions built separately for Intel® AVX-512, Intel® AVX2 and Intel® AVX.
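One simple way to choose among per-ISA precompiled libraries is a priority check on the CPU's features. In this sketch `CpuFeatures`, `select_shading_library` and the returned names are hypothetical, and the scalar fallback is an assumption; on GCC or Clang the flags could be filled in with `__builtin_cpu_supports("avx512f")`, `("avx2")` and `("avx")`, but detection is left out so the selection logic stays portable and testable.

```cpp
#include <cassert>
#include <string>

// Feature flags a renderer might detect at startup (hypothetical struct).
struct CpuFeatures {
    bool avx512f = false;
    bool avx2 = false;
    bool avx = false;
};

// Prefer the widest instruction set the CPU supports; fall back to scalar code.
inline std::string select_shading_library(const CpuFeatures& f) {
    if (f.avx512f) return "avx512";
    if (f.avx2) return "avx2";
    if (f.avx) return "avx";
    return "scalar";
}
```

Checking from widest to narrowest matters because a CPU that supports AVX-512 also reports AVX2 and AVX.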
  13. Components in SIMD OSL: render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64 code; the Wide library. (Wizard of Oz castle clipart: https://www.clipart.email/clipart/wizard-of-oz-castle-clipart-18891.html)
  14. SIMD OSL’s Wide Library: accessors give a transparent AOS (array-of-structures) view of SOA (structure-of-arrays) data.

          void my_callback(void *wS, void *wM, void *wV, void *wVS, void *wVT, unsigned int mask_value)
          {
              Mask mask(mask_value);              // one bit per lane
              ASSERT(mask.any_on());
              Wide<const float> wScale(wS);       // read-only SOA inputs
              Wide<const Vec3> wVec(wV);
              Wide<const Matrix44> wMat(wM);
              Masked<Vec3> wVT_result(wVT, mask); // masked SOA outputs
              Masked<Vec3> wVS_result(wVS, mask);
              for (int lane = 0; lane < __OSL_WIDTH; ++lane) {
                  Vec3 V = wVec[lane];
                  float F = wScale[lane];
                  Matrix44 M = wMat[lane];
                  wVS_result[lane] = V * F;
                  wVT_result[lane] = transform(M, V);
              }
          }
  15. SIMD OSL’s Wide Library (same callback as slide 14): wVec[lane], wScale[lane] and wMat[lane] extract the data for one lane of the SOA.
  16. SIMD OSL’s Wide Library (same callback as slide 14): the array subscript returns a proxy object for that lane.
  17. SIMD OSL’s Wide Library (same callback as slide 14): assigning through a Masked proxy, as in wVS_result[lane] = V*F, skips the assignment if that lane is masked off.
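The accessor idea can be modeled in a few dozen lines: a `Wide`-style view indexes SOA storage by lane, and a `Masked`-style view returns a per-lane proxy whose assignment is dropped for inactive lanes. The `LaneMask`, `WideView`, `MaskedView` and `scale_callback` names are simplified stand-ins written for this sketch (handling only `float`), not OSL's own `Wide<T>`/`Masked<T>` classes.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWidth = 16;

// One bit per lane; bit i set means lane i is active.
struct LaneMask {
    uint32_t bits;
    bool is_on(int lane) const { return (bits >> lane) & 1u; }
    bool any_on() const { return bits != 0; }
};

// Read-only view of SOA float data: lane i lives at data[i].
struct WideView {
    const float* data;
    float operator[](int lane) const { return data[lane]; }
};

// Writable view that respects the mask.
class MaskedView {
public:
    MaskedView(float* data, LaneMask mask) : data_(data), mask_(mask) {}

    // Proxy for a single lane; assignment is skipped when that lane is off.
    class LaneRef {
    public:
        LaneRef(float* slot, bool active) : slot_(slot), active_(active) {}
        LaneRef& operator=(float value) {
            if (active_) *slot_ = value;
            return *this;
        }
    private:
        float* slot_;
        bool active_;
    };

    LaneRef operator[](int lane) { return LaneRef(data_ + lane, mask_.is_on(lane)); }

private:
    float* data_;
    LaneMask mask_;
};

// Scales every active lane, in the spirit of the slide's callback.
inline void scale_callback(const float* in, float* out, float s, uint32_t mask_bits) {
    LaneMask mask{mask_bits};
    assert(mask.any_on());
    WideView win{in};
    MaskedView wout(out, mask);
    for (int lane = 0; lane < kWidth; ++lane) wout[lane] = win[lane] * s;
}
```

The loop body reads like ordinary per-element code; the masking lives entirely in the proxy, which is what lets callback authors ignore inactive lanes.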
  18. Components in SIMD OSL: render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64 code; the Wide library; divergent control flows.
  19. Divergent Control Flows: a stack of masks tracks the nested conditions, and the effective mask is the result of combining the masks on the stack.

          if (x > 0.5) {
              ...
              if (y > 0.5) {
                  ...
                  if (powB > 0.23) {
                      ...
                  } else {
                      ...
                  }
              } // y
          } // x
  20. Divergent Control Flows (same code as slide 19): PUSH. Entering if (x > 0.5) pushes its mask onto the stack, and the effective mask is recomputed from the stack.
  21. Divergent Control Flows: PUSH. Entering if (y > 0.5) pushes a second mask.
  22. Divergent Control Flows: PUSH. Entering if (powB > 0.23) pushes a third mask.
  23. Divergent Control Flows: POP. Leaving the powB “then” branch pops its mask.
  24. Divergent Control Flows: NEGATE and PUSH. The else branch pushes the negated powB condition.
  25. Divergent Control Flows: POP. Leaving the else branch pops its mask.
  26. Divergent Control Flows: POP. Leaving the y block pops its mask.
  27. Divergent Control Flows: POP. Leaving the x block pops the last pushed mask, restoring the original effective mask.
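The push/negate/pop sequence in these slides can be modeled with a small mask stack whose effective mask is the bitwise AND of everything on it. The `MaskStack` class and its methods are a sketch for illustration, assuming 16 lanes and AND as the combining rule; in SIMD OSL this bookkeeping happens in JIT-generated code, so treat this as a model rather than the implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit i = lane i (16 lanes assumed).
class MaskStack {
public:
    explicit MaskStack(uint16_t initial) { stack_.push_back(initial); }

    // Entering an 'if': push the per-lane condition.
    void push(uint16_t condition) { stack_.push_back(condition); }

    // Entering an 'else' (after popping the 'then' mask): push the negated condition.
    void push_negated(uint16_t condition) {
        stack_.push_back(static_cast<uint16_t>(~condition));
    }

    // Leaving a block; the initial mask is never popped.
    void pop() {
        assert(stack_.size() > 1);
        stack_.pop_back();
    }

    // Effective mask: AND of every mask on the stack.
    uint16_t effective() const {
        uint16_t m = 0xFFFF;
        for (uint16_t s : stack_) m &= s;
        return m;
    }

private:
    std::vector<uint16_t> stack_;
};
```

Replaying the slides with per-lane conditions x = 0x00FF, y = 0x0F0F and powB = 0x3333: the effective mask goes 0x00FF, 0x000F, 0x0003, then 0x000C in the else branch, and returns to 0xFFFF once all blocks are closed.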
  28. Components in SIMD OSL: render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64 code; the Wide library; divergent control flow; vectorized IR generation.
  29. General LLVM Code Flow for OSL Operations (OSL): retrieve symbols for operands → emit LLVM-defined operations or call appropriate functions → store result.
  30. What changes in SIMD OSL: The OSL flow is: retrieve symbols for operands, load values, initialize values, emit LLVM-defined operations or call appropriate functions, store result. In SIMD OSL, each operand and result may be uniform or varying, giving four cases: operands uniform → results uniform; operands uniform → results varying; operands varying → results uniform; operands varying → results varying.
  31. What changes in SIMD OSL (operands uniform → results uniform): retrieve symbols for operands, call the uniform function, store the result.
  32. What changes in SIMD OSL (operands uniform → results varying): retrieve symbols for operands, call the uniform function, widen the result, store the result.
  33. What changes in SIMD OSL (operands varying → results varying): retrieve symbols for operands, add the effective mask to the arguments, add the address for the results to the arguments, call the varying function.
  34. What changes in SIMD OSL (operands uniform and varying → results varying): retrieve symbols for operands, allocate a varying temporary, widen the uniform operands and store them to the varying temporary, add the effective mask to all arguments, add the address for the results to the arguments, call the varying function.
  35. What changes in SIMD OSL (operands varying → results uniform): unreachable, since a result computed from varying operands is itself treated as varying.
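For the mixed case (uniform and varying operands, varying result), the steps listed above amount to broadcasting the uniform operand into a varying temporary and then calling the masked varying function with the result address. Below is a C++ model of that for a hypothetical `add` operation; `Varying`, `widen`, `add_varying` and `add_uniform_varying` are names made up for this sketch, and in SIMD OSL these steps are emitted as LLVM IR rather than written by hand.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWidth = 16;
using Varying = std::array<float, kWidth>;  // one value per lane

// "Widen" a uniform value: replicate it into every lane of a varying temporary.
inline Varying widen(float uniform) {
    Varying v;
    v.fill(uniform);
    return v;
}

// Varying function: receives the effective mask and the address for the result,
// and writes only the lanes whose mask bit is set.
inline void add_varying(uint32_t mask, Varying* result, const Varying& a, const Varying& b) {
    for (int lane = 0; lane < kWidth; ++lane)
        if ((mask >> lane) & 1u) (*result)[lane] = a[lane] + b[lane];
}

// Uniform operand 'u' plus varying operand 'v' -> varying result.
inline void add_uniform_varying(uint32_t mask, Varying* result, float u, const Varying& v) {
    const Varying tmp = widen(u);  // varying temporary holding the widened uniform
    add_varying(mask, result, tmp, v);
}
```

Widening lets a single varying implementation serve every operand combination, at the cost of a broadcast for each uniform argument.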
  36. Components in SIMD OSL: render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64 code; the Wide library; divergent control flow; vectorized IR generation; the “for-each-unique” algorithm.
  37. For-Each-Unique Algorithm: [Figure: per-lane values of layer, file, Mask and wrap across the batch]

          if (layer == 1) file = "r.tex";
          if (layer == 2) file = "g.tex";
          if (layer == 3) file = "r.tex";
          if (layer == 4) file = "g.tex";
          wrap_mode = (layer % 2 == 0) ? "clamp" : "mirror";
          texture(file, u, v, "wrap", wrap_mode);
  38. For-Each-Unique Algorithm (same code as slide 37): JIT’d binning groups the active lanes by their unique argument values.
  39. For-Each-Unique Algorithm: full flexibility for BatchedRendererServices. 1st pass: texture("r.tex", "mirror", …) for the lanes that share those arguments.
  40. For-Each-Unique Algorithm: after the 1st pass, texture("r.tex", "mirror", …), the 2nd pass calls texture("g.tex", "clamp", …) for the remaining lanes.
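The for-each-unique idea can be modeled as: take the value from the lowest still-active lane, gather every active lane that shares it, make one call with that value as a uniform argument and the matching lanes as the mask, then retire those lanes and repeat. This C++ model uses made-up names (`TexKey`, `for_each_unique`) and a callback in place of a `BatchedRendererServices` texture call; SIMD OSL performs the binning in JIT-generated code instead.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Per-lane texture arguments that must be uniform for the callee.
struct TexKey {
    std::string file;
    std::string wrap;
    bool operator==(const TexKey& o) const { return file == o.file && wrap == o.wrap; }
};

// Calls 'fn' once per distinct key among the active lanes, passing the key and
// the sub-mask of lanes sharing it. Returns the number of calls (passes).
// Precondition: every set bit in 'mask' indexes a valid entry of 'lanes'.
inline int for_each_unique(const std::vector<TexKey>& lanes, uint32_t mask,
                           const std::function<void(const TexKey&, uint32_t)>& fn) {
    int passes = 0;
    while (mask != 0) {
        int first = 0;
        while (((mask >> first) & 1u) == 0) ++first;  // lowest active lane
        assert(first < static_cast<int>(lanes.size()));
        const TexKey& key = lanes[first];
        uint32_t same = 0;
        for (int lane = first; lane < static_cast<int>(lanes.size()); ++lane)
            if (((mask >> lane) & 1u) && lanes[lane] == key) same |= (1u << lane);
        fn(key, same);   // one call with uniform arguments for this bin
        mask &= ~same;   // retire the lanes just handled
        ++passes;
    }
    return passes;
}
```

With the slide's mapping (layers 1 and 3 use "r.tex" with "mirror", layers 2 and 4 use "g.tex" with "clamp"), any mix of layer values collapses to at most two passes, matching the two texture calls shown above.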
  41. Components in SIMD OSL: render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86 code; the Wide library; divergent control flows; vectorized IR generation; the “for-each-unique” algorithm; SIMD OSL built-ins.
  42. SIMD OSL’s Perlin Noise. Slide labels: scalar computation with scalar data types; block vectorization with intrinsics; explicit outer-loop vectorization (Intel® C++ Compiler, Clang 5+). The scalar version:

          inline void operator() (float &result, const Vec3 &p) const {
              HashScalar h;
              perlin(result, h, p.x, p.y, p.z);
              result = 0.5f * (result + 1.0f);
          }

      The wide version, using explicit outer-loop vectorization:

          template<int WidthT>
          void operator() (MaskedAccessor<float, WidthT> wresult,
                           ConstWideAccessor<Vec3, WidthT> wp) const
          {
              #pragma forceinline recursive
              {
                  #pragma omp simd simdlen(WidthT)
                  for (int l = 0; l < WidthT; ++l) {
                      Vec3 p = wp[l];
                      float perlinResult;
                      HashScalar h;
                      perlin_scalar(perlinResult, h, p.x, p.y, p.z);
                      float scaledResult = 0.5f * (perlinResult + 1.0f);
                      wresult[l] = scaledResult;
                  }
              }
          }
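The outer-loop pattern on this slide, a scalar routine called inside a `#pragma omp simd` loop over lanes, works with any scalar function. In this sketch `remap_scalar` is a hypothetical stand-in for `perlin_scalar` (it only remaps `std::sin` into [0, 1]) and plain arrays replace OSL's accessors. The pragma is honored by compilers with OpenMP SIMD support (for example GCC or Clang with `-fopenmp-simd`) and otherwise ignored, so the code remains correct as an ordinary loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kWidth = 16;

// Hypothetical scalar kernel standing in for perlin_scalar(): maps [-1, 1] -> [0, 1].
inline float remap_scalar(float x, float y, float z) {
    float n = std::sin(x + y + z);  // placeholder "noise" in [-1, 1]
    return 0.5f * (n + 1.0f);
}

// Outer-loop vectorization: the loop runs over lanes and each iteration calls the
// scalar kernel, so the compiler may inline it and emit SIMD code for the body.
inline void remap_wide(float* result, const float* px, const float* py, const float* pz,
                       uint32_t mask) {
#pragma omp simd simdlen(16)
    for (int l = 0; l < kWidth; ++l) {
        float r = remap_scalar(px[l], py[l], pz[l]);
        if ((mask >> l) & 1u) result[l] = r;  // masked store: inactive lanes untouched
    }
}
```

The design keeps one scalar source of truth for the math; only the loop wrapper knows about lanes and masks.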
  43. OSL Microbenchmarks: Speedup of SIMD AVX-512 OSL over Scalar OSL. [Chart: per-function speedup on a log scale from 0.125x to 16x for null, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, atan2, sincos, log, log2, log10, logb, exp, exp2, expm1, pow, erf, erfc, radians, degrees, sqrt, inversesqrt, hypot, abs, fabs, sign, floor, ceil, round, trunc, mod, min, max, clamp, mix, isnan, isfinite, select, dot, cross, length, distance, normalize, reflect, fresnel, rotate, transform, transform_matrix, matrix_object_camera, determinant, transpose, linearstep, smooth_linearstep, noise_perlin, noise_cell, noise_simplex, noise_gabor, pnoise_perlin, pnoise_cell, pnoise_gabor, spline_bezier, spline_bspline, spline_catmull-rom, spline_hermite, spline_linear, spline_constant.] Average: 6.9x; geomean: 6.14x. 48 threads on Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2). For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
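The chart reports both an average (6.9x) and a geometric mean (6.14x). For speedup ratios the geometric mean is a common summary because it is less sensitive to a few very large ratios and treats a 2x gain and a 0.5x loss symmetrically. The helpers below compute both for any list of speedups; the inputs in the usage note are illustrative, not the benchmark data.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of speedup ratios.
inline double arithmetic_mean(const std::vector<double>& s) {
    double sum = 0.0;
    for (double v : s) sum += v;
    return sum / s.size();
}

// Geometric mean, computed in log space to avoid overflow/underflow.
inline double geometric_mean(const std::vector<double>& s) {
    double log_sum = 0.0;
    for (double v : s) log_sum += std::log(v);
    return std::exp(log_sum / s.size());
}
```

For speedups {2, 8} the arithmetic mean is 5 while the geometric mean is 4; for {0.5, 2} they are 1.25 and 1, so a slowdown and an equal speedup cancel only under the geometric mean.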
  44. OSL SIMD Performance at Maximum Batch Utilization: OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (config 1). Speedup at maximum batch size: leopard 5.2x, concrete 6x, diamond 10x, oak 12x, marble 15x.
  45. SIMD OSL: Intel® AVX-512 vs. AVX2. OSL’s testshade running Intel® AVX-512 and AVX2 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (config 1). [Chart: speedup of AVX-512 over AVX2 for the leopard, concrete, diamond plate, oak, marble, thread and donut shaders; the values shown are 1.1x, 1.3x, 1.3x, 1.4x, 1.6x, 1.8x and 1.9x.]
  46. Evolution of SIMD OSL: Proof of Concept to Production, 2016–2019. Work spanned the SIMD OSL library, the SIMD OSL framework and SIMD OSL performance: • Intel® AVX-512-, AVX2- and AVX-specific libraries • Masking and scatter-gather • 17k+ tests • Improved performance on built-in functions • Compiler + platform support • Reduction in JIT time • Coverage for built-in function variants • Handling treacherous control flows • Noise functions with options • LLVM optimization passes to improve AVX2
  47. SIMD Open Shading Language and Open Shading Language: https://github.com/imageworks/OpenShadingLanguage, https://gitlab.com/intel-osl/BatchedOSL
  49. Intel® AVX-512 Performance vs. Batch Utilization: [Chart: speedup from batching versus batch utilization (batch 1 through batch 16) for marble, oak, diamond, concrete and leopard.] Performance gain grows with batch utilization, reaching 15x (marble), 12x (oak), 10x (diamond), 6x (concrete) and 5.2x (leopard). OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (config 1).
  50. RenderMan 22.4 Shading Speedup with SIMD OSL: Bonnie’s room 1.26x, Fillmore 1.37x, Bonnie 2.06x (CLX 8260L, 24 cores, 2.3GHz). Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2).
  51. RenderMan 22.4’s Overall Rendering Speedup with SIMD OSL: Bonnie’s room 1.11x, Fillmore 1.17x, Bonnie 1.27x (CLX 8260L, 24 cores, 2.3GHz). Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2).
  52. Bonnie • Real production character with 55 shader networks • 85,663 shader operations on 67,680 symbols (post-optimization) • [Chart: single-point vs. batched shading, annotated with Amdahl’s law] 66.64% batch utilization; 2.05x shading speedup. Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2).
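Amdahl’s law bounds the overall gain when only part of the work is accelerated: if a fraction p of the time runs s times faster, the total speedup is 1 / ((1 - p) + p / s). The slide does not spell out how it applies the law to the 66.64% batch-utilization figure, so the function below is only the generic formula, with illustrative inputs.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when fraction p (0..1) of the work is sped up by factor s.
inline double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For instance, doubling the speed of half the work yields only about 1.33x overall, and even an unbounded speedup on that half caps the total at 2x; this is why low batch utilization limits what SIMD shading can deliver end to end.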
  53. Performance Progression: three factors are at play: ● the efficiency of the generated vectorized shader code ● effective vectorization of the shading interface ● how effectively the renderer takes advantage of the vectorized shading language
  54. Efficiency in the Shading Language: most effort so far has gone into the quality of shader code generation ● masked control flow for vectorized execution ● optimization of noise and math functions ● optimization of texture calls
  55. Efficiency in the Shading API: the shading language calls into the renderer ● to access data (primvars, transforms, etc.) ● to compute things (texture interpolation, tracing rays, etc.) ● to return values ● All of the above is vectorized (batched) ● so we call across the API boundary fewer times
  56. Efficiency in the Renderer: we started with a vectorized renderer ● RIS is one of the few vectorized renderers in the industry that works on ray batches ● However, our batch granularity turns out not to enable effective vectorization ● The results we see today are a fraction of the benefit we could get
  57. Efficiency in the Renderer: What is efficient? ● Portions of the renderer where execution is coherent ● Displacement shading ● Camera-ray hits. What is inefficient? ● Indirect illumination ● Deep bounces
  58. Efficiency in the Renderer: [Chart: percentage of batches submitted versus batch occupancy (1 to 16 points) for 1, 2, 3, 5 and 9 bounces; annotated values 7.3%, 13.9%, 18.9%, 22.3%, 25.4% and 76.6%, 67.1%, 60.9%, 56.5%, 52.6%.] Pixar’s RenderMan* 22.dev running on all 40 threads of Intel® Xeon® Gold 6148 @ 2.4GHz (config 4).
  59. Efficiency in the Renderer: how do we currently accommodate low occupancy? ● We switch over to single-point evaluation for small batches ● We use a heuristic to determine when to switch ● A threshold of 4 active lanes tends to be a decent starting point ● This may change as more optimizations are made ● However, it would be best to guarantee high SIMD occupancy
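A minimal version of this heuristic counts the active lanes in a batch mask and falls back to single-point evaluation below a threshold. The names `ShadePath`, `active_lanes` and `choose_path` are hypothetical, and treating exactly 4 active lanes as "batched" is a guess at the slide's threshold semantics, not RenderMan's actual code.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

enum class ShadePath { SinglePoint, Batched };

// Number of active lanes in a 16-lane mask.
inline int active_lanes(uint16_t mask) {
    return static_cast<int>(std::bitset<16>(mask).count());
}

// Use batched (SIMD) shading only when enough lanes are active to pay off.
inline ShadePath choose_path(uint16_t mask, int threshold = 4) {
    return active_lanes(mask) >= threshold ? ShadePath::Batched : ShadePath::SinglePoint;
}
```

Keeping the threshold a parameter leaves room to retune it as the vectorized code gets faster, which the slide anticipates.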
  60. Towards a New Rendering Architecture: batches are currently determined by the bucket size used in bucket rendering ● The computational workload is uneven across the image ● Larger buckets give more points and higher occupancy ● But with larger buckets, one thread may be stuck rendering a single heavy bucket for a long time, reducing thread scaling ● A decent bucket size for good thread load balancing is 8x8 or 16x16 ● That is a batch size of 64–256 ● We would need a batch size of at least 2k–8k
  61. Towards a New Rendering Architecture: different options at hand ● Wavefront rendering ● Shading queues ● Non-image-space decomposition scheduling ● The new architecture is being implemented in Pixar’s RenderMan® XPU ● Stay tuned
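Shading queues, one of the options listed, regroup hit points by the shader they need so that batches are full before they are shaded. The sketch below buckets point IDs per shader ID and flushes a queue once it reaches the batch width; `ShadingQueues`, the integer IDs and the flush callback are hypothetical, and the sketch says nothing about how RenderMan XPU actually schedules work.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Collects shading points per shader and hands out full batches.
class ShadingQueues {
public:
    using Flush = std::function<void(int shader_id, const std::vector<int>& point_ids)>;

    ShadingQueues(std::size_t batch_width, Flush flush)
        : width_(batch_width), flush_(std::move(flush)) {}

    // Queue a point; shade its batch as soon as that shader's queue is full.
    void add(int shader_id, int point_id) {
        auto& q = queues_[shader_id];
        q.push_back(point_id);
        if (q.size() == width_) {
            flush_(shader_id, q);
            q.clear();
        }
    }

    // Shade whatever is left (partial batches), e.g. at the end of a bounce.
    void drain() {
        for (auto& [shader_id, q] : queues_) {
            if (!q.empty()) flush_(shader_id, q);
            q.clear();
        }
    }

private:
    std::size_t width_;
    Flush flush_;
    std::map<int, std::vector<int>> queues_;
};
```

Points arriving in any order still produce full batches per shader; only the final drain produces partial ones, which is the occupancy problem the previous slides describe, pushed to the end of the work.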
  62. OSL Shaders • Concrete: https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl; modification: float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0)); was changed to float grain=noise("gabor",p,8); • Leopard: https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl • Diamond plate: https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl • Thread: https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl • Donut: https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl • Oak: https://renderman.pixar.com/forum/download.php (Pixar’s RenderMan* examples, ./scenes/pattern/osl/shaders/oak.osl) • Marble: https://renderman.pixar.com/forum/download.php (Pixar’s RenderMan* examples, ./scenes/pattern/osl/shaders/marble.osl)
  63. Configurations
      • Config 1: Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz; 24 cores per socket, 2 sockets; memory 192GB, DDR4-2933 MHz (12 x 16GB); CPU power policy: Performance; Hyper-Threading: disabled; Turbo Boost: enabled; L1d cache 32K, L1i cache 32K, L2 cache 1024K, L3 cache 36608K; OS: Fedora release 27 (Twenty Seven); BIOS: SE5C620.86B.0D.01.0286.011120190816
      • Config 2: Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz; 24 cores per socket, 2 sockets; memory 192GB, DDR4-2933 MHz (12 x 16GB); CPU power policy: Performance; Hyper-Threading: enabled; Turbo Boost: enabled; L1d cache 32K, L1i cache 32K, L2 cache 1024K, L3 cache 33792K; OS: CentOS Linux release 7.6.1810 (Core); BIOS: SE5C620.86B.0D.01.0395.022720191340
      • Config 3: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz; 18 cores per socket, 2 sockets; memory 128GB, DDR4-2400 MHz (8 x 16GB); CPU power policy: Performance; Hyper-Threading: enabled; Turbo Boost: enabled; L1d cache 32K, L1i cache 32K, L2 cache 256K, L3 cache 46080K; OS: Red Hat Enterprise Linux Server release 7.2 (Maipo); BIOS: GRRFSDP1.86B0271.R00.1510301446
      • Config 4: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz; 20 cores per socket, 2 sockets; memory 192GB, DDR4-2666 MHz (12 x 16GB); CPU power policy: Powersave; Hyper-Threading: enabled; Turbo Boost: enabled; L1d cache 32K, L1i cache 32K, L2 cache 1024K, L3 cache 28160K; OS: CentOS Linux release 7.3.1611 (Core); BIOS: SE5C620.86B.01.00.0412.020920172159