From RenderMan 22.0® to Next-Gen RenderMan XPU and Beyond: Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions (Intel® AVX-512)
Presenters: Steena Monteiro (Intel) and Max Liani (Pixar Animation Studios)
Contributors: Alex M. Wells (Intel), Steena Monteiro (Intel), Louis Feng (Intel), Max Liani (Pixar Animation Studios), Stephen Friedman (Pixar Animation Studios), Larry Gritz (Sony Pictures Imageworks)
• This document contains information on products, services and/or processes in development. All information provided here is subject to change without
notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
intel.com.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks
• Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes.
Any differences in your system hardware, software or configuration may affect your actual performance.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm
whether referenced data are accurate.
• Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer
to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
• Intel, Xeon and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others
• © Intel Corporation.
Legal Disclaimers and Optimization Notices
Shading in Physically Based Rendering
Image credit Sony Pictures Imageworks
Shading Network
• Multiple reusable shading nodes
• Connect nodes to define complex materials
• Production shading networks can grow very large, to hundreds or thousands of nodes
C++ Shader Limitations
• Lack of context at compile time
• Input parameters unknown
• Geometry being shaded unknown
• Mode of shading unknown
• Surrounding shading network unknown
• Branchy testing required
• Lack of portability
• Requires “Performance Ninjas”
Image Credit: Ninja Working AT Desk from Vector.me (by Hector Gomez)
Open Shading Language
• Developed by Sony Pictures Imageworks*
• C-like DSL for programmable shading
• API to connect shaders into networks
• Open source
• http://github.com/imageworks/OpenShadingLanguage
• Sci-Tech Award* in 2017
Logo owned by Academy of Motion Picture Arts and Sciences for Infobox
*Other names and brands may be claimed as the property of others.
Poster images (c) Sony Pictures*, Paramount*, Warner Brothers*, Disney*, Fox*, Universal*
Example OSL Shader
shader marble (color Cin = .5,
               float freq = 1.0,
               output color Cout = 0)
{
    float sum = 0;
    float freqVal = freq;
    point Pshad = transform ("object", P);
    for (int i = 0; i < 6; i++)
    {
        sum = sum + 1/freqVal * abs(.5 - noise( 4 * freqVal * Pshad)) ;
        freqVal = 2 * freqVal;
    }
    Cout = Cin * sum;
}
Shader Globals (input set by renderer): P
Library Calls: transform, abs, noise
Motivation for SIMD Open Shading Language
In its native form, OSL is unable to leverage Intel® Advanced Vector Extensions 512 (Intel® AVX-512) on Intel® Xeon® processors
Intel has been leading the re-architecture of OSL since 2016
Image © Disney/Pixar
OSL Framework
• oslc (offline compiler): shader written in OSL → intermediate OSO (instructions + operands)
• Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): scene management, ray tracing/path tracing, light integration
• OSL Runtime: builds the shading network, then applies render-time optimization with LLVM* JIT (just-in-time compilation) and pre-compiled library functions, producing optimized x86-64
• Renderer and runtime interact through callbacks, Execute Shading Network (per point), and QueryOutputs
SIMD OSL’s Batched Interface
• Renderer to Shading System, single point: execute(ShaderGlobals, …) submits a single point; symbol_address(…) queries results
• New “batched” interface: execute_batch(ShaderGlobalsBatch, …) submits a batch of points; Wide<T>(symbol_address) queries a batch of results
• ShaderGlobalsBatch
• Uniform: context *’s, raytype, …
• Queue of varying: surface position, incident ray, surface normal, …
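The split between uniform and varying data in a batch can be pictured with a small structure-of-arrays sketch. This is an illustration only: BatchSketch, WideVec3Sketch, push_point, active_mask, and kWidth are hypothetical names rather than the OSL API, and the width of 16 assumes 16 float lanes per 512-bit register.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical batch width: 16 float lanes, as in one 512-bit register.
constexpr int kWidth = 16;

// Varying data stored as structure-of-arrays: one array per component.
struct WideVec3Sketch {
    std::array<float, kWidth> x{}, y{}, z{};
};

// Illustrative stand-in for a batched shader-globals record (not the OSL type).
struct BatchSketch {
    int raytype = 0;      // uniform: one value shared by every point
    WideVec3Sketch P;     // varying: surface position per point
    WideVec3Sketch I;     // varying: incident ray per point
    WideVec3Sketch N;     // varying: surface normal per point
    int size = 0;         // number of points queued so far
};

// Queue one surface position into the next free lane; returns the lane used.
inline int push_point(BatchSketch& b, float px, float py, float pz) {
    assert(b.size < kWidth);
    const int lane = b.size++;
    b.P.x[lane] = px;
    b.P.y[lane] = py;
    b.P.z[lane] = pz;
    return lane;
}

// Bit i is set when lane i holds a queued point.
inline std::uint32_t active_mask(const BatchSketch& b) {
    return (1u << b.size) - 1u;
}
```

A renderer would fill lanes until the batch is full or no more points are available, then submit the batch together with its active mask.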
SIMD OSL Framework
• Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): scene management, ray tracing/path tracing, light integration
• SIMD OSL Runtime: render-time optimization with LLVM* Wide JIT (just-in-time compilation), producing optimized Intel® AVX-512, AVX2, or AVX code
• Pre-compiled library functions, one set each for Intel® AVX-512, Intel® AVX2, and Intel® AVX
• Renderer and runtime interact through callbacks, Execute Shading Network (per point), and QueryOutputs
Components in SIMD OSL Render-time
• Render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64
• Wide Library
Wizard Oz Castle Clipart: https://www.clipart.email/clipart/wizard-of-oz-castle-clipart-18891.html
SIMD OSL’s Wide Library
// Return type is not shown on the slide; void is assumed here.
// The raw Vec3 pointer is named wVec_ so it does not clash with the accessor wVec.
void my_callback(void *wS, void *wM, void *wVec_, void *wVS, void *wVT,
                 unsigned int mask_value)
{
    Mask mask (mask_value);
    ASSERT(mask.any_on());
    // Accessors: transparent AOS view of SOA
    Wide<const float> wScale (wS);
    Wide<const Vec3> wVec (wVec_);
    Wide<const Matrix44> wMat (wM);
    Masked<Vec3> wVT_result (wVT, mask);
    Masked<Vec3> wVS_result (wVS, mask);
    for(int lane = 0; lane < __OSL_WIDTH; ++lane) {
        // Extract data from a lane of the SOA
        Vec3 V = wVec[lane];
        Float F = wScale[lane];
        Matrix44 M = wMat[lane];
        // Array subscript returns a proxy object to that lane;
        // the proxy skips the assignment if the lane is masked off
        wVS_result[lane] = V*F;
        wVT_result[lane] = transform(M,V);
    }
}
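The accessor idea (an AOS view over SOA storage, with a proxy that honours the mask) can be sketched in a few lines. This is a simplified illustration, not the OSL Wide library: Vec3S, WideVec3S, LaneProxy, MaskedVec3S, and kLanes are hypothetical names, and a width of 16 is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;  // assumed batch width

struct Vec3S { float x, y, z; };  // scalar (AOS) value

// SOA storage: each component lives in its own contiguous array.
struct WideVec3S {
    std::array<float, kLanes> x{}, y{}, z{};
    // Reading a lane gathers the three components into one AOS value.
    Vec3S operator[](int lane) const { return {x[lane], y[lane], z[lane]}; }
};

// Proxy returned by MaskedVec3S::operator[]; writes are dropped for inactive lanes.
class LaneProxy {
public:
    LaneProxy(WideVec3S& data, std::uint32_t mask, int lane)
        : data_(data), on_(((mask >> lane) & 1u) != 0), lane_(lane) {}
    LaneProxy& operator=(const Vec3S& v) {
        if (on_) {
            data_.x[lane_] = v.x;
            data_.y[lane_] = v.y;
            data_.z[lane_] = v.z;
        }
        return *this;
    }
private:
    WideVec3S& data_;
    bool on_;
    int lane_;
};

// Write-side view pairing SOA storage with a lane mask.
class MaskedVec3S {
public:
    MaskedVec3S(WideVec3S& data, std::uint32_t mask) : data_(data), mask_(mask) {}
    LaneProxy operator[](int lane) {
        assert(lane >= 0 && lane < kLanes);
        return LaneProxy(data_, mask_, lane);
    }
private:
    WideVec3S& data_;
    std::uint32_t mask_;
};
```

Because the mask test lives inside the proxy, the per-lane loop body reads like ordinary scalar code, which keeps it easy for a compiler to vectorize.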
Components in SIMD OSL Render-time
• Render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64
• Wide Library
• Divergent Control Flows
Divergent Control Flows
• Stack of masks: a condition is pushed when a block is entered and popped when it is left
• Effective mask: result of combining the stack
if (x > 0.5)                 // PUSH
{
    ...
    if (y > 0.5)             // PUSH
    {
        …
        if (powB > 0.23)     // PUSH
        {
            …
        }                    // POP
        else                 // NEGATE, PUSH
        {
            …
        }                    // POP
    } //y                    // POP
} //x                        // POP
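The stack-of-masks bookkeeping can be sketched as a small helper. This is an assumption-laden illustration rather than the code the OSL JIT emits: MaskStack is a hypothetical class, one bit stands for one lane, and "combining the stack" is taken to mean a bitwise AND of all entries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal mask stack for nested conditionals; one bit per lane.
class MaskStack {
public:
    explicit MaskStack(std::uint32_t all_lanes) : base_(all_lanes) {}

    // Entering a block: push the lanes for which its condition holds.
    void push(std::uint32_t condition) { stack_.push_back(condition); }

    // Leaving a block: discard its condition.
    void pop() {
        assert(!stack_.empty());
        stack_.pop_back();
    }

    // Effective mask: AND of the base mask with every stacked condition.
    std::uint32_t effective() const {
        std::uint32_t m = base_;
        for (std::uint32_t c : stack_) m &= c;
        return m;
    }

private:
    std::uint32_t base_;
    std::vector<std::uint32_t> stack_;
};
```

An else-branch follows the POP, NEGATE, PUSH sequence from the slides: pop the if-condition, then push its bitwise negation.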
Components in SIMD OSL Render-time
• Render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64
• Wide Library
• Divergent Control Flow
• Vectorized IR Generation
General LLVM Code Flow for OSL Operations
OSL:
• Retrieve symbols for operands
• Emit LLVM-defined operations, or call appropriate functions
• Store result
What changes in SIMD OSL
OSL:
• Retrieve symbols for operands
• Load values
• Initialize values
• Emit LLVM-defined operations, or call appropriate functions
• Store result
SIMD OSL, by operand and result type:
• Operands → Uniform, Results → Uniform: retrieve symbols for operands; call uniform function; store result
• Operands → Uniform, Results → Varying: retrieve symbols for operands; call uniform function; widen result; store result
• Operands → Varying, Results → Varying: retrieve symbols for operands; add effective mask to arguments; add address for results to arguments; call varying function
• Operands → Uniform and Varying, Results → Varying: retrieve symbols for operands; allocate a varying temp; widen uniform operands and store to varying temp; add effective mask to all arguments; add address for results to arguments; call varying function
• Operands → Varying, Results → Uniform: unreachable
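The case analysis above can be written down as a small classifier. This is a sketch of the decision only, with hypothetical names (CallPlan, choose_plan, widen); it does not emit LLVM IR and is not taken from the OSL sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which code path to generate for one operation.
enum class CallPlan {
    UniformCall,                   // call uniform function, store result
    UniformCallThenWiden,          // call uniform function, widen result, store
    VaryingCall,                   // pass mask and result address, call varying function
    WidenOperandsThenVaryingCall,  // widen uniform operands into varying temps first
    Unreachable                    // varying operands with a uniform result
};

// Pick a plan from the uniform/varying status of the operands and the result.
inline CallPlan choose_plan(const std::vector<bool>& operand_is_varying,
                            bool result_is_varying) {
    bool any_varying = false;
    bool all_varying = true;
    for (bool v : operand_is_varying) {
        any_varying = any_varying || v;
        all_varying = all_varying && v;
    }
    if (!any_varying)
        return result_is_varying ? CallPlan::UniformCallThenWiden
                                 : CallPlan::UniformCall;
    if (!result_is_varying)
        return CallPlan::Unreachable;
    return all_varying ? CallPlan::VaryingCall
                       : CallPlan::WidenOperandsThenVaryingCall;
}

// "Widen": replicate one uniform value into every lane of a varying temp.
inline std::vector<float> widen(float uniform_value, int width) {
    assert(width > 0);
    return std::vector<float>(static_cast<std::size_t>(width), uniform_value);
}
```

Keeping uniform operations scalar wherever possible means only the genuinely per-point work pays the cost of wide registers and masking.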
Components in SIMD OSL Render-time
• Render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86-64
• Wide Library
• Divergent Control Flow
• Vectorized IR Generation
• “For-each-unique” algorithm
For-Each-Unique Algorithm
if (layer == 1) file = "r.tex";
if (layer == 2) file = "g.tex";
if (layer == 3) file = "r.tex";
if (layer == 4) file = "g.tex";
wrap_mode = (layer%2==0)?"clamp":"mirror";
texture(file, u, v, "wrap", wrap_mode);
[Figure: per-lane rows for layer, file, wrap, and mask; only the layer row survives: 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1 4]
• JIT’d binning
• Full flexibility: BatchedRendererServices
• 1st pass: texture("r.tex", "mirror", …);
• 2nd pass: texture("g.tex", "clamp", …);
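The binning step can be sketched as a loop that peels off one unique value at a time. This is an illustrative scalar version under stated assumptions, not the JIT-generated code: for_each_unique is a hypothetical helper, the key is a single string per lane, and at most 32 lanes are supported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Calls fn(value, lanes) once per distinct value among the active lanes.
// `lanes` has bit i set for every active lane i holding that value.
// Returns the number of passes (distinct values) executed.
inline int for_each_unique(
    const std::vector<std::string>& values, std::uint32_t active,
    const std::function<void(const std::string&, std::uint32_t)>& fn) {
    assert(values.size() <= 32);
    std::uint32_t remaining = active;
    int passes = 0;
    for (std::size_t lead = 0; lead < values.size(); ++lead) {
        if (((remaining >> lead) & 1u) == 0) continue;  // already handled or inactive
        // Collect every remaining lane that shares the lead lane's value.
        std::uint32_t bin = 0;
        for (std::size_t lane = lead; lane < values.size(); ++lane) {
            if (((remaining >> lane) & 1u) != 0 && values[lane] == values[lead])
                bin |= (1u << lane);
        }
        fn(values[lead], bin);  // one uniform-argument call for the whole bin
        remaining &= ~bin;
        ++passes;
    }
    return passes;
}
```

With the slide's shader, a batch whose lanes resolve to only "r.tex" and "g.tex" needs two texture passes instead of one call per lane.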
Components in SIMD OSL Render-time
• Render-time optimization with LLVM* JIT (just-in-time compilation), producing optimized x86
• Wide Library
• Divergent Control Flows
• Vectorized IR Generation
• “For-each-unique” algorithm
• SIMD OSL built-ins
SIMD OSL’s Perlin Noise
Scalar computation with scalar data types:
inline void operator() (float &result, const Vec3 &p) const
{
    HashScalar h;
    perlin(result, h, p.x, p.y, p.z);
    result = 0.5f * (result + 1.0f);
}
Block vectorization with intrinsics
Explicit outer loop vectorization (Intel® C++ Compiler) (Clang 5+):
template<int WidthT> void operator() (MaskedAccessor<float, WidthT> wresult,
                                      ConstWideAccessor<Vec3, WidthT> wp) const {
    #pragma forceinline recursive
    {
        #pragma omp simd simdlen(WidthT)
        for(int l=0; l< WidthT; ++l) {
            Vec3 p = wp[l];
            float perlinResult;
            HashScalar h;
            perlin_scalar(perlinResult, h, p.x, p.y, p.z);
            float scaledResult = 0.5f * (perlinResult + 1.0f);
            wresult[l] = scaledResult;
        }
    }
}
OSL Microbenchmarks: Speedup of SIMD AVX-512 OSL over Scalar OSL
[Bar chart; speedup on a log scale from 0.125 to 16, one bar per built-in: null, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, atan2, sincos, log, log2, log10, logb, exp, exp2, expm1, pow, erf, erfc, radians, degrees, sqrt, inversesqrt, hypot, abs, fabs, sign, floor, ceil, round, trunc, mod, min, max, clamp, mix, isnan, isfinite, select, dot, cross, length, distance, normalize, reflect, fresnel, rotate, transform, transform_matrix, matrix_object_camera, determinant, transpose, linearstep, smooth_linearstep, noise_perlin, noise_cell, noise_simplex, noise_gabor, pnoise_perlin, pnoise_cell, pnoise_gabor, spline_bezier, spline_bspline, spline_catmull-rom, spline_hermite, spline_linear, spline_constant]
48 threads on Intel(R) Xeon(R) Platinum 8260L CPU @2.30GHz (config 2)
Average: 6.9x
Geomean: 6.14x
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
OSL SIMD Performance at Maximum Batch Utilization
OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @2.40 GHz (config 1)
[Bar chart; speedup at max batch size, axis 0 to 16]
• leopard: 5.2x
• concrete: 6x
• diamond: 10x
• oak: 12x
• marble: 15x
SIMD OSL Intel® AVX-512 vs. AVX2
OSL’s testshade running Intel® AVX-512 and AVX2 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @2.40 GHz (config 1)
[Bar chart; speedup axis 0 to 2.0; shaders: leopard, concrete, diamond plate, oak, marble, thread, donut]
Bar labels as extracted (per-shader assignment not recoverable): 1.6x, 1.9x, 1.1x, 1.3x, 1.3x, 1.4x, 1.8x
Evolution of SIMD OSL: Proof of Concept to Production, 2016‒2019
Tracks: SIMD OSL Library, SIMD OSL Framework, SIMD OSL Performance
• Intel® AVX-512, AVX2, AVX-specific libraries
• Masking and scatter-gather
• 17k+ tests
• Improved performance on built-in functions
• Compiler + platform support
• Reduction in JIT time
• Coverage for built-in function variants
• Handling treacherous control flows
• Noise functions with options
• LLVM optimization passes to improve AVX2
Open Shading Language: https://github.com/imageworks/OpenShadingLanguage
SIMD Open Shading Language: https://gitlab.com/intel-osl/BatchedOSL
Intel® AVX-512 Performance vs. Batch Utilization
Performance gain with increased batch utilization
[Line chart; speedup from batching (axis 0 to 15) against batch 1 through batch 16, one line per shader]
Labelled speedups: marble 15x, oak 12x, diamond 10x, concrete 6x, leopard 5.2x
OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @2.40 GHz (config 1)
22.4 Shading Speedup with SIMD OSL
[Bar chart; speedup axis 1 to 2.2; CLX8260L (24c, 2.3GHz); values in chart order]
• Bonnie’s room: 1.26x
• Fillmore: 1.37x
• Bonnie: 2.06x
Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2)
22.4’s Overall Rendering Speedup with SIMD OSL
[Bar chart; speedup axis 1 to 1.3; CLX8260L (24c, 2.3GHz); values in chart order]
• Bonnie’s room: 1.11x
• Fillmore: 1.17x
• Bonnie: 1.27x
Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2)
Bonnie
• Real production character with 55 shader networks
• 85663 shader operations on 67680 symbols (post-optimization)
• Single point vs. batched: 66.64% batch utilization, 2.05x shading speedup
• Amdahl’s Law
Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2)
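Amdahl's Law ties the shading speedup to the overall rendering speedup. As a rough worked example, assume shading is the only part of the render that gets faster, and take the Bonnie values from the two speedup charts (2.06x shading, 1.27x overall). The shading fraction f below is implied by those two numbers under that assumption; it is not a figure stated in the slides.

```latex
S_{\text{overall}} = \frac{1}{(1-f) + f/s}
\quad\Longrightarrow\quad
f = \frac{1 - 1/S_{\text{overall}}}{1 - 1/s}
  = \frac{1 - 1/1.27}{1 - 1/2.06}
  \approx \frac{0.213}{0.515}
  \approx 0.41
```

Under this assumption, shading would account for roughly 41% of Bonnie's render time, which is consistent with a 2x shading gain showing up as about 1.27x overall.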
Performance Progression
3 factors at play:
● Efficiency of the generated vectorized shader code
● Effective vectorization of the shading interface
● How effective is the renderer in taking advantage
of the vectorized shading language
Efficiency in the shading language
Most effort up to now on the quality
of the shader code generation
● Masked control flow for
vectorized execution
● Optimization of noises and math
functions
● Optimization of texture calls.
Efficiency in the Shading API
The shading language calls into the renderer
● To access data: primvars, transforms, etc.
● To compute things: texture interpolation, trace rays, etc.
● To return values
● All of the above is nicely vectorized (batched)
● We call across the API boundaries fewer times
Efficiency in the Renderer
We started with a vectorized renderer
● RIS is one of the few vectorized renderers in the industry that works on ray batches
● It turns out that our batch granularity does not enable effective vectorization
● The results we see today are a fraction of the benefit we could get
Efficiency in the Renderer
What is efficient?
● Portions of the renderer where execution is coherent
● Displacement shading
● Camera ray hits
What is inefficient?
● Indirect illumination
● Deep bounces
Efficiency in the Renderer
[Chart; % of batches submitted (axis 0 to 80) by batch size, 1 point through 16 points, at 1, 2, 3, 5, and 9 bounces]
Labelled values across 1, 2, 3, 5, and 9 bounces: 7.3%, 13.9%, 18.9%, 22.3%, 25.4% and 76.6%, 67.1%, 60.9%, 56.5%, 52.6%
Pixar’s RenderMan* 22.dev running on all 40 threads of Intel® Xeon® Gold 6148 @2.4 GHz (config 4)
Efficiency in the Renderer
How do we currently accommodate low occupancy?
● We switch over to single point evaluation for small batches.
● We use a heuristic to determine when to switch.
● A threshold of 4 active lanes tends to be a decent starting point.
● This may change as more optimizations are done.
● However, it would be best to guarantee high SIMD occupancy.
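The low-occupancy fallback can be sketched as a popcount on the active-lane mask. This is a minimal illustration with hypothetical names (ShadePath, choose_path). The default threshold of 4 comes from the slide, while treating "fewer than the threshold" as the switch condition is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

enum class ShadePath { SinglePoint, Batched };

// Pick the execution path for one batch from its active-lane mask.
// Batches with fewer active lanes than `threshold` fall back to
// single point evaluation; the threshold is meant to be tuned.
inline ShadePath choose_path(std::uint32_t active_mask, int threshold = 4) {
    assert(threshold >= 0);
    const int active = static_cast<int>(std::bitset<32>(active_mask).count());
    return active < threshold ? ShadePath::SinglePoint : ShadePath::Batched;
}
```

Keeping the threshold a parameter reflects the slide's point that the best switch-over value may move as the batched path gets faster.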
Towards a new Rendering Architecture
Batches are currently determined by the size of bucket rendering
● Computational workload is uneven throughout the image
● Larger buckets give more points and higher occupancy
● Larger buckets mean one thread may be stuck rendering a single heavy bucket for a long time, reducing thread scaling
● A decent bucket size for good thread load balancing is 8x8 or 16x16.
● This is a batch size of 64-256.
● We would need a batch size of at least 2k-8k.
Towards a new Rendering Architecture
Different options at hand
● Wavefront rendering
● Shading queues
● Non image-space decomposition scheduling
● The new architecture is being implemented in Pixar’s RenderMan® XPU
● Stay tuned
OSL Shaders
• Concrete - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl
• Modifications:
< float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0));
---
> float grain=noise("gabor",p,8);
• Leopard - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl
• Diamond plate - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl
• Thread - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl
• Donut - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl
• Oak - https://renderman.pixar.com/forum/download.php
• Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/oak.osl
• Marble - https://renderman.pixar.com/forum/download.php
• Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/marble.osl
Configurations
Item | Config 1 | Config 2 | Config 3 | Config 4
Model name | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Core(s) per socket | 24 | 24 | 18 | 20
Socket(s) | 2 | 2 | 2 | 2
Memory | 192GB, DDR4-2933 MHz (12 x 16GB) | 192GB, DDR4-2933 MHz (12 x 16GB) | 128GB, DDR4-2400 MHz (8 x 16GB) | 192GB, DDR4-2666 MHz (12 x 16GB)
CPU Power Policy | Performance | Performance | Performance | Powersave
Hyperthreading | Disabled | Enabled | Enabled | Enabled
Turbo Boost Tech | Enabled | Enabled | Enabled | Enabled
L1d cache | 32K | 32K | 32K | 32K
L1i cache | 32K | 32K | 32K | 32K
L2 cache | 1024K | 1024K | 256K | 1024K
L3 cache | 36608K | 33792K | 46080K | 28160K
Operating System | Fedora release 27 (Twenty Seven) | CentOS Linux release 7.6.1810 (Core) | Red Hat Enterprise Linux Server release 7.2 (Maipo) | CentOS Linux release 7.3.1611 (Core)
BIOS Version | SE5C620.86B.0D.01.0286.011120190816 | SE5C620.86B.0D.01.0395.022720191340 | GRRFSDP1.86B0271.R00.1510301446 | SE5C620.86B.01.00.0412.020920172159
• Subtitle Copy Goes Here

More Related Content

What's hot

What's hot (20)

Stroke in children
Stroke in childrenStroke in children
Stroke in children
 
365 Teachings from the New Revelation for Yesterday, Today and Forever (part 1)
365 Teachings from the New Revelation for Yesterday, Today and Forever  (part 1)365 Teachings from the New Revelation for Yesterday, Today and Forever  (part 1)
365 Teachings from the New Revelation for Yesterday, Today and Forever (part 1)
 
Neonatal metabolic encephalopathy
Neonatal metabolic encephalopathyNeonatal metabolic encephalopathy
Neonatal metabolic encephalopathy
 
Rhabdomyolysis
RhabdomyolysisRhabdomyolysis
Rhabdomyolysis
 
CSF PHYSIOLOGY ANALYSIS NORMAL AND DISEASE
CSF PHYSIOLOGY ANALYSIS NORMAL AND DISEASE CSF PHYSIOLOGY ANALYSIS NORMAL AND DISEASE
CSF PHYSIOLOGY ANALYSIS NORMAL AND DISEASE
 
Cerebral blood flow
Cerebral blood flowCerebral blood flow
Cerebral blood flow
 
Hie
HieHie
Hie
 
Op poisoning - ICU management.Is it straight forward?
Op poisoning - ICU management.Is it straight forward?Op poisoning - ICU management.Is it straight forward?
Op poisoning - ICU management.Is it straight forward?
 
Getting Started with Innoslate
Getting Started with InnoslateGetting Started with Innoslate
Getting Started with Innoslate
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Domain-Driven Data
Domain-Driven DataDomain-Driven Data
Domain-Driven Data
 
Anemia and Preemies: Contemporary Approach to Diagnostics, Preventive Measure...
Anemia and Preemies: Contemporary Approach to Diagnostics, Preventive Measure...Anemia and Preemies: Contemporary Approach to Diagnostics, Preventive Measure...
Anemia and Preemies: Contemporary Approach to Diagnostics, Preventive Measure...
 
Pediatric stroke
Pediatric strokePediatric stroke
Pediatric stroke
 
Pediatric stroke evaluation ;management
Pediatric stroke  evaluation ;managementPediatric stroke  evaluation ;management
Pediatric stroke evaluation ;management
 
Congenital adrenal hyperplasia
Congenital adrenal hyperplasia Congenital adrenal hyperplasia
Congenital adrenal hyperplasia
 
Diabetic ketoacidosis in children
Diabetic ketoacidosis in childrenDiabetic ketoacidosis in children
Diabetic ketoacidosis in children
 
Simplifying MBSE Tasks with Capella and MapleMBSE
Simplifying MBSE Tasks with Capella and MapleMBSESimplifying MBSE Tasks with Capella and MapleMBSE
Simplifying MBSE Tasks with Capella and MapleMBSE
 
Ductus dependent circulation
Ductus dependent circulationDuctus dependent circulation
Ductus dependent circulation
 
Seminar heart diseases in preg
Seminar heart diseases in pregSeminar heart diseases in preg
Seminar heart diseases in preg
 
CVD
CVDCVD
CVD
 

Similar to RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions | SIGGRAPH 2019 Technical Sessions

0xdroid -- community-developed Android distribution by 0xlab
0xdroid -- community-developed Android distribution by 0xlab0xdroid -- community-developed Android distribution by 0xlab
0xdroid -- community-developed Android distribution by 0xlab
National Cheng Kung University
 
Web Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfWeb Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdf
SamHoney6
 

RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions | SIGGRAPH 2019 Technical Sessions

  • 1. From RenderMan 22.0® to Next-Gen RenderMan XPU and Beyond: Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions (Intel® AVX-512). Presenters: Steena Monteiro (Intel) and Max Liani (Pixar Animation Studios). Contributors: Alex M. Wells (Intel), Steena Monteiro (Intel), Louis Feng (Intel), Max Liani (Pixar Animation Studios), Stephen Friedman (Pixar Animation Studios), Larry Gritz (Sony Pictures Imageworks)
  • 2. • This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. • Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks • Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. • Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. • Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. 
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. • Intel, Xeon and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. • *Other names and brands may be claimed as the property of others • © Intel Corporation. Legal Disclaimers and Optimization Notices 2
  • 3. Shading in Physically Based Rendering. Image credit: Sony Pictures Imageworks
  • 4. Shading Network • Multiple reusable shading nodes • Connect nodes to define complex materials • Production shading networks can grow very large, to hundreds or thousands of nodes.
  • 5. C++ Shader Limitations • Lack of context at compile time • Input parameters unknown • Geometry being shaded unknown • Mode of shading unknown • Surrounding shading network unknown • Branchy testing required • Lack of portability • Requires “Performance Ninjas”. Image credit: Ninja Working at Desk from Vector.me (by Hector Gomez)
  • 6. Open Shading Language • Developed by Sony Pictures Imageworks* • C-like DSL for programmable shading • API to connect shaders into networks • Open source • http://github.com/imageworks/OpenShadingLanguage • Sci-Tech Award* in 2017. Logo owned by Academy of Motion Picture Arts and Sciences. *Other names and brands may be claimed as the property of others.
  • 7. Poster images (c) Sony Pictures*, Paramount*, Warner Brothers*, Disney*, Fox*, Universal*
  • 8. Example OSL Shader. Callouts: Shader Globals (input set by renderer); Library Calls.
      shader marble (color Cin = .5, float freq = 1.0, output color Cout = 0)
      {
          float sum = 0;
          float freqVal = freq;
          point Pshad = transform ("object", P);
          for (int i = 0; i < 6; i++) {
              sum = sum + 1/freqVal * abs(.5 - noise(4 * freqVal * Pshad));
              freqVal = 2 * freqVal;
          }
          Cout = Cin * sum;
      }
  • 9. Motivation for SIMD Open Shading Language. In its native form, OSL is unable to leverage Intel® Advanced Vector Extensions (Intel® AVX-512) on Intel® Xeon®. Intel has been leading the re-architecture of OSL since 2016. Image © Disney/Pixar. *Other names and brands may be claimed as the property of others.
  • 10. OSL Framework. Offline: a shader written in OSL is compiled by oslc (offline compiler) to intermediate OSO (instructions + operands). Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): Scene Management, Ray Tracing/Path Tracing, Light Integration. OSL Runtime: Build Shading Network; Render Time Optimization with LLVM* JIT (Just In Time Compilation); pre-compiled library functions; Execute Shading Network (per Point) as optimized x86-64; callbacks; QueryOutputs. *Other names and brands may be claimed as the property of others.
  • 11. SIMD OSL’s Batched Interface. Between Renderer and Shading System, the existing interface submits a single point with execute(ShaderGlobals, …) and queries results with symbol_address(…). The new “Batched” interface submits a batch of points with execute_batch(ShaderGlobalsBatch, …) and queries a batch of results with Wide<T>(symbol_address). ShaderGlobalsBatch holds Uniform data (context *’s, Raytype, …) and a queue of Varying data (Surface Position, Incident Ray, Surface Normal, …).
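The uniform/varying split on slide 11 can be pictured as a structure-of-arrays batch. The following is a minimal sketch, not the OSL API: `GlobalsBatch`, `PointGlobals`, `WideVec3`, `make_batch`, and the width of 16 are all hypothetical names and values chosen for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical batch width: 16 float lanes fill one 512-bit register.
constexpr int kWidth = 16;

struct Vec3 { float x, y, z; };

// Scalar-style input: one shading point (array-of-structures element).
struct PointGlobals {
    Vec3 P;       // surface position
    Vec3 N;       // surface normal
    int raytype;  // assumed identical for all points queued together
};

// One Vec3 per lane, stored component-wise so each component is contiguous.
struct WideVec3 {
    std::array<float, kWidth> x, y, z;
};

// Batched-style input: uniform data stored once, varying data stored per lane.
struct GlobalsBatch {
    int raytype = 0;           // uniform
    WideVec3 P;                // varying
    WideVec3 N;                // varying
    std::uint32_t active = 0;  // bit i set => lane i holds a valid point
};

// Queue up to kWidth points into one batch (AOS -> SOA transposition).
inline GlobalsBatch make_batch(const PointGlobals* pts, int count) {
    GlobalsBatch b{};
    if (count > kWidth) count = kWidth;
    for (int lane = 0; lane < count; ++lane) {
        b.P.x[lane] = pts[lane].P.x;
        b.P.y[lane] = pts[lane].P.y;
        b.P.z[lane] = pts[lane].P.z;
        b.N.x[lane] = pts[lane].N.x;
        b.N.y[lane] = pts[lane].N.y;
        b.N.z[lane] = pts[lane].N.z;
        b.active |= (1u << lane);
    }
    if (count > 0) b.raytype = pts[0].raytype;
    return b;
}
```

Storing each component contiguously is what lets a single vector load fetch the same component for all lanes; a partially filled batch simply leaves the upper bits of `active` clear.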
  • 12. SIMD OSL Framework. Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): Scene Management, Ray Tracing/Path Tracing, Light Integration. SIMD OSL Runtime: Render Time Optimization with LLVM* Wide JIT (Just In Time Compilation); Execute Shading Network (per Point), optimized for Intel® AVX-512, AVX2, or AVX; callbacks; QueryOutputs. Pre-compiled library functions are provided for Intel® AVX-512, Intel® AVX2, and Intel® AVX. *Other names and brands may be claimed as the property of others.
  • 13. Components in SIMD OSL Render-time. Render Time Optimization with LLVM* JIT (Just In Time Compilation) produces optimized x86-64. Component introduced: Wide Library. Wizard Oz Castle Clipart: https://www.clipart.email/clipart/wizard-of-oz-castle-clipart-18891.html. *Other names and brands may be claimed as the property of others.
  • 14. SIMD OSL’s Wide Library. Annotation: accessors give a transparent AOS view of SOA data.
      void my_callback(void *wS, void *wM, void *wV, void *wVS, void *wVT,
                       unsigned int mask_value)
      {
          Mask mask(mask_value);
          ASSERT(mask.any_on());
          Wide<const float> wScale(wS);
          Wide<const Vec3> wVec(wV);
          Wide<const Matrix44> wMat(wM);
          Masked<Vec3> wVT_result(wVT, mask);
          Masked<Vec3> wVS_result(wVS, mask);
          for (int lane = 0; lane < __OSL_WIDTH; ++lane) {
              Vec3 V = wVec[lane];
              Float F = wScale[lane];
              Matrix44 M = wMat[lane];
              wVS_result[lane] = V * F;
              wVT_result[lane] = transform(M, V);
          }
      }
  • 15. SIMD OSL’s Wide Library (same code as slide 14). Added annotation: reading wVec[lane], wScale[lane], and wMat[lane] extracts data from a lane of the SOA.
  • 16. SIMD OSL’s Wide Library (same code as slide 14). Added annotation: the array subscript on a Masked accessor returns a proxy object to that lane.
  • 17. SIMD OSL’s Wide Library (same code as slide 14). Added annotation: assigning through the proxy skips the assignment if the lane is masked off.
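To make the accessor idea on slides 14 to 17 concrete, here is a minimal sketch of a read accessor and a masked write accessor over SOA storage. It is not the OSL Wide Library: `Vec3Block`, `WideVec3`, `MaskedVec3`, `LaneProxy`, `scale_lanes`, and the width of 16 are hypothetical names used only to illustrate the proxy-on-subscript technique.

```cpp
#include <cstdint>

constexpr int kWidth = 16;  // assumed lane count

struct Vec3 { float x, y, z; };

// SOA storage for kWidth Vec3 values: each component is contiguous.
struct Vec3Block { float x[kWidth], y[kWidth], z[kWidth]; };

// Read-only accessor: operator[] gathers one lane into an ordinary Vec3.
class WideVec3 {
public:
    explicit WideVec3(const void* p) : b_(static_cast<const Vec3Block*>(p)) {}
    Vec3 operator[](int lane) const {
        return {b_->x[lane], b_->y[lane], b_->z[lane]};
    }
private:
    const Vec3Block* b_;
};

// Write accessor: operator[] returns a proxy whose assignment honors the mask.
class MaskedVec3 {
public:
    class LaneProxy {
    public:
        LaneProxy(Vec3Block* b, int lane, bool on) : b_(b), lane_(lane), on_(on) {}
        void operator=(const Vec3& v) const {
            if (!on_) return;  // lane masked off: skip the store
            b_->x[lane_] = v.x;
            b_->y[lane_] = v.y;
            b_->z[lane_] = v.z;
        }
    private:
        Vec3Block* b_;
        int lane_;
        bool on_;
    };

    MaskedVec3(void* p, std::uint32_t mask)
        : b_(static_cast<Vec3Block*>(p)), mask_(mask) {}

    LaneProxy operator[](int lane) {
        return LaneProxy(b_, lane, ((mask_ >> lane) & 1u) != 0);
    }
private:
    Vec3Block* b_;
    std::uint32_t mask_;
};

// Callback-style use: the loop body reads like scalar AOS code.
inline void scale_lanes(const void* wIn, float s, void* wOut, std::uint32_t mask) {
    WideVec3 in(wIn);
    MaskedVec3 out(wOut, mask);
    for (int lane = 0; lane < kWidth; ++lane) {
        Vec3 v = in[lane];
        out[lane] = Vec3{v.x * s, v.y * s, v.z * s};
    }
}
```

The design point is that the loop body never mentions the storage layout or the mask; both are hidden behind the subscript operators.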
  • 18. Components in SIMD OSL Render-time. Render Time Optimization with LLVM* JIT (Just In Time Compilation) produces optimized x86-64. Components so far: Wide Library; Divergent Control Flows. *Other names and brands may be claimed as the property of others.
  • 19. Divergent Control Flows. A stack of masks is maintained; the effective mask is the result of combining the stack. Example code:
      if (x > 0.5) {
          ...
          if (y > 0.5) {
              ...
              if (powB > 0.23) {
                  ...
              } else {
                  ...
              }
          } //y
      } //x
  • 20. Divergent Control Flows (same code as slide 19). PUSH, on entering if (x > 0.5).
  • 21. Divergent Control Flows (same code as slide 19). PUSH, on entering if (y > 0.5).
  • 22. Divergent Control Flows (same code as slide 19). PUSH, on entering if (powB > 0.23).
  • 23. Divergent Control Flows (same code as slide 19). POP, on leaving the powB "then" block.
  • 24. Divergent Control Flows (same code as slide 19). NEGATE and PUSH, on entering the else block.
  • 25. Divergent Control Flows (same code as slide 19). POP, on leaving the else block.
  • 26. Divergent Control Flows (same code as slide 19). POP, at } //y.
  • 27. Divergent Control Flows (same code as slide 19). POP, at } //x. The stack of masks is back to its initial state, and so is the effective mask.
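The PUSH/POP/NEGATE sequence on slides 19 to 27 can be sketched with a small mask stack. This is an illustration under assumptions, not the SIMD OSL implementation: `MaskStack` and `else_branch_mask` are hypothetical names, one bit per lane is assumed, and "combining the stack" is taken to mean a bitwise AND of all entries.

```cpp
#include <cstdint>
#include <vector>

using Mask = std::uint32_t;  // one bit per lane (assumed encoding)

class MaskStack {
public:
    explicit MaskStack(Mask initial) : base_(initial) {}

    void push(Mask cond) { stack_.push_back(cond); }
    void pop() { stack_.pop_back(); }  // caller must keep pushes and pops balanced

    // Effective mask: lanes active on entry AND in every enclosing condition.
    Mask effective() const {
        Mask m = base_;
        for (Mask c : stack_) m &= c;
        return m;
    }

private:
    Mask base_;
    std::vector<Mask> stack_;
};

// Follows the slide sequence for the nested ifs and returns the effective
// mask seen inside the else block.
inline Mask else_branch_mask(Mask all, Mask x_true, Mask y_true, Mask pow_true) {
    MaskStack s(all);
    s.push(x_true);     // if (x > 0.5)
    s.push(y_true);     // if (y > 0.5)
    s.push(pow_true);   // if (powB > 0.23)
    s.pop();            // leave the "then" block
    s.push(~pow_true);  // else: push the negated condition
    Mask in_else = s.effective();
    s.pop();            // leave else
    s.pop();            // } //y
    s.pop();            // } //x
    return in_else;
}
```

For example, with four lanes active (`0xF`), `x_true = 0xE`, `y_true = 0x6`, and `pow_true = 0x4`, only lane 1 executes the else block, so the function returns `0x2`.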
  • 28. Components in SIMD OSL Render-time. Render Time Optimization with LLVM* JIT (Just In Time Compilation) produces optimized x86-64. Components so far: Wide Library; Divergent Control Flow; Vectorized IR Generation. *Other names and brands may be claimed as the property of others.
  • 29. General LLVM Code Flow for OSL Operations. OSL: Retrieve symbols for Operands; Emit LLVM-defined operations OR Call appropriate functions; Store Result.
  • 30. What changes in SIMD OSL. OSL: Retrieve symbols for Operands; Load values; Initialize values; Emit LLVM-defined operations OR Call appropriate functions; Store Result. Four cases follow: Operands → Uniform, Results → Uniform; Operands → Uniform, Results → Varying; Operands → Varying, Results → Uniform; Operands → Varying, Results → Varying.
  • 31. What changes in SIMD OSL. Operands → Uniform, Results → Uniform. SIMD OSL: Retrieve symbols for Operands; Call uniform function; Store Result.
  • 32. What changes in SIMD OSL. Operands → Uniform, Results → Varying. SIMD OSL: Retrieve symbols for Operands; Call uniform function; Widen Result; Store Result.
  • 33. What changes in SIMD OSL. Operands → Varying, Results → Varying. SIMD OSL: Retrieve symbols for Operands; Add effective mask to arguments; Call varying function; Add address for Results to arguments.
  • 34. What changes in SIMD OSL. Operands → Uniform and Varying, Results → Varying. SIMD OSL: Retrieve symbols for Operands; Add effective mask to all arguments; Call varying function; Add address for Results to arguments; Allocate a varying temp; Widen uniform Operands and store to varying temp.
  • 35. What changes in SIMD OSL. Operands → Varying, Results → Uniform: Unreachable.
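The uniform/varying cases on slides 31 to 34 can be sketched in plain C++ for a single "add" operation. This is only an illustration of the calling pattern, not generated LLVM IR or OSL code: `Wide`, `widen`, `add_uniform`, `add_varying`, `add_mixed`, and the width of 16 are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr int kWidth = 16;  // assumed lane count
using Wide = std::array<float, kWidth>;

// "Widen": broadcast a uniform value to every lane.
inline Wide widen(float u) {
    Wide w;
    w.fill(u);
    return w;
}

// Uniform operands, uniform result: one scalar call serves the whole batch.
inline float add_uniform(float a, float b) { return a + b; }

// Varying operands, varying result: the result address and the effective
// mask travel as extra arguments, and masked-off lanes are left untouched.
inline void add_varying(const Wide& a, const Wide& b, Wide* result,
                        std::uint32_t mask) {
    for (int lane = 0; lane < kWidth; ++lane)
        if ((mask >> lane) & 1u) (*result)[lane] = a[lane] + b[lane];
}

// Mixed uniform and varying operands: widen the uniform operand into a
// varying temporary, then take the varying path.
inline void add_mixed(float a_uniform, const Wide& b, Wide* result,
                      std::uint32_t mask) {
    Wide temp = widen(a_uniform);
    add_varying(temp, b, result, mask);
}
```

The uniform-operands, varying-result case from slide 32 is then `widen(add_uniform(a, b))`: compute once, broadcast, store.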
  • 36. Components in SIMD OSL Render-time. Render Time Optimization with LLVM* JIT (Just In Time Compilation) produces optimized x86-64. Components so far: Wide Library; Divergent Control Flow; Vectorized IR Generation; “For-each-unique” algorithm. *Other names and brands may be claimed as the property of others.
  • 37. For-Each-Unique Algorithm. Figure: per-lane rows for layer, file, Mask, and wrap (extracted values: 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1 4). Example code:
      if (layer == 1) file = "r.tex";
      if (layer == 2) file = "g.tex";
      if (layer == 3) file = "r.tex";
      if (layer == 4) file = "g.tex";
      wrap_mode = (layer%2==0) ? "clamp" : "mirror";
      texture(file, u, v, "wrap", wrap_mode);
  • 38. For-Each-Unique Algorithm (same figure and code as slide 37). Added: JIT’d Binning.
  • 39. For-Each-Unique Algorithm (same figure and code as slide 37). JIT’d Binning; full flexibility for BatchedRendererServices. 1st Pass: texture("r.tex", "mirror", …);
  • 40. For-Each-Unique Algorithm (same figure and code as slide 37). JIT’d Binning; full flexibility for BatchedRendererServices. 1st Pass: texture("r.tex", "mirror", …); 2nd Pass: texture("g.tex", "clamp", …);
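The binning step on slides 37 to 40 can be sketched as a loop that groups active lanes by identical argument values, producing one pass per unique combination. The deck describes the binning as JIT-generated; the version below is ordinary C++ written for illustration, and `TextureArgs`, `bin_unique`, and the width of 16 are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

constexpr int kWidth = 16;   // assumed lane count
using Mask = std::uint32_t;  // one bit per lane

// Per-lane arguments that the callee needs to be uniform within one call.
struct TextureArgs {
    std::string file;
    std::string wrap;
};

inline bool operator==(const TextureArgs& a, const TextureArgs& b) {
    return a.file == b.file && a.wrap == b.wrap;
}

// Returns each unique argument combination together with the mask of lanes
// that share it. A caller would issue one texture call per entry.
inline std::vector<std::pair<TextureArgs, Mask>>
bin_unique(const TextureArgs (&args)[kWidth], Mask active) {
    std::vector<std::pair<TextureArgs, Mask>> bins;
    Mask remaining = active;
    for (int lead = 0; lead < kWidth; ++lead) {
        if (!((remaining >> lead) & 1u)) continue;  // inactive or already binned
        Mask same = 0;
        for (int lane = lead; lane < kWidth; ++lane)
            if (((remaining >> lane) & 1u) && args[lane] == args[lead])
                same |= (Mask{1} << lane);
        bins.emplace_back(args[lead], same);
        remaining &= ~same;
    }
    return bins;
}
```

With the slide's shader, layers 1 and 3 both map to ("r.tex", "mirror") and layers 2 and 4 to ("g.tex", "clamp"), so a batch containing all four layers needs two passes rather than four.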
  • 41. Components in SIMD OSL Render-time. Render Time Optimization with LLVM* JIT (Just In Time Compilation) produces optimized x86. Components: Wide Library; Divergent Control Flows; Vectorized IR Generation; “For-each-unique” algorithm; SIMD OSL built-ins. *Other names and brands may be claimed as the property of others.
  • 42. SIMD OSL’s Perlin Noise. Slide labels: Scalar computation with scalar data types; Block vectorization with intrinsics; Explicit outer loop vectorization (Intel® C++ Compiler) (Clang 5+).
      template<int WidthT>
      void operator() (MaskedAccessor<float, WidthT> wresult,
                       ConstWideAccessor<Vec3, WidthT> wp) const
      {
          #pragma forceinline recursive
          {
              #pragma omp simd simdlen(WidthT)
              for (int l = 0; l < WidthT; ++l) {
                  Vec3 p = wp[l];
                  float perlinResult;
                  HashScalar h;
                  perlin_scalar(perlinResult, h, p.x, p.y, p.z);
                  float scaledResult = 0.5f * (perlinResult + 1.0f);
                  wresult[l] = scaledResult;
              }
          }
      }

      inline void operator() (float &result, const Vec3 &p) const
      {
          HashScalar h;
          perlin(result, h, p.x, p.y, p.z);
          result = 0.5f * (result + 1.0f);
      }
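The explicit outer-loop pattern on slide 42 can be tried in isolation with a stand-in kernel. In this sketch `signed_noise_standin` is a hypothetical placeholder for a scalar noise function returning values in [-1, 1]; it is not Perlin noise, and `noise01_wide` and the width of 16 are likewise assumptions. The `#pragma omp simd` line is a request to the compiler and is ignored when OpenMP SIMD support is not enabled.

```cpp
#include <cmath>

constexpr int kWidth = 16;  // assumed lane count

// Hypothetical stand-in for a scalar noise kernel with output in [-1, 1].
inline float signed_noise_standin(float x, float y, float z) {
    return std::sin(x) * std::cos(y) * std::cos(z);
}

// Outer-loop vectorization: scalar code in the body, one iteration per lane,
// with the remap from [-1, 1] to [0, 1] applied as on the slide.
inline void noise01_wide(const float* px, const float* py, const float* pz,
                         float* result) {
    #pragma omp simd simdlen(kWidth)
    for (int lane = 0; lane < kWidth; ++lane) {
        float n = signed_noise_standin(px[lane], py[lane], pz[lane]);
        result[lane] = 0.5f * (n + 1.0f);
    }
}
```

Writing the body with scalar types and letting the compiler vectorize across lanes keeps one source version for every target width, which is the contrast the slide draws with hand-written intrinsics.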
  • 43. OSL Microbenchmarks: Speedup of SIMD AVX-512 OSL over Scalar OSL. Chart: speedup per built-in, axis ticks from 0.125 to 16. Functions benchmarked: null, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, atan2, sincos, log, log2, log10, logb, exp, exp2, expm1, pow, erf, erfc, radians, degrees, sqrt, inversesqrt, hypot, abs, fabs, sign, floor, ceil, round, trunc, mod, min, max, clamp, mix, isnan, isfinite, select, dot, cross, length, distance, normalize, reflect, fresnel, rotate, transform, transform_matrix, matrix_object_camera, determinant, transpose, linearstep, smooth_linearstep, noise_perlin, noise_cell, noise_simplex, noise_gabor, pnoise_perlin, pnoise_cell, pnoise_gabor, spline_bezier, spline_bspline, spline_catmull-rom, spline_hermite, spline_linear, spline_constant. Average: 6.9x. Geomean: 6.14x. Run on 48 threads on Intel(R) Xeon(R) Platinum 8260L CPU @2.30GHz (config 2). For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 44. OSL SIMD Performance at Maximum Batch Utilization. Chart: speedup at max batch size, axis from 0 to 16. leopard 5.2x, concrete 6x, diamond 10x, oak 12x, marble 15x. OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @2.40 GHz (config 1). *Other names and brands may be claimed as the property of others. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 45. SIMD OSL: Intel® AVX-512 vs AVX2. [Bar chart: speedup (axis 0.0 to 2.0) per shader for leopard, concrete, diamond plate, oak, marble, thread and donut.] Bar labels as extracted, not matched to shaders: 1.6x, 1.9x, 1.1x, 1.3x, 1.3x, 1.4x, 1.8x. OSL’s testshade running Intel® AVX-512 and AVX2 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (config 1).
  • 46. Evolution of SIMD OSL: Proof of Concept to Production, 2016‒2019. Stages: SIMD OSL Library; SIMD OSL Framework; SIMD OSL Performance. Milestones: Intel® AVX-512, AVX2, AVX-specific libraries; masking and scatter-gather; 17k+ tests; improved performance on built-in functions; compiler + platform support; reduction in JIT time; coverage for built-in function variants; handling treacherous control flows; noise functions with options; LLVM optimization passes to improve AVX2.
  • 47. Open Shading Language (https://github.com/imageworks/OpenShadingLanguage) and SIMD Open Shading Language (https://gitlab.com/intel-osl/BatchedOSL)
  • 49. Intel® AVX-512 Performance vs Batch Utilization: performance gain with increased batch utilization. [Line chart: speedup from batching (axis 0 to 15) against batch utilization from batch 1 to batch 16, one line per shader.] Labeled speedups: marble 15x, oak 12x, diamond 10x, concrete 6x, leopard 5.2x. OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (config 1).
  • 50. RenderMan 22.4 Shading Speedup with SIMD OSL. [Bar chart, CLX8260L (24c, 2.3GHz).] Bonnie’s room 1.26x, Fillmore 1.37x, Bonnie 2.06x. Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2). Image © Disney/Pixar
  • 51. RenderMan 22.4 Overall Rendering Speedup with SIMD OSL. [Bar chart, CLX8260L (24c, 2.3GHz).] Bonnie’s room 1.11x, Fillmore 1.17x, Bonnie 1.27x. Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2).
  • 52. Bonnie • Real production character with 55 shader networks • 85663 shader operations on 67680 symbols (post-optimization) • 66.64% batch utilization • 2.05x shading speedup. [Chart labels: Single Point, Batched, Amdahl’s Law.] Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2).
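The slide names Amdahl's Law without showing the calculation. As a rough illustration, and assuming the standard form of the law with shading as the only accelerated part of the render, the sketch below relates a shading speedup to an overall speedup and back. The function names are made up for this example, and the numbers the authors actually used are not shown in the deck.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: overall speedup when a fraction f of the run time
// is accelerated by a factor s and the rest is unchanged.
inline double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Inverse: the fraction f implied by an observed overall speedup
// and a known speedup s of the accelerated part (s must not be 1).
inline double implied_fraction(double overall, double s) {
    assert(overall > 0.0 && s > 0.0 && s != 1.0);
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s);
}
```

Plugging in Bonnie's 2.05x shading speedup from this slide and the 1.27x overall speedup from slide 51, `implied_fraction(1.27, 2.05)` comes out a little over 0.4. Under these assumptions, shading accounted for roughly two fifths of the original render time, which is why doubling shading speed yields only about a quarter more overall throughput.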
  • 53. Performance Progression: 3 factors at play ● Efficiency of the generated vectorized shader code ● Effective vectorization of the shading interface ● How effectively the renderer takes advantage of the vectorized shading language
  • 54. Efficiency in the shading language: Most effort up to now has gone into the quality of the shader code generation ● Masked control flow for vectorized execution ● Optimization of noise and math functions ● Optimization of texture calls
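Masked control flow is easiest to see on a small example. The sketch below is a generic illustration of the technique, not the code OSL generates: `Batch` and `masked_if_else` are hypothetical, and the batch width of 16 is assumed to match the number of floats in an AVX-512 register. Both sides of the branch run over the batch, and a per-lane mask decides which lanes each side may write.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWidth = 16;  // lanes per batch (assumed AVX-512 float width)

// Hypothetical batch of floats; not an OSL type.
struct Batch {
    float lane[kWidth];
};

// Batched form of:  if (x < 0) y = -x; else y = 2 * x;
// Both branches run over the batch; a per-lane mask selects which lanes
// each branch is allowed to write. Inactive lanes are never written.
inline void masked_if_else(Batch& y, const Batch& x, std::uint32_t active) {
    // Evaluate the condition for every lane and pack it into a bit mask.
    std::uint32_t cond = 0;
    for (int l = 0; l < kWidth; ++l) {
        if (x.lane[l] < 0.0f) cond |= (1u << l);
    }
    const std::uint32_t then_mask = active & cond;
    const std::uint32_t else_mask = active & ~cond;

    if (then_mask) {  // skip the branch entirely when no lane needs it
        for (int l = 0; l < kWidth; ++l)
            if (then_mask & (1u << l)) y.lane[l] = -x.lane[l];
    }
    if (else_mask) {
        for (int l = 0; l < kWidth; ++l)
            if (else_mask & (1u << l)) y.lane[l] = 2.0f * x.lane[l];
    }
}
```

When lanes in a batch disagree on the condition, the batch pays for both branches, which is one reason divergent shaders gain less from batching than coherent ones.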
  • 55. Efficiency in the Shading API: The shading language calls into the renderer ● To access data: primvars, transforms, etc. ● To compute things: texture interpolation, ray tracing, etc. ● To return values ● All of the above is nicely vectorized (batched) ● We call across the API boundaries fewer times
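To make "fewer calls across the API boundary" concrete, here is a toy comparison. `ToyRenderer` and both of its methods are invented for this sketch and are not OSL's renderer callbacks; the point is only that a batched entry point serves 16 shading points with one call where the single-point entry point needs 16.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;

// Hypothetical renderer-side interface; these are not OSL's real callbacks.
struct ToyRenderer {
    int calls = 0;  // how many times the shader crossed the API boundary

    // Single-point lookup: one call per shading point.
    float get_attribute(int point_id) {
        ++calls;
        return 0.25f * static_cast<float>(point_id);
    }

    // Batched lookup: one call fills every active lane of `out`.
    void get_attribute_batch(const std::array<int, kLanes>& point_ids,
                             std::uint32_t mask,
                             std::array<float, kLanes>& out) {
        ++calls;
        for (int l = 0; l < kLanes; ++l) {
            if (mask & (1u << l)) {
                out[l] = 0.25f * static_cast<float>(point_ids[l]);
            }
        }
    }
};
```

Both paths return the same values; only the number of boundary crossings differs, so per-call overhead is amortized over the batch.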
  • 56. Efficiency in the Renderer: We started with a vectorized renderer ● RIS is one of the few vectorized renderers in the industry that works on ray batches ● It turns out that our batch granularity does not enable effective vectorization ● The results we see today are a fraction of the benefit we could get
  • 57. Efficiency in the Renderer. What is efficient? ● Portions of the renderer where execution is coherent ● Displacement shading ● Camera ray hits. What is inefficient? ● Indirect illumination ● Deep bounces
  • 58. Efficiency in the Renderer. [Chart: % of batches submitted (axis 0 to 80), broken down by points per batch (1 point to 16 points), for 1, 2, 3, 5 and 9 bounces.] Two labeled series across 1, 2, 3, 5 and 9 bounces: 7.3%, 13.9%, 18.9%, 22.3%, 25.4% (rising) and 76.6%, 67.1%, 60.9%, 56.5%, 52.6% (falling). Pixar’s RenderMan* 22.dev running on all 40 threads of Intel® Xeon® Gold 6148 @ 2.40GHz (config 4). *Other names and brands may be claimed as the property of others.
  • 59. Efficiency in the Renderer: How do we currently accommodate low occupancy? ● We switch over to single-point evaluation for small batches ● We use a heuristic to determine when to switch ● A threshold of 4 active lanes tends to be a decent starting point ● This may change as more optimizations are done ● However, it would be best to guarantee high SIMD occupancy
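The switch-over heuristic described on this slide could look roughly like the sketch below. This is an illustration only: the names are invented, the batch width of 16 is assumed, and reading "a threshold of 4 active lanes" as "fewer than 4 active lanes go to the single-point path" is an assumption, since the slide does not say which side of the threshold a 4-lane batch falls on.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr int kBatchWidth = 16;     // lanes per batch (assumed)
constexpr int kMinActiveLanes = 4;  // starting threshold quoted on the slide

enum class ShadePath { SinglePoint, Batched };

// Number of active lanes in a batch, one bit per lane in the low 16 bits.
inline int active_lanes(std::uint32_t mask) {
    return static_cast<int>(std::bitset<32>(mask).count());
}

// Fraction of the batch that carries real work.
inline double occupancy(std::uint32_t mask) {
    return static_cast<double>(active_lanes(mask)) / kBatchWidth;
}

// Assumed policy: batches below the threshold are shaded one point at a time.
inline ShadePath choose_path(std::uint32_t mask) {
    return active_lanes(mask) < kMinActiveLanes ? ShadePath::SinglePoint
                                                : ShadePath::Batched;
}
```

Keeping the threshold in one constant makes it easy to retune as the batched code path gets faster, which the slide anticipates.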
  • 60. Towards a new Rendering Architecture: Batches are currently determined by the bucket size used for rendering ● Computational workload is uneven throughout the image ● Larger buckets give more points and higher occupancy ● Larger buckets mean one thread may be stuck rendering a single heavy bucket for a long time, reducing thread scaling ● A decent bucket size for good thread load balancing is 8x8 or 16x16 ● This is a batch size of 64-256 ● We would need a batch size of at least 2k-8k
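The gap between the two batch sizes on this slide can be checked with a few lines of arithmetic. The sketch assumes square buckets and one shading point per pixel, which is what the slide's own 8x8 to 64 and 16x16 to 256 figures imply; the helper names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Points in a square bucket, assuming one shading point per pixel.
constexpr int bucket_points(int side) { return side * side; }

// Smallest square bucket side that yields at least `points` shading points.
inline int min_bucket_side(int points) {
    assert(points > 0);
    return static_cast<int>(std::ceil(std::sqrt(static_cast<double>(points))));
}
```

Reading "2k-8k" as 2000 to 8000 points, `min_bucket_side` gives buckets of about 45x45 to 90x90 pixels, several times larger per side than the 8x8 or 16x16 buckets that balance load well across threads. That mismatch is the motivation for decoupling batch formation from buckets.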
  • 61. Towards a new Rendering Architecture: Different options at hand ● Wavefront rendering ● Shading queues ● Non image-space decomposition scheduling ● The new architecture is being implemented in Pixar’s RenderMan® XPU ● Stay tuned
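The slide only names the options, and the deck does not describe how RenderMan XPU implements them. As a generic illustration of the shading-queue idea, the hypothetical `ShadingQueue` below collects points that need the same shader, wherever in the image they come from, and submits them only in full batches of 16, with a partial batch at the final flush.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBatch = 16;  // lanes per batch (assumed)

// Hypothetical shading queue: points that need the same shader accumulate
// here and are submitted in full batches, except for a final partial flush.
class ShadingQueue {
public:
    // Add one point; submit a batch as soon as kBatch points are waiting.
    void push(int point_id) {
        pending_.push_back(point_id);
        if (pending_.size() == kBatch) flush();
    }

    // Submit whatever is waiting (possibly a partial batch).
    void flush() {
        if (pending_.empty()) return;
        // A real renderer would run the batched shader here; this sketch
        // only records how many points the batch contained.
        batch_sizes_.push_back(pending_.size());
        pending_.clear();
    }

    const std::vector<std::size_t>& batch_sizes() const { return batch_sizes_; }

private:
    std::vector<int> pending_;              // points waiting to be shaded
    std::vector<std::size_t> batch_sizes_;  // sizes of submitted batches
};
```

With 40 points pushed and one final flush, the queue submits batches of 16, 16 and 8 points, so occupancy no longer depends on how many points a single bucket happens to produce.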
  • 62. OSL Shaders
    • Concrete - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl
      • Modifications:
          < float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0));
          ---
          > float grain=noise("gabor",p,8);
    • Leopard - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl
    • Diamond plate - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl
    • Thread - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl
    • Donut - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl
    • Oak - https://renderman.pixar.com/forum/download.php
      • Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/oak.osl
    • Marble - https://renderman.pixar.com/forum/download.php
      • Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/marble.osl
    *Other names and brands may be claimed as the property of others.
  • 63. Configurations
    Setting | Config 1 | Config 2 | Config 3 | Config 4
    Model name | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
    Core(s) per socket | 24 | 24 | 18 | 20
    Socket(s) | 2 | 2 | 2 | 2
    Memory | 192GB, DDR4-2933 MHz (12 x 16GB) | 192GB, DDR4-2933 MHz (12 x 16GB) | 128GB, DDR4-2400 MHz (8 x 16GB) | 192GB, DDR4-2666 MHz (12 x 16GB)
    CPU Power Policy | Performance | Performance | Performance | Powersave
    Hyperthreading | Disabled | Enabled | Enabled | Enabled
    Turbo Boost Tech | Enabled | Enabled | Enabled | Enabled
    L1d cache | 32K | 32K | 32K | 32K
    L1i cache | 32K | 32K | 32K | 32K
    L2 cache | 1024K | 1024K | 256K | 1024K
    L3 cache | 36608K | 33792K | 46080K | 28160K
    Operating System | Fedora release 27 (Twenty Seven) | CentOS Linux release 7.6.1810 (Core) | Red Hat Enterprise Linux Server release 7.2 (Maipo) | CentOS Linux release 7.3.1611 (Core)
    Bios Version | SE5C620.86B.0D.01.0286.011120190816 | SE5C620.86B.0D.01.0395.022720191340 | GRRFSDP1.86B0271.R00.1510301446 | SE5C620.86B.01.00.0412.020920172159