RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions | SIGGRAPH 2019 Technical Sessions

From RenderMan® 22.0 to Next-Gen RenderMan XPU and Beyond: Role of Open Shading Language (OSL) with Intel® Advanced Vector Extensions (Intel® AVX-512)
Presenters: Steena Monteiro (Intel) and Max Liani (Pixar Animation Studios)
Contributors: Alex M. Wells (Intel), Steena Monteiro (Intel), Louis Feng (Intel), Max Liani (Pixar Animation Studios), Stephen Friedman (Pixar Animation Studios), Larry Gritz (Sony Pictures Imageworks)

• This document contains information on products, services and/or processes in development. All information provided here is subject to change without
notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
intel.com.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks
• Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes.
Any differences in your system hardware, software or configuration may affect your actual performance.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm
whether referenced data are accurate.
• Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer
to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
• Intel, Xeon and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others
• © Intel Corporation.
Legal Disclaimers and Optimization Notices
2

Shading in Physically Based Rendering
3
Image credit Sony Pictures Imageworks

Shading Network
• Multiple reusable shading nodes
• Connect nodes to define complex materials
• Production shading networks can grow very large, to hundreds or thousands of nodes.
4

C++ Shader Limitations
• Lack of context at compile time
  • Input parameters unknown
  • Geometry being shaded unknown
  • Mode of shading unknown
  • Surrounding shading network unknown
• Branchy testing required
• Lack of portability
• Requires “Performance Ninjas”
Image Credit: Ninja Working AT Desk from Vector.me (by Hector Gomez)
5

Open Shading Language
• Developed by Sony Pictures Imageworks*
• C-like DSL for programmable shading
• API to connect shaders into networks
• Open source
• http://github.com/imageworks/OpenShadingLanguage
• Sci-Tech Award* in 2017
Logo owned by Academy of Motion Picture Arts and Sciences for Infobox
*Other names and brands may be claimed as the property of others.
6

Poster images (c) Sony Pictures*, Paramount*, Warner
Brothers*, Disney*, Fox*, Universal*
7

Example OSL Shader
shader marble (color Cin = .5,
               float freq = 1.0,
               output color Cout = 0)
{
    float sum = 0;
    float freqVal = freq;
    point Pshad = transform ("object", P);
    for (int i = 0; i < 6; i++)
    {
        sum = sum + 1/freqVal * abs(.5 - noise( 4 * freqVal * Pshad)) ;
        freqVal = 2 * freqVal;
    }
    Cout = Cin * sum;
}
Shader
Globals
(input set by renderer)
Library Calls
8

Motivation for SIMD Open Shading Language
In its native form, OSL is unable to leverage Intel® Advanced Vector Extensions (Intel® AVX-512) on Intel® Xeon® processors
Intel has been leading the re-architecture of OSL since 2016
Image © Disney/Pixar
9

OSL Framework
[Diagram: the OSL pipeline from shader source to execution]
• oslc (offline compiler) translates a shader written in OSL into intermediate OSO (instructions + operands).
• Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): scene management, ray tracing/path tracing, light integration.
• OSL runtime: builds the shading network, performs render-time optimization with LLVM* JIT (just-in-time compilation) to produce optimized x86-64, and executes the shading network (per point).
• The runtime uses callbacks into the renderer and pre-compiled library functions; outputs are retrieved through QueryOutputs.

SIMD OSL’s Batched Interface
[Diagram: how the renderer talks to the shading system through the single-point interface and the new “batched” interface]
• Single-point interface: submit a single point with execute(ShaderGlobals,…); query results with symbol_address(…).
• New “batched” interface: submit a batch of points with execute_batch(ShaderGlobalsBatch, …); query a batch of results with Wide<T>(symbol_address).
• ShaderGlobalsBatch holds:
  • Uniform: context *’s, raytype, …
  • Queue of varying: surface position, incident ray, surface normal, …
11
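To make the uniform/varying split concrete, here is a minimal C++ sketch of a batch of shader globals. It is not the OSL API: the type names (UniformGlobalsSketch, VaryingGlobalsSketch, ShaderGlobalsBatchSketch), the fields, and the 16-lane width are assumptions chosen for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kBatchWidth = 16;  // assumed width (16 floats fit one 512-bit register)

// One array per component, one slot per lane (SOA layout).
template <typename T>
using Lanes = std::array<T, kBatchWidth>;

// Hypothetical: values shared by every point in the batch, stored once.
struct UniformGlobalsSketch {
    const void* renderer_context = nullptr;
    int raytype = 0;
};

// Hypothetical: values that differ per point, stored per lane.
struct VaryingGlobalsSketch {
    Lanes<float> Px{}, Py{}, Pz{};  // surface position
    Lanes<float> Nx{}, Ny{}, Nz{};  // surface normal
};

struct ShaderGlobalsBatchSketch {
    UniformGlobalsSketch uniform;
    VaryingGlobalsSketch varying;
    int size = 0;  // number of points queued so far

    // Queue one point; returns false when the batch is already full.
    bool push_point(float px, float py, float pz,
                    float nx, float ny, float nz) {
        if (size == kBatchWidth) return false;
        varying.Px[size] = px; varying.Py[size] = py; varying.Pz[size] = pz;
        varying.Nx[size] = nx; varying.Ny[size] = ny; varying.Nz[size] = nz;
        ++size;
        return true;
    }

    // Bit i is set when lane i holds a queued point.
    uint32_t active_mask() const {
        assert(size >= 0 && size <= kBatchWidth);
        return (uint32_t{1} << size) - 1u;
    }
};
```

A renderer following this shape would fill the queue with push_point, then hand the whole batch plus its active mask to the batched execute call, instead of calling the single-point execute once per point.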

SIMD OSL Framework
[Diagram: the OSL pipeline with the SIMD OSL runtime]
• Renderer (Pixar’s RenderMan*, Autodesk Arnold*, Blender*): scene management, ray tracing/path tracing, light integration.
• SIMD OSL runtime: render-time optimization with LLVM* Wide JIT produces code optimized for Intel® AVX-512, AVX2, or AVX, and executes the shading network (per point).
• Pre-compiled library functions are provided separately for Intel® AVX-512, Intel® AVX2, and Intel® AVX.
• The runtime uses callbacks into the renderer; outputs are retrieved through QueryOutputs.
12

Components in SIMD OSL Render-Time Optimization with LLVM* JIT
• Wide Library
Output: optimized x86-64
Image credit: Wizard of Oz Castle Clipart, https://www.clipart.email/clipart/wizard-of-oz-castle-clipart-18891.html
13

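void my_callback(void *wS, void *wM, void *wV, void *wVS, void *wVT,
                 unsigned int mask_value)
{
    Mask mask (mask_value);
    ASSERT(mask.any_on());
    // Wrap the raw SOA pointers in typed accessors.
    // (The raw pointer is named wV so it does not clash with the wVec accessor.)
    Wide<const float> wScale (wS);
    Wide<const Vec3> wVec (wV);
    Wide<const Matrix44> wMat (wM);
    Masked<Vec3> wVT_result (wVT, mask);
    Masked<Vec3> wVS_result (wVS, mask);
    for(int lane = 0; lane < __OSL_WIDTH; ++lane) {
        // Reads extract one lane of the SOA as an ordinary value.
        Vec3 V = wVec[lane];
        float F = wScale[lane];
        Matrix44 M = wMat[lane];
        // Writes go through a proxy and are skipped for masked-off lanes.
        wVS_result[lane] = V*F;
        wVT_result[lane] = transform(M,V);
    }
}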
Accessors
transparent
AOS view of SOA
SIMD OSL’s Wide Library
14

Callouts on the my_callback example:
• Accessors give a transparent AOS view of SOA data
• Reading a lane extracts data from a lane of the SOA
• Array subscript returns a proxy object to that lane
• The proxy skips assignment if the lane is masked off
17
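The accessor idea can be sketched in a few lines of standalone C++. This is an illustration under assumptions, not the SIMD OSL Wide library: WideVec3Storage, WideVec3View, MaskedVec3View, MaskedLaneProxy, and scale_batch are hypothetical names, and the batch width of 16 is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWidth = 16;  // assumed batch width

struct Vec3 { float x, y, z; };

// SOA storage: each component lives in its own contiguous array.
struct WideVec3Storage {
    std::array<float, kWidth> x{}, y{}, z{};
};

// Read-only accessor: operator[] gathers one lane into an ordinary Vec3.
class WideVec3View {
    const WideVec3Storage& data_;
public:
    explicit WideVec3View(const WideVec3Storage& d) : data_(d) {}
    Vec3 operator[](int lane) const {
        assert(lane >= 0 && lane < kWidth);
        return Vec3{data_.x[lane], data_.y[lane], data_.z[lane]};
    }
};

// Proxy for one lane: assignment scatters into the SOA arrays,
// but only when the lane is active.
class MaskedLaneProxy {
    WideVec3Storage& data_;
    int lane_;
    bool active_;
public:
    MaskedLaneProxy(WideVec3Storage& d, int lane, bool active)
        : data_(d), lane_(lane), active_(active) {}
    void operator=(const Vec3& v) const {
        if (!active_) return;  // masked-off lane: skip the store
        data_.x[lane_] = v.x;
        data_.y[lane_] = v.y;
        data_.z[lane_] = v.z;
    }
};

// Writable accessor: operator[] returns a proxy, not a reference.
class MaskedVec3View {
    WideVec3Storage& data_;
    uint32_t mask_;
public:
    MaskedVec3View(WideVec3Storage& d, uint32_t mask) : data_(d), mask_(mask) {}
    MaskedLaneProxy operator[](int lane) {
        assert(lane >= 0 && lane < kWidth);
        return MaskedLaneProxy(data_, lane, ((mask_ >> lane) & 1u) != 0);
    }
};

// Usage: the loop body reads like scalar AOS code.
void scale_batch(const WideVec3Storage& in, float scale,
                 WideVec3Storage& out, uint32_t mask) {
    WideVec3View wIn(in);
    MaskedVec3View wOut(out, mask);
    for (int lane = 0; lane < kWidth; ++lane) {
        Vec3 v = wIn[lane];
        wOut[lane] = Vec3{v.x * scale, v.y * scale, v.z * scale};
    }
}
```

Returning a proxy rather than a reference is what lets the store be conditional: there is no Vec3 object in memory to refer to, only three separate arrays, so the proxy carries the lane index and decides at assignment time whether to write.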

Components in Render-Time Optimization with LLVM* JIT
• Wide Library
• Divergent Control Flows
Output: optimized x86-64
18

Divergent Control Flows
if (x > 0.5)
{
    ...
    if (y > 0.5)
    {
        …
        if (powB > 0.23)
        {
            …
        }
        else
        {
            …
        }
    } //y
} //x
Stack of masks
Effective mask (result of combining stack)
19

The following builds step through the nested conditionals above, updating the stack of masks and the effective mask at each step.
Entering if (x > 0.5): PUSH
20

Entering if (y > 0.5): PUSH
21

Entering if (powB > 0.23): PUSH
22

Leaving the if (powB > 0.23) block: POP
23

Entering else: NEGATE, PUSH
24

Leaving else: POP
25

Leaving if (y > 0.5): POP
26

Leaving if (x > 0.5): POP
27
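The PUSH / POP / NEGATE sequence can be modeled with a small stack of lane masks. The sketch below is an assumption-laden illustration, not SIMD OSL's code generator: MaskStack and its methods are hypothetical names, and the effective mask is assumed to be the bitwise AND of the entry mask with every condition mask on the stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using LaneMask = uint32_t;  // one bit per lane

// Hypothetical model of a mask stack for nested conditionals.
class MaskStack {
    LaneMask entry_;               // lanes active when the shader was entered
    std::vector<LaneMask> stack_;  // one condition mask per open conditional
public:
    explicit MaskStack(LaneMask entry) : entry_(entry) {}

    // Effective mask: a lane runs only if every enclosing condition holds.
    LaneMask effective() const {
        LaneMask m = entry_;
        for (LaneMask cond : stack_) m &= cond;
        return m;
    }

    // Entering an "if": push the lanes where the condition is true.
    void push(LaneMask cond) { stack_.push_back(cond); }

    // Leaving a block: pop and return its condition mask.
    LaneMask pop() {
        assert(!stack_.empty());
        LaneMask top = stack_.back();
        stack_.pop_back();
        return top;
    }

    // Switching from the "if" block to its "else": pop, negate, push.
    void enter_else() {
        LaneMask cond = pop();
        push(~cond);
    }
};
```

With four active lanes and conditions that narrow them to three, two, and one lane, the effective mask shrinks at each push, flips to the complementary lane on enter_else, and returns to the entry mask after the final pop.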

Components in Render-Time Optimization with LLVM* JIT
• Wide Library
• Divergent Control Flow
• Vectorized IR Generation
Output: optimized x86-64
28

General LLVM Code Flow for OSL Operations
OSL → retrieve symbols for operands → emit LLVM-defined operations OR call appropriate functions → store result
29

What changes in SIMD OSL
OSL → operands → load values, initialize values → emit LLVM-defined operations OR call appropriate functions → store result
Each operation falls into one of four cases:
• Operands → Uniform, Results → Uniform
• Operands → Uniform, Results → Varying
• Operands → Varying, Results → Uniform
• Operands → Varying, Results → Varying
30

31
SIMD OSL: Operands → Uniform, Results → Uniform
• Call uniform function
• Store result

32
SIMD OSL: Operands → Uniform, Results → Varying
• Call uniform function
• Widen result
• Store result

33
SIMD OSL: Operands → Varying, Results → Varying
• Add address for results to arguments
• Add effective mask to arguments
• Call varying function

34
SIMD OSL: Operands → Uniform and Varying, Results → Varying
• Allocate a varying temp
• Widen uniform operands and store to varying temp
• Add address for results to arguments
• Add effective mask to all arguments
• Call varying function

35
Operands → Varying, Results → Uniform
• Unreachable
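The cases above can be mimicked directly in C++ to show what "widen" and "add effective mask to arguments" mean in practice. This is a hand-written analogy, not the LLVM IR that SIMD OSL emits: mul_uniform, mul_varying, widen, and the two emit_* helpers are hypothetical names, and the argument order (result address first, mask last) and the masked store in the uniform-to-varying case are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWidth = 16;  // assumed batch width
using WideFloat = std::array<float, kWidth>;

// Uniform version: one value in, one value out.
float mul_uniform(float a, float b) { return a * b; }

// Varying version: takes the result address, the wide operands, and the
// effective mask; only active lanes are written.
void mul_varying(WideFloat* result, const WideFloat* a, const WideFloat* b,
                 uint32_t mask) {
    for (int lane = 0; lane < kWidth; ++lane) {
        if ((mask >> lane) & 1u) (*result)[lane] = (*a)[lane] * (*b)[lane];
    }
}

// Widen: replicate a uniform value across all lanes.
WideFloat widen(float v) {
    WideFloat w;
    w.fill(v);
    return w;
}

// Operands uniform, result varying: call the uniform function once,
// widen the result, then store it (here only into active lanes).
void emit_uniform_to_varying(WideFloat& result, float a, float b,
                             uint32_t mask) {
    const WideFloat wide = widen(mul_uniform(a, b));
    for (int lane = 0; lane < kWidth; ++lane) {
        if ((mask >> lane) & 1u) result[lane] = wide[lane];
    }
}

// Operands mixed, result varying: widen the uniform operand into a
// varying temp, then call the varying function with the mask.
void emit_mixed_to_varying(WideFloat& result, float uniform_a,
                           const WideFloat& varying_b, uint32_t mask) {
    assert(mask != 0u);  // assumed: codegen is not reached with no active lanes
    const WideFloat temp = widen(uniform_a);
    mul_varying(&result, &temp, &varying_b, mask);
}
```

The uniform-to-varying path does the arithmetic once instead of once per lane, which is why keeping track of which symbols are uniform matters for the generated code.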

Components in Render-Time Optimization with LLVM* JIT
• Wide Library
• Divergent Control Flow
• Vectorized IR Generation
• “For-each-unique” algorithm
Output: optimized x86-64
36

For-Each-Unique Algorithm
if (layer == 1) file = "r.tex";
if (layer == 2) file = "g.tex";
wrap_mode = (layer%2==0) ? "clamp" : "mirror";
texture(file, u, v, "wrap", wrap_mode);
[Figure: per-lane values of layer, file, mask, and wrap for one batch; layer values shown: 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1 4]
37

[Figure: per-lane values of layer, file, mask, and wrap for one batch; layer values shown: 3 3 1 2 1 1 2 1 2 2 2 2 3 3 3 1 4]
• JIT’d binning
• Full flexibility
• BatchedRendererServices
1st pass: texture("r.tex", "mirror", …);
2nd pass: texture("g.tex", "clamp", …);
40
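The binning step can be sketched as a loop that repeatedly picks the first remaining active lane, collects every other active lane holding the same value, and issues one call per unique value. This is a plain C++ illustration rather than the JIT'd code: bin_unique is a hypothetical name, and only the file name is binned here, whereas the slide bins on the file and wrap mode together.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

constexpr int kWidth = 16;  // assumed batch width

// Group active lanes by value. Returns (value, lane mask) pairs in order
// of first appearance; each pair corresponds to one "pass".
std::vector<std::pair<std::string, uint32_t>>
bin_unique(const std::array<std::string, kWidth>& values, uint32_t active) {
    std::vector<std::pair<std::string, uint32_t>> bins;
    // Ignore any bits beyond the batch width.
    uint32_t remaining = active & ((uint32_t{1} << kWidth) - 1u);
    while (remaining != 0u) {
        // Lead lane: the lowest lane not yet assigned to a bin.
        int lead = 0;
        while (((remaining >> lead) & 1u) == 0u) ++lead;

        // Collect all remaining lanes whose value matches the lead lane.
        uint32_t matching = 0u;
        for (int lane = lead; lane < kWidth; ++lane) {
            if (((remaining >> lane) & 1u) && values[lane] == values[lead]) {
                matching |= (uint32_t{1} << lane);
            }
        }
        bins.emplace_back(values[lead], matching);
        remaining &= ~matching;  // these lanes are done
    }
    return bins;
}
```

Each returned pair would then drive one call into the renderer with a uniform file name and the bin's lane mask, so a texture system that expects a single file per call can still serve a batch whose lanes disagree.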

Components in Render-Time Optimization with LLVM* JIT
• Wide Library
• Divergent Control Flows
• Vectorized IR Generation
• “For-each-unique” algorithm
• SIMD OSL built-ins
Output: optimized x86
41

42
Block Vectorization with intrinsics

// Scalar computation with scalar data types
inline void operator() (float &result, const Vec3 &p) const
{
    HashScalar h;
    perlin(result, h, p.x, p.y, p.z);
    result = 0.5f * (result + 1.0f);
}

// Explicit outer loop vectorization (Intel® C++ Compiler) (Clang 5+)
template<int WidthT> void operator() (MaskedAccessor<float, WidthT> wresult,
                                      ConstWideAccessor<Vec3, WidthT> wp) const {
    #pragma forceinline recursive
    {
        #pragma omp simd simdlen(WidthT)
        for(int l=0; l< WidthT; ++l) {
            Vec3 p = wp[l];
            float perlinResult;
            HashScalar h;
            perlin_scalar(perlinResult, h, p.x, p.y, p.z);
            float scaledResult = 0.5f * (perlinResult + 1.0f);
            wresult[l] = scaledResult;
        }
    }
}
SIMD OSL’s Perlin Noise
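The outer-loop pattern can be tried in isolation with a much simpler kernel. In the sketch below, remap_scalar is a stand-in for the scalar Perlin routine (it only applies the 0.5 * (v + 1) remap from the slide), and remap_wide, WideFloat, and the width of 16 are assumptions. The pragma is guarded so the code still builds as plain scalar C++ when OpenMP is not enabled; whether the loop is actually vectorized depends on the compiler and flags.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 16;  // assumed batch width
using WideFloat = std::array<float, kWidth>;

// Scalar kernel written once, with scalar data types.
inline float remap_scalar(float v) {
    return 0.5f * (v + 1.0f);
}

// Wide wrapper: the loop over lanes is the "outer loop" that the compiler
// is asked to vectorize, turning each scalar operation in the kernel into
// one SIMD operation across lanes.
void remap_wide(WideFloat& result, const WideFloat& input) {
    assert(&result != &input);  // this sketch assumes distinct buffers
#if defined(_OPENMP)
#pragma omp simd simdlen(kWidth)
#endif
    for (int lane = 0; lane < kWidth; ++lane) {
        result[lane] = remap_scalar(input[lane]);
    }
}
```

The appeal of this style is that the scalar kernel stays readable and is shared between the single-point and batched paths, with no intrinsics in the source.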

OSL Microbenchmarks: Speedup of SIMD AVX-512 OSL over Scalar OSL
[Bar chart: speedup per built-in function on a logarithmic axis from 0.125x to 16x]
Functions benchmarked: null, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, atan2, sincos, log, log2, log10, logb, exp, exp2, expm1, pow, erf, erfc, radians, degrees, sqrt, inversesqrt, hypot, abs, fabs, sign, floor, ceil, round, trunc, mod, min, max, clamp, mix, isnan, isfinite, select, dot, cross, length, distance, normalize, reflect, fresnel, rotate, transform, transform_matrix, matrix_object_camera, determinant, transpose, linearstep, smooth_linearstep, noise_perlin, noise_cell, noise_simplex, noise_gabor, pnoise_perlin, pnoise_cell, pnoise_gabor, spline_bezier, spline_bspline, spline_catmull-rom, spline_hermite, spline_linear, spline_constant
48 threads on Intel(R) Xeon(R) Platinum 8260L CPU @2.30GHz (config 2)
Average: 6.9x
Geomean: 6.14x
43
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

OSL SIMD Performance at Maximum Batch Utilization
OSL’s testshade running Intel® AVX-512 on 48 threads of Intel(R) Xeon(R) Platinum 8260L CPU @2.40 GHz (config 1)
[Bar chart: speedup at max batch size per shader, axis from 0 to 16]
Shader      Speedup at max batch size
leopard     5.2x
concrete    6x
diamond     10x
oak         12x
marble      15x
44

SIMD OSL Intel® AVX-512 vs. AVX2
OSL’s testshade running Intel® AVX-512 and AVX2 on 48 threads of
[Bar chart: speedup per shader, axis from 0.0 to 2.0; shaders: leopard, concrete, diamond plate, oak, marble, thread, donut]
Bar labels as extracted (order not matched to shaders): 1.6x, 1.9x, 1.1x, 1.3x, 1.3x, 1.4x, 1.8x
45

Evolution of SIMD OSL: Proof of Concept to Production, 2016‒2019
Stages: SIMD OSL Library, SIMD OSL Framework, SIMD OSL Performance
• Intel® AVX-512, AVX2, AVX-specific libraries
• Masking and scatter-gather
• 17k+ tests
• Improved performance on built-in functions
• Compiler + platform support
• Reduction in JIT time
• Coverage for built-in function variants
• Handling treacherous control flows
• Noise functions with options
• LLVM optimization passes to improve AVX2
46

Open Shading Language: https://github.com/imageworks/OpenShadingLanguage
SIMD Open Shading Language: https://gitlab.com/intel-osl/BatchedOSL
47

Intel® AVX-512 Performance vs. Batch Utilization
Performance gain with increased batch utilization
OSL’s testshade running Intel® AVX-512 on 48 threads of
[Line chart: speedup from batching (axis from 0 to 15) against batch utilization, batch 1 through batch 16, one line per shader]
Shader      Labeled speedup
marble      15x
oak         12x
diamond     10x
concrete    6x
leopard     5.2x
49

RenderMan* 22.4 Shading Speedup with SIMD OSL
[Bar chart: shading speedup per scene, axis from 1 to 2.2; series: CLX8260L (24c, 2.3GHz)]
Scene            Speedup
Bonnie’s room    1.26x
Fillmore         1.37x
Bonnie           2.06x
Run on 48 threads of 24-core Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz (config 2)
50

RenderMan* 22.4’s Overall Rendering Speedup with SIMD OSL
[Bar chart: overall rendering speedup per scene, axis from 1 to 1.3; series: CLX8260L (24c, 2.3GHz)]
Scene            Speedup
Bonnie’s room    1.11x
Fillmore         1.17x
Bonnie           1.27x
51

Bonnie
• Real production character with 55 shader networks
• 85663 shader operations on 67680 symbols (post-optimization)
52
[Comparison of single point vs. batched execution]
• Batch utilization: 66.64%
• Shading speedup: 2.05x
• Amdahl’s Law
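Amdahl's Law explains why a 2.05x shading speedup for Bonnie shows up as only a 1.27x overall rendering speedup: only the shading portion of the render gets faster. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes everything outside shading runs at unchanged speed, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: overall speedup when a fraction of the runtime is
// accelerated by part_speedup and the rest is unchanged.
double amdahl_speedup(double fraction, double part_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0 && part_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup);
}

// Inverse: the fraction of runtime that must have been accelerated to
// turn a given part speedup into a given overall speedup.
double implied_fraction(double overall_speedup, double part_speedup) {
    assert(overall_speedup > 0.0 && part_speedup > 0.0 && part_speedup != 1.0);
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / part_speedup);
}
```

Plugging in the slide figures, implied_fraction(1.27, 2.05) comes out at roughly 0.41, which would mean shading accounts for a bit over 40% of the original render time under this simple model. That figure is inferred here and is not stated in the slides.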

Performance Progression
3 factors at play:
● Efficiency of the generated vectorized shader code
● Effective vectorization of the shading interface
● How effectively the renderer takes advantage of the vectorized shading language
53

Efficiency in the shading language
Most effort up to now has gone into the quality of the shader code generation
● Masked control flow for vectorized execution
● Optimization of noises and math functions
● Optimization of texture calls
54
Image © Disney/Pixar

Efficiency in the Shading API
55
The shading language calls into the renderer
● To access data: primvars, transforms, etc.
● To compute things: texture interpolation, trace rays, etc.
● To return values
● All of the above is nicely vectorized (batched)
● We call across the API boundaries fewer times

Efficiency in the Renderer
56
We started with a vectorized renderer
● RIS is one of the few vectorized renderers in
the industry that works on ray batches
● It turns out that our batch granularity is not
enabling effective vectorization
● Results we see today are a fraction of the
benefit we would get.

What is efficient?
● Portions of the renderer where execution is coherent
● Displacement shading
● Camera ray hits
What is inefficient?
● Indirect illumination
● Deep bounces
57

58
[Bar chart: % of batches submitted (axis from 0 to 80), grouped by path depth (1, 2, 3, 5, and 9 bounces), with one bar per batch occupancy from 1 point to 16 points]
Labeled values, in order of increasing bounce count:
• 7.3%, 13.9%, 18.9%, 22.3%, 25.4%
• 76.6%, 67.1%, 60.9%, 56.5%, 52.6%
Pixar’s RenderMan* 22.dev running on all 40 threads of Intel® Xeon® Gold 6148 @2.4GHz (config 4)

How do we currently accommodate low occupancy?
● We switch over to single-point evaluation for small batches.
● We use a heuristic to determine when to switch.
● A threshold of 4 active lanes tends to be a decent starting point.
● This may change as more optimizations are done.
● However, it would be best to guarantee high SIMD occupancy.
59
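A heuristic of this kind can be as simple as counting the active lanes in a batch and comparing against a threshold. The sketch below is an illustration only, not RenderMan's implementation: ShadePath, count_active_lanes, and choose_path are hypothetical names, and treating exactly 4 active lanes as enough for the batched path is an assumption, since the slide does not say which side of the threshold is inclusive.

```cpp
#include <cassert>
#include <cstdint>

enum class ShadePath { SinglePoint, Batched };

// Starting point suggested on the slide; assumed inclusive for batching.
constexpr int kMinActiveLanes = 4;

// Count set bits by clearing the lowest set bit until none remain.
int count_active_lanes(uint32_t mask) {
    int count = 0;
    while (mask != 0u) {
        mask &= (mask - 1u);
        ++count;
    }
    return count;
}

// Pick the batched path only when enough lanes are active to pay for
// the overhead of wide execution.
ShadePath choose_path(uint32_t active_mask,
                      int min_active = kMinActiveLanes) {
    assert(min_active >= 1);
    return count_active_lanes(active_mask) >= min_active
               ? ShadePath::Batched
               : ShadePath::SinglePoint;
}
```

Keeping the threshold as a parameter reflects the slide's caveat that the best switch point may move as the batched code path gets faster.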

Towards a new Rendering Architecture
Batches are currently determined by the size of bucket rendering
● Computational workload is uneven throughout the image
● Larger buckets give more points and higher occupancy
● Larger buckets mean one thread may be stuck rendering a single heavy bucket for a long time, reducing thread scaling
● A decent bucket size for good thread load balancing is 8x8 or 16x16.
● This is a batch size of 64-256.
● We would need a batch size of at least 2k-8k.
60

Towards a new Rendering Architecture
Different options at hand
● Wavefront rendering
● Shading queues
● Non image-space decomposition scheduling
● The new architecture is being implemented in Pixar’s RenderMan® XPU
● Stay tuned
61

OSL Shaders
• Concrete - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/concrete.osl
  • Modifications:
    < float grain=noise("gabor",p,8,"bandwidth",4,"anisotropic",2,"direction",vector(SandDensity,0,0));
    ---
    > float grain=noise("gabor",p,8);
• Leopard - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/leopard.osl
• Diamond plate - https://github.com/varkenvarken/osl-shaders/blob/master/Shaders/diamondplateshader.osl
• Thread - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/Threads.osl
• Donut - https://github.com/ADN-DevTech/3dsMax-OSL-Shaders/blob/master/OSL/ADN-Experimental/TheDonutShader.osl
• Oak - https://renderman.pixar.com/forum/download.php
  • Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/oak.osl
• Marble - https://renderman.pixar.com/forum/download.php
  • Pixar’s RenderMan* examples ./scenes/pattern/osl/shaders/marble.osl
62

63
Configurations
Setting | Config 1 | Config 2 | Config 3 | Config 4
Model name | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz | Intel(R) Xeon(R) Platinum 8260L CPU @ 2.30GHz | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Core(s) per socket | 24 | 24 | 18 | 20
Socket(s) | 2 | 2 | 2 | 2
Memory | 192GB, DDR4-2933 MHz (12 x 16GB) | 192GB, DDR4-2933 MHz (12 x 16GB) | 128GB, DDR4-2400 MHz (8 x 16GB) | 192GB, DDR4-2666 MHz (12 x 16GB)
CPU Power Policy | Performance | Performance | Performance | Powersave
Hyperthreading | Disabled | Enabled | Enabled | Enabled
Turbo Boost Tech | Enabled | Enabled | Enabled | Enabled
L1d cache | 32K | 32K | 32K | 32K
L1i cache | 32K | 32K | 32K | 32K
L2 cache | 1024K | 1024K | 256K | 1024K
L3 cache | 36608K | 33792K | 46080K | 28160K
Operating System | Fedora release 27 (Twenty Seven) | CentOS Linux release 7.6.1810 (Core) | Red Hat Enterprise Linux Server release 7.2 (Maipo) | CentOS Linux release 7.3.1611 (Core)
Bios Version | SE5C620.86B.0D.01.0286.011120190816 | SE5C620.86B.0D.01.0395.022720191340 | GRRFSDP1.86B0271.R00.1510301446 | SE5C620.86B.01.00.0412.020920172159
