Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
INTRODUCTION 
TO 
MONTE 
CARLO 
RAY 
TRACING 
OPENCL 
IMPLEMENTATION 
TAKAHIRO 
HARADA 
9/2014
RECAP 
OF 
LAST 
SESSION 
y Talked 
about 
theory 
y BRDFs 
‒ Reflec8on, 
Refrac8on, 
Diffuse, 
Microfacet 
y Fresnel 
...
REVIEW 
SIMPLE 
CPU 
MC 
RAY 
TRACER 
Direct 
illumina<on 
for( 
i, 
j 
) 
{ 
ray 
= 
PrimaryRayGen( 
camera, 
pixelLoc 
)...
REVIEW 
SIMPLE 
CPU 
MC 
RAY 
TRACER 
Indirect 
illumina<on 
for( 
i, 
j 
) 
{ 
ray, 
rayState 
= 
PrimaryRayGen( 
camera,...
REVIEW 
SIMPLE 
CPU 
MC 
RAY 
TRACER 
Direct 
illumina<on 
Indirect 
Illumina<on 
for( 
i, 
j 
) 
{ 
ray 
= 
PrimaryRayGen...
COMPARISON 
Direct 
illumina<on 
Indirect 
illumina<on 
6 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
OpenCL 
impleme...
WHY 
OPENCL? 
y Speed! 
‒ GPU 
can 
accelerate 
it 
‒ Why? 
Faster 
is 
the 
beXer 
y OpenCL 
is 
an 
API 
for 
GPU 
com...
PORTING 
TO 
OPENCL 
(FIRST 
ATTEMPT)
THINGS 
TO 
BE 
DONE 
DATA 
STRUCTURE 
y No 
pointer 
in 
OpenCL* 
y Change 
pointer 
to 
index 
y Stored 
in 
a 
flat ...
10 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
OpenCL 
implementa8on 
| 
SEPT 
3, 
2014 
THINGS 
TO 
BE 
DONE 
y Nod...
THINGS 
TO 
BE 
DONE 
DATA 
STRUCTURE 
y Material 
‒ Texture 
entry 
y Buffer<Material> 
material; 
y Buffer<char> 
tex...
THINGS 
TO 
BE 
DONE 
WRITING 
OPENCL 
KERNEL 
CPU 
code 
OpenCL 
kernel 
for( 
i, 
j 
) 
{ 
ray, 
rayState 
= 
PrimaryRay...
IT 
WORKS 
BUT… 
y This 
approach 
is 
simple 
y But 
a 
lot 
of 
issues 
13 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tra...
DRAWBACKS 
PERFORMANCE 
y Likely 
not 
u8lize 
hardware 
efficiently 
‒ SIMD 
divergence 
‒ GPU 
occupancy 
(latency) 
y...
GPU 
ARCHITECTURE
OPENCL 
ON 
CPU 
y Processing 
element 
executes 
Work 
item 
16 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
OpenCL ...
GPU 
VS 
CPU 
y Processing 
element 
executes 
Work 
item 
‒ A 
SIMD 
lane 
(64*) 
y Compute 
unit 
executes 
Work 
grou...
HIGH 
LEVEL 
DESCRIPTION 
y Today’s 
GPU 
is 
similar 
to 
a 
CPU 
(if 
you 
look 
at 
very 
high 
level) 
‒ GPU 
is 
an ...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
WIDE 
SIMD 
EXECUTION 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
br...
SIMD 
DIVERGENCE 
y SIMD 
execu8on 
= 
Program 
counter 
is 
shared 
among 
SIMD 
lanes 
y If 
it 
diverges 
in 
branche...
LATENCY 
y Highest 
latency 
is 
from 
memory 
access 
y CPU 
prevent 
it 
by 
having 
larger 
cache 
‒ Latency 
of 
cac...
GPU 
LATENCY 
HIDING 
y GPU 
can 
execute 
at 
full 
speed 
if 
there 
are 
only 
ALU 
instruc8ons 
(Inst. 
0 
-­‐ 
2) 
*...
GPU 
LATENCY 
HIDING 
y When 
stalled, 
switch 
to 
another 
work 
group 
y Could 
fill 
the 
stall 
with 
instruc8ons 
...
HOW 
MANY 
WGS 
CAN 
WE 
EXECUTE 
PER 
SIMD 
y 10 
wavefronts 
(64WIs) 
per 
SIMD 
is 
the 
max 
y It 
depends 
on 
loca...
ADVICE 
TO 
REDUCE 
VGPR 
PRESSURE 
GET 
MORE 
PERFORMANCE 
FROM 
GPU 
y Don’t 
write 
a 
large 
kernel 
y If 
the 
prog...
VLIW4 
(NI), 
VLIW5 
(EG) 
Scalar 
Architecture 
(SI, 
CI) 
Lane 
1 
Lane 
2 
Lane 
3 
Lane 
4 
33 
| 
Introduc8on 
to 
Mo...
ANOTHER 
SOLUTION 
y Spli{ng 
computa8on 
into 
mul8ple 
kernels 
‒ Primary 
ray 
gen 
kernel 
‒ Trace 
kernel 
‒ Evaluat...
OTHER 
BENEFITS? 
35 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
OpenCL 
implementa8on 
| 
SEPT 
3, 
2014 
Ray 
Gener...
PORTING 
TO 
OPENCL 
(SECOND 
ATTEMPT)
SPLITTING 
KERNELS 
TRANSFORMING 
CPU 
CODE 
Naïve 
CPU 
implementa<on 
Preparing 
for 
OpenCL 
implementa<on 
forAll() 
{...
SPLITTING 
KERNELS 
CPU 
implementa<on 
Host 
code 
forAll() 
{ 
ray, 
rayState 
= 
PrimaryRayGen( 
camera, 
pixelLoc 
); ...
SPLITTING 
KERNELS 
Host 
Code 
Device 
Code 
{ 
launch( 
PrimaryRayGenKernel 
); 
while(1) 
{ 
launch( 
TraceKernel 
); 
...
DESIGN 
LOCALIZE 
BRANCH 
Camera 
Type 
Brdf 
Type 
{ 
launch( 
PrimaryRayGenKernel 
); 
while(1) 
{ 
launch( 
TraceKernel...
DESIGN 
RAY 
STATE 
y Cannot 
keep 
state 
between 
kernels 
y Ray 
state 
needs 
to 
be 
saved 
to/restored 
from 
glob...
DESIGN 
RAY 
STATE 
y Cannot 
keep 
state 
between 
kernels 
y Ray 
state 
needs 
to 
be 
saved 
to/restored 
from 
glob...
RAY 
COMPACTION 
y Sparse 
data 
lowers 
SIMD 
u8liza8on 
y Without 
compac8on 
‒ 3 
SIMD 
execu8ons 
‒ Occupancy 
Prima...
RAY 
COMPACTION 
y Sparse 
data 
lowers 
SIMD 
u8liza8on 
y Without 
compac8on 
‒ 3 
SIMD 
execu8ons 
‒ Occupancy 
(3/8,...
RAY 
COMPACTION 
y Sparse 
data 
lowers 
SIMD 
u8liza8on 
y Without 
compac8on 
‒ 3 
SIMD 
execu8ons 
‒ Occupancy 
(3/8,...
DIRECT 
ILLUMINATION 
COMPUTATION 
y SampleLightKernel 
‒ Want 
to 
keep 
the 
work 
uniform 
‒ Different 
# 
of 
light 
...
DIRECT 
ILLUMINATION 
COMPUTATION 
y SampleLightKernel 
y TraceRayKernel 
‒ Check 
if 
the 
point 
on 
the 
light 
is 
v...
SAMPLE 
NEXT 
RAY 
y Compute 
next 
ray 
by 
sampling 
BRDF 
y Store 
ray 
and 
ray 
state 
48 
| 
Introduc8on 
to 
Mont...
TRACE 
KERNEL 
y BVH 
is 
used 
for 
accelera8on 
structure 
‒ Index 
is 
used 
to 
describe 
hierarchy 
structure 
(no 
...
TRACE 
KERNEL 
y BVH 
is 
used 
for 
accelera8on 
structure 
‒ Index 
is 
used 
to 
describe 
hierarchy 
structure 
(no 
...
TRACE 
KERNEL 
y BVH 
is 
used 
for 
accelera8on 
structure 
‒ Index 
is 
used 
to 
describe 
hierarchy 
structure 
(no 
...
SO 
FAR 
y Explained 
an 
OpenCL 
implementa8on 
of 
a 
simple 
path 
tracer 
y Easy 
to 
extend 
from 
here 
y Extensi...
ADVANCED 
TOPICS
INSTANCING 
y Powerful 
technique 
to 
increase 
geometric 
complexity 
y Small 
memory 
overhead 
‒ Shares 
geometric 
...
INSTANCING 
y Powerful 
technique 
to 
increase 
geometric 
complexity 
y Small 
memory 
overhead 
‒ Shares 
geometric 
...
LAYERED 
MATERIAL 
y Binary 
tree 
of 
BRDFs 
y Leaf 
node 
‒ BRDF 
y Internal 
node 
‒ Blend 
func8on 
‒ Fresnel 
blen...
LAYERED 
MATERIAL 
y Binary 
tree 
of 
BRDFs 
y Leaf 
node 
‒ BRDF 
y Internal 
node 
‒ Blend 
func8on 
‒ Fresnel 
blen...
MORE 
ADVANCED 
TOPICS
VR 
y Latency 
is 
super 
important 
y To 
improve 
a 
frame 
rendering 
8me, 
‒ Used 
mul8ple 
GPUs 
‒ Foveated 
render...
VR 
y Latency 
is 
important 
y To 
improve 
a 
frame 
rendering 
8me, 
‒ Used 
mul8ple 
GPUs 
‒ Foveated 
rendering 
y...
DISPLACEMENT 
MAPPING 
y Powerful 
technique 
to 
increase 
geometric 
complexity 
y Pre 
tessella8on 
‒ Required 
memor...
DISPLACEMENT 
MAPPING 
y Powerful 
technique 
to 
increase 
geometric 
complexity 
y Pre 
tessella8on 
‒ Required 
memor...
DISPLACEMENT 
MAPPING 
y TraceKernel 
‒ If 
a 
ray 
hit 
a 
quad 
with 
displacement 
map, 
save 
(ray, 
primi8ve) 
to 
a...
VECTOR 
DISPLACEMENT 
IN 
ACTION 
Base 
mesh 
Vector 
displacement 
64 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
Op...
OPEN 
SHADING 
LANGUAGE 
y OSL 
itself 
has 
nothing 
to 
do 
with 
OpenCL 
y Many 
use 
cases 
y Using 
OSL 
in 
OCL 
...
SPIR 
y Standard 
Portable 
Intermediate 
Representa8on 
y Based 
on 
LLVM 
IR 
(32, 
64) 
y Useful 
to 
ship 
OpenCL 
...
SPIR 
CREATE 
SPIR 
BINARY 
y Offline 
compiler 
‒ clang-­‐spir* 
-­‐cc1 
-­‐emit-­‐llvm-­‐bc 
-­‐triple 
spir-­‐unknown-...
68 
| 
Introduc8on 
to 
Monte 
Carlo 
Ray 
Tracing 
OpenCL 
implementa8on 
| 
SEPT 
3, 
2014
Upcoming SlideShare
Loading in …5
×

Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

15,612 views

Published on

Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

Published in: Technology
  • Be the first to comment

Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

  1. 1. INTRODUCTION TO MONTE CARLO RAY TRACING OPENCL IMPLEMENTATION TAKAHIRO HARADA 9/2014
  2. 2. RECAP OF LAST SESSION y Talked about theory y BRDFs ‒ Reflec8on, Refrac8on, Diffuse, Microfacet y Fresnel is everywhere y Monte Carlo Ray Tracing ‒ Intui8ve understanding of Monte Carlo Integra8on ‒ Simple sampling (Random sampling) ‒ BeXer sampling (Importance sampling) ‒ Layered material hXp://www.slideshare.net/takahiroharada/introduc8on-­‐to-­‐monte-­‐carlo-­‐ray-­‐tracing-­‐cedec2013 2 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  3. 3. REVIEW SIMPLE CPU MC RAY TRACER Direct illumina<on for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } } 3 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  4. 4. REVIEW SIMPLE CPU MC RAY TRACER Indirect illumina<on for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } 4 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  5. 5. REVIEW SIMPLE CPU MC RAY TRACER Direct illumina<on Indirect Illumina<on for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } } 5 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
  6. 6. COMPARISON Direct illumina<on Indirect illumina<on 6 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  7. 7. WHY OPENCL? y Speed! ‒ GPU can accelerate it ‒ Why? Faster is the beXer y OpenCL is an API for GPU compute y OpenCL is not only for graphics programmers y OpenCL does not always require a GPU ‒ Runs on CPU too ‒ Runs if there is a CPU (everywhere) y If renderer is wriXen in OpenCL, runs on Windows, Linux, MacOSX 7 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 J
  8. 8. PORTING TO OPENCL (FIRST ATTEMPT)
  9. 9. THINGS TO BE DONE DATA STRUCTURE y No pointer in OpenCL* y Change pointer to index y Stored in a flat memory y Not suited for par8al update *Shared Virtual Memory (OpenCL 2.0) 9 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  10. 10. 10 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 THINGS TO BE DONE y Node data for a binary tree ‒ Spa8al accelera8on structure (BVH) ‒ Shading network y Buffer<NodeData> nodeData; DATA STRUCTURE Node Data m_max.x m_max.y m_max.z m_min.x m_min.y m_min.z m_child0 m_child1 Node Data m_max.x m_max.y m_max.z m_min.x m_min.y m_min.z m_child0 m_child1 Node Data m_max.x m_max.y m_max.z m_min.x m_min.y m_min.z m_child0 m_child1
  11. 11. THINGS TO BE DONE DATA STRUCTURE y Material ‒ Texture entry y Buffer<Material> material; y Buffer<char> texData; y Buffer<uint> texTable; 11 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Material0 m_kd m_ior m_... m_kdTex m_iorTex m_bumpTex TextureTable . . . tex0 tex1 tex2 tex3 tex4 Texture0 m_header m_data Texture1 m_header m_data Texture2 m_header m_data Material1 m_kd m_ior m_... m_kdTex m_iorTex m_bumpTex . . . Texture3 m_header m_data
  12. 12. THINGS TO BE DONE WRITING OPENCL KERNEL CPU code OpenCL kernel for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } 12 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 __kernel void PtKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) return; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
  13. 13. IT WORKS BUT… y This approach is simple y But a lot of issues 13 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  14. 14. DRAWBACKS PERFORMANCE y Likely not u8lize hardware efficiently ‒ SIMD divergence ‒ GPU occupancy (latency) y Maintainability y Extendibility, Portability 14 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  15. 15. GPU ARCHITECTURE
  16. 16. OPENCL ON CPU y Processing element executes Work item 16 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 (thread) ‒ A SIMD lane (4*) y Compute unit executes Work group (thread group) ‒ A core (8*) ‒ # of processing elements != # of work items y Compute device executes Kernel (shader) ‒ A CPU ‒ # of compute units != # of work groups * On AMD FX-­‐8350 Work item Processing element Compute Unit Work group Kernel . . .
  17. 17. GPU VS CPU y Processing element executes Work item ‒ A SIMD lane (64*) y Compute unit executes Work group ‒ A SIMD engine (44x4*) ‒ # of processing elements != # of work items y Compute device executes Kernel ‒ A GPU ‒ # of compute units != # of work groups * On AMD Radeon R9 290X 17 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Work item GPU CPU Processing element Compute Unit (4) Work group Kernel . . . Processing element Compute Unit (64) ...
  18. 18. HIGH LEVEL DESCRIPTION y Today’s GPU is similar to a CPU (if you look at very high level) ‒ GPU is an extremely wide CPU ‒ Many cores ‒ Wide SIMD y AMD Radeon R9 290X GPU ‒ 176 = 44x4 SIMD engines (cores) ‒ 64 wide SIMD y But different in ‒ SIMD width (very wide) ‒ Limited local resources ‒ Strategy to hide latency y Knowing those are the key to exploit the performance 18 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  19. 19. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 19 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  20. 20. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 20 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  21. 21. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 21 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  22. 22. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 22 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  23. 23. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 23 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  24. 24. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 24 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  25. 25. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) 25 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
  26. 26. WIDE SIMD EXECUTION y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) L L L 26 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7 J J L J
  27. 27. SIMD DIVERGENCE y SIMD execu8on = Program counter is shared among SIMD lanes y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD) J J J 27 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7 J J J J
  28. 28. LATENCY y Highest latency is from memory access y CPU prevent it by having larger cache ‒ Latency of cache access is small (fast) y Most of the memory access do not go to memory y CPU can run at full speed un8l a cache miss y # of concurrent execu8on on the GPU is far much larger than CPU ‒ More than 11k (= 44x4x64) work items y GPU cache is not large enough to absorb memory requests from those if they all requests different part of memory y Strategy ‒ Keep memory access as local as possible (not realis8c for prac8cal apps) ‒ Uses GPU mechanism for latency hiding 28 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  29. 29. GPU LATENCY HIDING y GPU can execute at full speed if there are only ALU instruc8ons (Inst. 0 -­‐ 2) * y Stalls on memory access instruc8on (Inst. 3) * Can hide latency using logical vector 29 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Inst. 0 Inst. 1 Inst. 2 Inst. 3 Lane0 LLaannee11 Lane2 Lane3 (MemAccess) Inst. 4
  30. 30. GPU LATENCY HIDING y When stalled, switch to another work group y Could fill the stall with instruc8ons from WG1 y A SIMD of GPU needs to process mul8ple WGs at the same 8me to hide latency (or maximize its throughput) 30 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 WG0: Inst. 0 WG0: Inst. 1 WG0: Inst. 2 WG0: Inst. 3 Lane0 LLaannee11 Lane2 Lane3 WG0: Inst. 4 WG1: Inst. 0 WG1: Inst. 1 WG1: Inst. 2 WG1: Inst. 3 Lane0 LLaannee11 Lane2 Lane3
  31. 31. HOW MANY WGS CAN WE EXECUTE PER SIMD y 10 wavefronts (64WIs) per SIMD is the max y It depends on local resource usage of the kernel y VGPR usage is ozen the problem y Share 256 VGPRs among n work groups ‒ 1 wavefront, 256VGPRs LL ‒ 2 wavefronts, 128VGPRs ‒ 4 wavefronts, 64VGPRs J ‒ 10 wavefronts, 25VGPRs y Share 16KB LDS among n work groups ‒ 1 work group, 16KB LL ‒ 2 work group, 8KB ‒ 4 work group, 4KB J 31 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 y VGPRs ‒ Registers used by vector ALUs ‒ 64KB/SIMD ‒ 256 VGPRs/SIMD lane (= 64KB/64/4) y LDS (Local data share) ‒ 64KB/CU (CU == 4SIMD) ‒ 32KB/SIMD
  32. 32. ADVICE TO REDUCE VGPR PRESSURE GET MORE PERFORMANCE FROM GPU y Don’t write a large kernel y If the program can be split into several pieces, split them into several kernels ‒ Single kernel approach ‒ VGPR usage of the kernel is 200 = max(60, 200, 10) ‒ 1 wavefront per SIMD ‒ Bad for latency hiding ‒ Mul8ple kernel approach ‒ FuncA: 4 wavefronts per SIMD ‒ FuncB: 1 wavefronts per SIMD ‒ FuncC: 10 wavefronts per SIMD ‒ FuncB is bad, but FuncA, FuncC runs fast y Helps compiler too 32 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Single Kernel FuncA (60VGPRS) FuncB (200VGPRS) FuncC(10VGPRS) Mul8ple Kernels FuncA (60VGPRS) FuncB (200VGPRS) FuncC(10VGPRS) L L L J L J
  33. 33. VLIW4 (NI), VLIW5 (EG) Scalar Architecture (SI, CI) Lane 1 Lane 2 Lane 3 Lane 4 33 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 W WHAT IS THE SCALAR ARCHITECTURE? y Best for vector computa8on y Low efficiency on scalar computa8on y Physical concurrent execu8on ‒ 16 work items ‒ 4 ALU opera8ons each ‒ Total: 16x4 ALU opera8ons y 1 SIMD is running in a CU y Difficult to fill xyzw y If not filled, we waste HW cycle y Good for both vector and scalar computa8on y Need more work groups to fill GPU y Physical concurrent execu8on ‒ 16x4 work items ‒ 1 ALU opera8on each ‒ Total: 16x4 ALU opera8ons y 4 SIMDs are running in a CU y 4x more work groups are necessary to fill HW X Y Z W . . . Lane 0 X Y Z W X Y Z W X Y Z W X Y Z W Lane 15 X Y Z W . . . SIMD0 SIMD0 SIMD1 SIMD3
  34. 34. ANOTHER SOLUTION y Spli{ng computa8on into mul8ple kernels ‒ Primary ray gen kernel ‒ Trace kernel ‒ Evaluate DI kernel ‒ Etc y BeXer HW u8liza8on ‒ Less divergence ‒ Higher HW occupancy 34 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  35. 35. OTHER BENEFITS? 35 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Ray Genera8on First Hit First Hit (Normal) Direct Illumina8on Indirect Illumina8on y Maintainability ‒ Debug is not as easy as we do on C, C++ ‒ Not all compilers are mature ‒ Can hit to a compiler bug, which is hard to debug ‒ Helps compiler ‒ By spli{ng kernels, we can isolate the issue ‒ If the code is developed by many people, this is important y Extendibility, Portability ‒ Easy to extend features ‒ Primary Ray Gen Kernel ‒ Add another camera projec8on ‒ Ray Cas8ng Kernel ‒ Easy to add another primi8ves (e.g., vector displacement) ‒ Take it out for physics ray cas8ng queries
  36. 36. PORTING TO OPENCL (SECOND ATTEMPT)
  37. 37. SPLITTING KERNELS TRANSFORMING CPU CODE Naïve CPU implementa<on Preparing for OpenCL implementa<on forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } 37 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 { forAll() PrimaryRayGenKernel(); while(1) { forAll() TraceKernel(); if( !any( hits ) ) break; forAll() SampleLightKernel(); forAll() TraceKernel(); forAll() AccumulateDIKernel(); forAll() SampleNextRayKernel(); } } Each for loop => A kernel execu8on
  38. 38. SPLITTING KERNELS CPU implementa<on Host code forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } 38 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
  39. 39. SPLITTING KERNELS Host Code Device Code { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } } 39 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 __kernel void PrimaryRayGenKernel(); y Generate rays for all pixels in parallel __kernel void TraceKernel(); y Compute intersec8on for all rays in parallel __kernel void SampleLightKernel(); y Sample light for all hit points in parallel __kernel void AccumulateDItKernel(); y Accumulate DI for all hit points in parallel __kernel void SampleNextRayKernel(); y Generate bounced rays for all hit points in parallel
  40. 40. DESIGN LOCALIZE BRANCH Camera Type Brdf Type { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } } 40 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
  41. 41. DESIGN RAY STATE y Cannot keep state between kernels y Ray state needs to be saved to/restored from global memory 41 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 y Example ‒ Ray genera8on + ray direc8on visualiza8on __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); } struct RayState { float4 m_throughput; int2 m_randomNumber; int m_pixelIdx; };
  42. 42. DESIGN RAY STATE y Cannot keep state between kernels y Ray state needs to be saved to/restored from global memory In: pixelIdx In: pixelIdx In: pixelIdx Global memory (state) 42 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 y Example ‒ Ray genera8on + ray direc8on visualiza8on __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); } Out: Ray Out: Ray Out: Ray In: pixelIdx Out: Ray PrimaryRayGenKernel In: Ray Out: Pixel color In: Ray Out: Pixel color In: Ray Out: Pixel color In: Ray Out: Pixel color VisualizeKernel
  43. 43. RAY COMPACTION y Sparse data lowers SIMD u8liza8on y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy Primary (3/8, 1/8, 4/8) Secondary 43 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  44. 44. RAY COMPACTION y Sparse data lowers SIMD u8liza8on y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8) Primary Secondary y With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8) Primary Secondary *When rays are created for all pixels, this is not necessary 44 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  45. 45. RAY COMPACTION y Sparse data lowers SIMD u8liza8on y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8) y With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8) *When rays are created for all pixels, this is not necessary 45 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 y No need to write a compac8on kernel y Can compact using global atomics ‒ Prepare a counter (gRayCount) ‒ Perform atomic increment to reserve memory ‒ BeXer to do atomics in WG first, then do an atomic add per WG __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; } Primary Secondary Primary Secondary
  46. 46. DIRECT ILLUMINATION COMPUTATION y SampleLightKernel ‒ Want to keep the work uniform ‒ Different # of light sample per ray isn’t good ‒ Compute contribu8on from one point on a light y Simple approach ‒ Select a light ‒ Select a point on a light ‒ Compute DI without occlusion term y More sophis8cated light sampling ‒ Using poten8al contribu8on for PDF ‒ Forward+ style light culling 46 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 __kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; }
  47. 47. DIRECT ILLUMINATION COMPUTATION y SampleLightKernel y TraceRayKernel ‒ Check if the point on the light is visible or not ‒ Reuse code y AccumulateDIKernel ‒ If the ray is not blocked, accumulate the result 47 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 __kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; } __kernel void AccumulateDIKernel(__global ...) { Hit shadowHit = gShadowHit[GIDX]; float4 di = gDi[GIDX]; if( !shadowHit ) gFb[GIDX] += di; }
  48. 48. SAMPLE NEXT RAY y Compute next ray by sampling BRDF y Store ray and ray state 48 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 __kernel void SampleNextRayKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; Hit hit = gHit[GIDX]; if( !hit ) return; nextRay, s = Brdf_Sample( ray, s ); int dst = atom_inc( &gRayCount ); gRayNext[dst] = nextRay; gRayStateNext[dst] = s; }
  49. 49. TRACE KERNEL y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer) 0 1 2 3 4 5 6 7 8 9 49 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 0 1 2 3 4 5 6 Mesh0 xform0 Mesh1 xform1 Mesh2 xform2 Mesh3 xform3
  50. 50. TRACE KERNEL y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer) 0 1 2 3 4 5 6 7 8 9 y 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf 50 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 0 1 2 3 4 5 6 Top BVH Mesh0 xform0 Bohom BVH Mesh1 xform1 Mesh2 xform2 Mesh3 xform3
  51. 51. TRACE KERNEL y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer) 0 1 2 3 4 5 6 7 8 9 y 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf y Store those BVHs in a single memory ‒ Traverse top tree ‒ Hit a leaf, transform the ray into object space ‒ Traverse boXom tree ‒ On exit, transform the ray back to world space Bohom A Bohom B Bohom C Bohom D 51 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 0 1 2 root idx Top 3 4 5 6 Top BVH Mesh0 xform0 Bohom BVH Mesh1 xform1 Mesh2 xform2 Mesh3 xform3
  52. 52. SO FAR y Explained an OpenCL implementa8on of a simple path tracer y Easy to extend from here y Extension can be done by swapping one or two kernels ‒ Material system, Shader ‒ Light sampling ‒ Support for different type of primi8ves ‒ Ray caster + spa8al accelera8on structure 52 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  53. 53. ADVANCED TOPICS
  54. 54. INSTANCING y Powerful technique to increase geometric complexity y Small memory overhead ‒ Shares geometric informa8on (vertex, normal etc) ‒ Shares BVH ‒ Stores object transform 54 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 0 1 2 3 4 5 6 Top BVH Mesh0 xform0 Mesh0 xform1 Mesh0 xform2 Mesh1 xform3
  55. 55. INSTANCING y Powerful technique to increase geometric complexity y Small memory overhead ‒ Shares geometric informa8on (vertex, normal etc) ‒ Shares BVH ‒ Stores object transform 55 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 0 Bohom A Bohom B Top 1 2 3 4 5 6 Top BVH Mesh0 xform0 Mesh0 xform1 Mesh0 xform2 Mesh1 xform3 Bohom BVH
  56. 56. LAYERED MATERIAL y Binary tree of BRDFs y Leaf node ‒ BRDF y Internal node ‒ Blend func8on ‒ Fresnel blend, Linear blend y Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node 56 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Reflect Diffuse 0.5 0.5 Microfacet pdf=0.25 pdf=0.5 0.5 0.5 pdf=0.25
  57. 57. LAYERED MATERIAL y Binary tree of BRDFs y Leaf node ‒ BRDF y Internal node ‒ Blend func8on ‒ Fresnel blend, Linear blend y Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node 57 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SelectBRDFKernel ); launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
  58. 58. MORE ADVANCED TOPICS
  59. 59. VR y Latency is super important y To improve a frame rendering 8me, ‒ Used mul8ple GPUs ‒ Foveated rendering y More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample 59 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  60. 60. VR y Latency is important y To improve a frame rendering 8me, ‒ Used mul8ple GPUs ‒ Foveated rendering y More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample 60 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 { launch( VRPrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } launch( FillPixelKernel ); }
  61. 61. DISPLACEMENT MAPPING y Powerful technique to increase geometric complexity y Pre tessella8on ‒ Required memory is too large ‒ GPU memory is too small y Direct ray tracing ‒ When hit a patch, tessellate and displace Base mesh Vector displacement map With vector displacement 61 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Fig. from hXp://support.nextlimit.com/display/mxdocsv3/Displacement+component
  62. 62. DISPLACEMENT MAPPING y Powerful technique to increase geometric complexity y Pre tessella8on ‒ Required memory is too large ‒ GPU memory is too small y Direct ray tracing ‒ When hit a patch, tessellate and displace y To amor8ze tessella8on, displacement cost, batch ray intersec8on y Need to change TraceKernel 62 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  63. 63. DISPLACEMENT MAPPING y TraceKernel ‒ If a ray hit a quad with displacement map, save (ray, primi8ve) to a buffer ‒ Sort (ray, primi8ve) pairs by primi8ve index ‒ Process primi8ves in the list in parallel y For each patch ‒ Build quad BVH in parallel ‒ Cast rays in parallel y Key is work buffer memory alloca8on 63 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Level 0 (1 node) BVH Comt Ray Cast Ray Cast Ray Cast Ray Cast Level 2 (16 nodes) BVH Comt BVH Comt BVH Comt BVH Comt BVH Comt BVH Comt Ray Cast Ray Cast Ray Cast Ray Cast Ray Cast Ray Cast Level 1 (4 nodes) BVH Comt BVH Comt BVH Comt BVH Comt
  64. 64. VECTOR DISPLACEMENT IN ACTION Base mesh Vector displacement 64 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 52GB memory if pre tessella8on is used
  65. 65. OPEN SHADING LANGUAGE y OSL itself has nothing to do with OpenCL y Many use cases y Using OSL in OCL renderer ‒ Translate OSL to ‒ OCL kernel ‒ SPIR ‒ Feed those to OCL run8me ‒ clBuildProgram ‒ clCreateKernel 65 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 y OSL example surface maXe [[ string descrip8on = "Lamber8an diffuse material" ]] (float Kd = 1 [[float UImin = 0, float UIsozmax = 1 ]], color Cs = 1 [[float UImin = 0, float UImax = 1 ]], string texname = “diffuse.tex” [[int texture_slot = 1]] ) {     Ci = Kd * Cs * noise(5.0 * P) * diffuse (N); }
  66. 66. SPIR y Standard Portable Intermediate Representa8on y Based on LLVM IR (32, 64) y Useful to ship OpenCL Apps y Device independent y OpenCL did not have usable binary code representa8on ‒ Binary for each device x driver ‒ Combina8on explode ‒ Embed kernel as string ‒ Load source, clCreateProgramWithSource ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES ‒ Load binary, clCreateProgramWithBinary y OpenCL implementa8on has to support cl_khr_spir extension ‒ Works on AMD, Intel (OpenCL 1.2) ‒ SPIR 2.0 is coming with OpenCL 2.0 66 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  67. 67. SPIR CREATE SPIR BINARY y Offline compiler ‒ clang-­‐spir* -­‐cc1 -­‐emit-­‐llvm-­‐bc -­‐triple spir-­‐unknown-­‐unknown -­‐cl-­‐spir-­‐compile-­‐op8ons ”-­‐x spir" -­‐include <opencl_spir.h> -­‐o <output> <input> ‒ clBuildProgram with “-­‐x spir -­‐spir-­‐std=CL1.2” y Use host OpenCL API ‒ clCompileProgram + Op8on ‒ clGetProgramInfo + CL_PROGRAM_BINARIES *hXps://github.com/KhronosGroup/SPIR 67 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  68. 68. 68 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014

×