INTRODUCTION TO MONTE CARLO RAY TRACING
OPENCL IMPLEMENTATION
TAKAHIRO HARADA
9/2014
RECAP OF LAST SESSION
• Talked about theory
• BRDFs
  ‒ Reflection, Refraction, Diffuse, Microfacet
• Fresnel is everywhere
• Monte Carlo Ray Tracing
  ‒ Intuitive understanding of Monte Carlo Integration
  ‒ Simple sampling (Random sampling)
  ‒ Better sampling (Importance sampling)
  ‒ Layered material
http://www.slideshare.net/takahiroharada/introduction-to-monte-carlo-ray-tracing-cedec2013
2 | Introduction to Monte Carlo Ray Tracing OpenCL implementation | SEPT 3, 2014
REVIEW
SIMPLE CPU MC RAY TRACER
Direct illumination
    for( i, j )
    {
        ray = PrimaryRayGen( camera, pixelLoc );
        {
            hit = Trace( ray );
            if( hit )
                d( pixelLoc ) += EvaluateDI( ray, hit );
        }
    }
REVIEW
SIMPLE CPU MC RAY TRACER
Indirect illumination
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
COMPARISON
[Figure: two renders, labeled "Direct illumination" and "Indirect illumination"]
WHY OPENCL?
• Speed!
  ‒ GPU can accelerate it
  ‒ Why? The faster, the better
• OpenCL is an API for GPU compute
• OpenCL is not only for graphics programmers
• OpenCL does not always require a GPU
  ‒ Runs on CPU too
  ‒ Runs if there is a CPU (everywhere)
• If the renderer is written in OpenCL, it runs on Windows, Linux, MacOSX
PORTING TO OPENCL (FIRST ATTEMPT)
THINGS TO BE DONE
DATA STRUCTURE
• No pointers in OpenCL*
• Change pointers to indices
• Data is stored in flat memory
• Not suited for partial updates
*Except with Shared Virtual Memory (OpenCL 2.0)
THINGS TO BE DONE
DATA STRUCTURE
• Node data for a binary tree
  ‒ Spatial acceleration structure (BVH)
  ‒ Shading network
• Buffer<NodeData> nodeData;
[Figure: nodeData drawn as consecutive Node Data records, each holding m_max.x, m_max.y, m_max.z, m_min.x, m_min.y, m_min.z, m_child0, m_child1]
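The pointer-to-index change can be sketched in plain C as follows. This is only an illustration under assumptions: the field names follow the slide, but the use of -1 for "no child", the fixed stack size, and the helper countLeaves are hypothetical. The explicit stack is used because OpenCL C does not support recursion.

```c
#define STACK_SIZE 64  /* assumption: tree depth stays below 64 */

/* Node layout mirroring the slide's NodeData fields.
   Children are indices into the same flat array; -1 means "no child" (assumed). */
typedef struct {
    float m_max[3];
    float m_min[3];
    int   m_child0;
    int   m_child1;
} NodeData;

/* Count the leaves reachable from root by walking indices with an explicit stack. */
int countLeaves(const NodeData* nodes, int root)
{
    int stack[STACK_SIZE];
    int top = 0;
    int leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        const NodeData* n = &nodes[stack[--top]];
        if (n->m_child0 < 0 && n->m_child1 < 0) {
            leaves++;            /* a node without children is a leaf */
            continue;
        }
        if (n->m_child0 >= 0 && top < STACK_SIZE) stack[top++] = n->m_child0;
        if (n->m_child1 >= 0 && top < STACK_SIZE) stack[top++] = n->m_child1;
    }
    return leaves;
}
```

Because the nodes only refer to each other by index, the whole array can be copied to a device buffer unchanged.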
THINGS TO BE DONE
DATA STRUCTURE
• Material
  ‒ Texture entry
• Buffer<Material> material;
• Buffer<char> texData;
• Buffer<uint> texTable;
[Figure: Material0, Material1, ... records (m_kd, m_ior, m_..., m_kdTex, m_iorTex, m_bumpTex); a TextureTable (tex0, tex1, tex2, tex3, tex4, ...); Texture0 to Texture3 records (m_header, m_data)]
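One way the three buffers could fit together is sketched below. The slide only names the buffers and fields, so everything else is an assumption: that m_kdTex indexes texTable, that texTable holds byte offsets into texData, that -1 means "no texture", and that m_header contains a width and height. The helper lookupKdTexture is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Only m_kd, m_ior, m_kdTex are names from the slide; the types are assumed. */
typedef struct {
    float   m_kd[4];
    float   m_ior;
    int32_t m_kdTex;    /* index into texTable, -1 = untextured (assumed) */
} Material;

typedef struct {
    uint32_t m_width;   /* assumed contents of m_header */
    uint32_t m_height;
} TextureHeader;

/* Resolve material -> texTable entry -> byte offset in texData.
   Fills *header and returns a pointer to the texel bytes, or NULL if untextured. */
const unsigned char* lookupKdTexture(const Material* mat,
                                     const uint32_t* texTable,
                                     const unsigned char* texData,
                                     TextureHeader* header)
{
    if (mat->m_kdTex < 0)
        return NULL;
    uint32_t offset = texTable[mat->m_kdTex];
    /* memcpy avoids unaligned reads from the byte buffer */
    memcpy(header, texData + offset, sizeof(TextureHeader));
    return texData + offset + sizeof(TextureHeader);
}
```

Keeping all textures in one byte buffer means a variable number of textures needs only three kernel arguments.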
THINGS TO BE DONE
WRITING OPENCL KERNEL
CPU code
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
OpenCL kernel
    __kernel
    void PtKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                return;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
IT WORKS BUT…
• This approach is simple
• But it has a lot of issues
DRAWBACKS
PERFORMANCE
• Likely does not utilize hardware efficiently
  ‒ SIMD divergence
  ‒ GPU occupancy (latency)
• Maintainability
• Extendibility, Portability
GPU ARCHITECTURE
OPENCL ON CPU
• Processing element executes Work item (thread)
  ‒ A SIMD lane (4*)
• Compute unit executes Work group (thread group)
  ‒ A core (8*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel (shader)
  ‒ A CPU
  ‒ # of compute units != # of work groups
* On AMD FX-8350
[Figure: Work item, Processing element, Compute Unit, Work group, Kernel]
GPU VS CPU
• Processing element executes Work item
  ‒ A SIMD lane (64*)
• Compute unit executes Work group
  ‒ A SIMD engine (44x4*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel
  ‒ A GPU
  ‒ # of compute units != # of work groups
* On AMD Radeon R9 290X
[Figure: Work item, Work group, Kernel mapped onto a CPU Compute Unit (4 processing elements) and a GPU Compute Unit (64 processing elements)]
HIGH LEVEL DESCRIPTION
• Today's GPU is similar to a CPU (if you look at it from a very high level)
  ‒ A GPU is an extremely wide CPU
  ‒ Many cores
  ‒ Wide SIMD
• AMD Radeon R9 290X GPU
  ‒ 176 = 44x4 SIMD engines (cores)
  ‒ 64-wide SIMD
• But different in
  ‒ SIMD width (very wide)
  ‒ Limited local resources
  ‒ Strategy to hide latency
• Knowing these differences is the key to exploiting the performance
SIMD DIVERGENCE
• SIMD execution = the program counter is shared among SIMD lanes
• If lanes diverge at branches, HW utilization decreases a lot (divergence gets more likely on wide SIMD)
    int funcA()
    {
        int value = 0;
        int a = computeA();
        if( a == 0 )
            value = compute0();
        else if( a == 1 )
            value = compute1();
        else if( a == 2 )
            value = compute2();
        else if( a == 3 )
            value = compute3();
        return value;
    }
[Figure: execution of funcA() across Lane0 to Lane7]
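The cost of divergence in funcA() can be estimated with a small model. This is a sketch under stated assumptions, not a description of any particular GPU: each lane's a is assumed to be in 0..3, every branch is assumed to cost the same, and the helper names divergentPasses and utilization are hypothetical.

```c
/* Number of serialized passes one SIMD needs when lane i wants branch a[i]:
   the shared program counter visits each distinct branch once, with the
   lanes that did not take it masked off. Assumes 0 <= a[i] <= 3. */
int divergentPasses(const int* a, int lanes)
{
    int taken[4] = {0, 0, 0, 0};
    int passes = 0;
    for (int i = 0; i < lanes; i++) {
        if (!taken[a[i]]) {
            taken[a[i]] = 1;
            passes++;
        }
    }
    return passes;
}

/* Fraction of lane slots doing useful work, assuming equal cost per branch
   and lanes > 0: each lane is active in exactly one of the passes. */
float utilization(const int* a, int lanes)
{
    return 1.0f / (float)divergentPasses(a, lanes);
}
```

With 8 lanes all taking the same branch the model gives full utilization; if the lanes spread over all four branches it drops to one quarter, and a 64-wide SIMD has many more chances to hit that case.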
WIDE SIMD EXECUTION
[Figure: execution of funcA() across Lane0 to Lane7]
LATENCY
• The highest latency comes from memory access
• A CPU mitigates it with a larger cache
  ‒ Latency of a cache access is small (fast)
• Most memory accesses do not go to memory
• A CPU can run at full speed until a cache miss
• The # of concurrent executions on the GPU is far larger than on a CPU
  ‒ More than 11k (= 44x4x64) work items
• The GPU cache is not large enough to absorb memory requests from those if they all request different parts of memory
• Strategy
  ‒ Keep memory access as local as possible (not realistic for practical apps)
  ‒ Use the GPU mechanism for latency hiding
GPU LATENCY HIDING
• GPU can execute at full speed if there are only ALU instructions (Inst. 0 - 2) *
• Stalls on a memory access instruction (Inst. 3)
* Can hide latency using logical vector
[Figure: Lane0 to Lane3 executing Inst. 0, Inst. 1, Inst. 2, Inst. 3 (MemAccess), Inst. 4]
GPU LATENCY HIDING
• When stalled, switch to another work group
• Could fill the stall with instructions from WG1
• A SIMD of a GPU needs to process multiple WGs at the same time to hide latency (or maximize its throughput)
[Figure: timeline on Lane0 to Lane3 interleaving WG0: Inst. 0 to Inst. 4 with WG1: Inst. 0 to Inst. 3]
HOW MANY WGS CAN WE EXECUTE PER SIMD
• 10 wavefronts (64 WIs each) per SIMD is the max
• It depends on the local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n wavefronts
  ‒ 1 wavefront, 256 VGPRs (bad)
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs (good)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB (bad)
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB (good)
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 32KB/SIMD
ADVICE TO REDUCE VGPR PRESSURE
GET MORE PERFORMANCE FROM GPU
• Don't write a large kernel
• If the program can be split into several pieces, split it into several kernels
  ‒ Single kernel approach
    ‒ VGPR usage of the kernel is 200 = max(60, 200, 10)
    ‒ 1 wavefront per SIMD
    ‒ Bad for latency hiding
  ‒ Multiple kernel approach
    ‒ FuncA: 4 wavefronts per SIMD
    ‒ FuncB: 1 wavefront per SIMD
    ‒ FuncC: 10 wavefronts per SIMD
    ‒ FuncB is bad, but FuncA and FuncC run fast
• Helps the compiler too
[Figure: a Single Kernel containing FuncA (60 VGPRs), FuncB (200 VGPRs), FuncC (10 VGPRs), next to Multiple Kernels with one kernel per function]
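The wavefront counts quoted above follow from dividing the 256 VGPRs per lane by the kernel's register usage and capping at 10. A minimal sketch of that arithmetic, assuming VGPRs are the only limit (it ignores LDS usage and any register allocation granularity of real hardware); the function names are hypothetical:

```c
/* Limits quoted on the slides for one SIMD of the R9 290X. */
enum { MAX_WAVEFRONTS_PER_SIMD = 10, VGPRS_PER_LANE = 256 };

/* Wavefronts that fit on one SIMD for a kernel using vgprs registers
   per work item (assumes vgprs > 0). */
int wavefrontsPerSimd(int vgprs)
{
    int n = VGPRS_PER_LANE / vgprs;
    return n < MAX_WAVEFRONTS_PER_SIMD ? n : MAX_WAVEFRONTS_PER_SIMD;
}

/* A fused kernel needs as many VGPRs as its hungriest stage. */
int fusedKernelVgprs(const int* stageVgprs, int count)
{
    int m = 0;
    for (int i = 0; i < count; i++)
        if (stageVgprs[i] > m)
            m = stageVgprs[i];
    return m;
}
```

For the stages FuncA, FuncB, FuncC with 60, 200, and 10 VGPRs this reproduces the slide: the fused kernel needs 200 VGPRs and gets 1 wavefront, while the split kernels get 4, 1, and 10.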
WHAT IS THE SCALAR ARCHITECTURE?
VLIW4 (NI), VLIW5 (EG)
• Best for vector computation
• Low efficiency on scalar computation
• Physical concurrent execution
  ‒ 16 work items
  ‒ 4 ALU operations each
  ‒ Total: 16x4 ALU operations
• 1 SIMD is running in a CU
• Difficult to fill xyzw
• If not filled, we waste HW cycles
Scalar Architecture (SI, CI)
• Good for both vector and scalar computation
• Need more work groups to fill the GPU
• Physical concurrent execution
  ‒ 16x4 work items
  ‒ 1 ALU operation each
  ‒ Total: 16x4 ALU operations
• 4 SIMDs are running in a CU
• 4x more work groups are necessary to fill the HW
[Figure: VLIW lanes (Lane 0 to Lane 15), each with X, Y, Z, W slots; scalar architecture lanes grouped into SIMD0 to SIMD3]
ANOTHER SOLUTION
• Splitting computation into multiple kernels
  ‒ Primary ray gen kernel
  ‒ Trace kernel
  ‒ Evaluate DI kernel
  ‒ Etc.
• Better HW utilization
  ‒ Less divergence
  ‒ Higher HW occupancy
OTHER BENEFITS?
[Figure: panels labeled Ray Generation, First Hit, First Hit (Normal), Direct Illumination, Indirect Illumination]
• Maintainability
  ‒ Debugging is not as easy as in C, C++
  ‒ Not all compilers are mature
    ‒ Can hit a compiler bug, which is hard to debug
    ‒ Helps the compiler
  ‒ By splitting kernels, we can isolate the issue
  ‒ If the code is developed by many people, this is important
• Extendibility, Portability
  ‒ Easy to extend features
  ‒ Primary Ray Gen Kernel
    ‒ Add another camera projection
  ‒ Ray Casting Kernel
    ‒ Easy to add other primitives (e.g., vector displacement)
    ‒ Take it out for physics ray casting queries
PORTING TO OPENCL (SECOND ATTEMPT)
SPLITTING KERNELS
TRANSFORMING CPU CODE
Naïve CPU implementation
    forAll()
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
Preparing for OpenCL implementation
    {
        forAll() PrimaryRayGenKernel();
        while(1)
        {
            forAll() TraceKernel();
            if( !any( hits ) )
                break;
            forAll() SampleLightKernel();
            forAll() TraceKernel();
            forAll() AccumulateDIKernel();
            forAll() SampleNextRayKernel();
        }
    }
Each for loop => a kernel execution
SPLITTING KERNELS
CPU implementation
    forAll()
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
Host code
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
SPLITTING KERNELS
Host Code
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
Device Code
    __kernel void PrimaryRayGenKernel();
• Generate rays for all pixels in parallel
    __kernel void TraceKernel();
• Compute intersection for all rays in parallel
    __kernel void SampleLightKernel();
• Sample light for all hit points in parallel
    __kernel void AccumulateDIKernel();
• Accumulate DI for all hit points in parallel
    __kernel void SampleNextRayKernel();
• Generate bounced rays for all hit points in parallel
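The shape of this host loop can be tried out on the CPU before any OpenCL code exists. The sketch below is a toy under loud assumptions: launch() is a plain loop standing in for a kernel launch, each "ray" is just a bounce budget equal to its index, and light sampling and the shadow trace are omitted. Only the kernel names come from the slides.

```c
#include <stdbool.h>

enum { NUM_RAYS = 4 };

/* Toy per-ray buffers standing in for global memory. */
int  gBouncesLeft[NUM_RAYS];
bool gHit[NUM_RAYS];
int  gFb[NUM_RAYS];

typedef void (*Kernel)(int gid);

/* Stand-in for a kernel launch: run the kernel once per work item. */
void launch(Kernel k)
{
    for (int gid = 0; gid < NUM_RAYS; gid++)
        k(gid);
}

void PrimaryRayGenKernel(int gid) { gBouncesLeft[gid] = gid; gFb[gid] = 0; }
void TraceKernel(int gid)         { gHit[gid] = gBouncesLeft[gid] > 0; }
void AccumulateDIKernel(int gid)  { if (gHit[gid]) gFb[gid] += 1; }
void SampleNextRayKernel(int gid) { if (gHit[gid]) gBouncesLeft[gid]--; }

/* The host-side any( hits ) check from the slide. */
bool anyHit(void)
{
    for (int i = 0; i < NUM_RAYS; i++)
        if (gHit[i])
            return true;
    return false;
}

/* Host loop with the same structure as the slide; returns the bounce count. */
int renderToy(void)
{
    int iterations = 0;
    launch(PrimaryRayGenKernel);
    while (1) {
        launch(TraceKernel);
        if (!anyHit())
            break;
        launch(AccumulateDIKernel);
        launch(SampleNextRayKernel);
        iterations++;
    }
    return iterations;
}
```

Every stage reads and writes only the shared buffers, which is the property that lets each one become a separate kernel.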
DESIGN
LOCALIZE BRANCH
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
[Figure: the host loop is shown twice, with callouts "Camera Type" and "Brdf Type"]
DESIGN
RAY STATE
• Cannot keep state between kernels
• Ray state needs to be saved to/restored from global memory
• Example
  ‒ Ray generation + ray direction visualization
    __kernel
    void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState; // save
    }
    __kernel
    void VisualizeRayKernel(__global ...)
    {
        RayState s = gRayState[get_global_id(0)]; // restore
        Ray ray = gRay[get_global_id(0)];
        gFb[s.m_pixelIdx] = Ray_getDir( ray );
    }
    struct RayState
    {
        float4 m_throughput;
        int2 m_randomNumber;
        int m_pixelIdx;
    };
DESIGN
RAY STATE
[Figure: PrimaryRayGenKernel work items (In: pixelIdx, Out: Ray) and VisualizeKernel work items (In: Ray, Out: Pixel color), connected through Global memory (state)]
RAY COMPACTION
• Sparse data lowers SIMD utilization
• Without compaction
  ‒ 3 SIMD executions
  ‒ Occupancy (3/8, 1/8, 4/8)
• With compaction
  ‒ 1 SIMD execution
  ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
• No need to write a compaction kernel
• Can compact using global atomics
  ‒ Prepare a counter (gRayCount)
  ‒ Perform an atomic increment to reserve memory
  ‒ Better to do atomics within the WG first, then do one atomic add per WG
    __kernel
    void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState;
    }
[Figure: Primary and Secondary ray buffers, without and with compaction]
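The atomic-counter compaction can be imitated on the CPU with C11 atomics. This is a sketch only: atomic_fetch_add plays the role of atom_inc (both return the value before the increment), a serial loop stands in for the kernel launch, and the alive flags and function names are hypothetical.

```c
#include <stdatomic.h>

/* One work item: a live ray reserves the next free slot with an atomic
   increment and writes itself there; dead rays write nothing. */
void compactWorkItem(int gid, const int* alive, const int* rayIn,
                     int* rayOut, atomic_int* rayCount)
{
    if (!alive[gid])
        return;
    int dst = atomic_fetch_add(rayCount, 1);  /* old value = reserved slot */
    rayOut[dst] = rayIn[gid];
}

/* Serial stand-in for launching the kernel over n work items.
   Returns the number of rays written to rayOut. */
int compact(const int* alive, const int* rayIn, int* rayOut, int n)
{
    atomic_int rayCount;
    atomic_init(&rayCount, 0);
    for (int gid = 0; gid < n; gid++)
        compactWorkItem(gid, alive, rayIn, rayOut, &rayCount);
    return atomic_load(&rayCount);
}
```

On a GPU the work items run concurrently, so the output is dense but its order is not deterministic; that is why the ray state has to carry m_pixelIdx along with it.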
DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
  ‒ Want to keep the work uniform
  ‒ A different # of light samples per ray isn't good
  ‒ Compute contribution from one point on a light
• Simple approach
  ‒ Select a light
  ‒ Select a point on the light
  ‒ Compute DI without the occlusion term
• More sophisticated light sampling
  ‒ Using potential contribution for the PDF
  ‒ Forward+ style light culling
    __kernel
    void SampleLightKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        shadowRay, lfnDotV = Light_Sample( ray, s );
        gShadowRay[GIDX] = shadowRay;
        gDi[GIDX] = lfnDotV;
        gRayState[GIDX] = s;
    }
DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
• TraceRayKernel
  ‒ Check if the point on the light is visible or not
  ‒ Reuse code
• AccumulateDIKernel
  ‒ If the ray is not blocked, accumulate the result
    __kernel
    void AccumulateDIKernel(__global ...)
    {
        Hit shadowHit = gShadowHit[GIDX];
        float4 di = gDi[GIDX];
        if( !shadowHit )
            gFb[GIDX] += di;
    }
SAMPLE NEXT RAY
• Compute next ray by sampling BRDF
• Store ray and ray state
    __kernel
    void SampleNextRayKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        Hit hit = gHit[GIDX];
        if( !hit )
            return;
        nextRay, s = Brdf_Sample( ray, s );
        int dst = atom_inc( &gRayCount );
        gRayNext[dst] = nextRay;
        gRayStateNext[dst] = s;
    }
TRACE KERNEL
• BVH is used for the acceleration structure
  ‒ Indices are used to describe the hierarchy (no pointers)
• 2 level BVH
  ‒ Top: stores an object in a leaf (object index, transform)
  ‒ Bottom: stores a primitive (triangle, quad) in a leaf
• Store those BVHs in a single memory buffer
  ‒ Traverse the top tree
  ‒ On hitting a leaf, transform the ray into object space
  ‒ Traverse the bottom tree
  ‒ On exit, transform the ray back to world space
[Figure: Top BVH with nodes 0 to 6 and leaves (Mesh0, xform0), (Mesh1, xform1), (Mesh2, xform2), (Mesh3, xform3), each with a Bottom BVH; a single array indexed 0 to 9 holding root idx, Top, Bottom A, Bottom B, Bottom C, Bottom D]
SO FAR
• Explained an OpenCL implementation of a simple path tracer
• Easy to extend from here
• Extension can be done by swapping one or two kernels
  ‒ Material system, Shader
  ‒ Light sampling
  ‒ Support for different types of primitives
  ‒ Ray caster + spatial acceleration structure
ADVANCED TOPICS
INSTANCING
• Powerful technique to increase geometric complexity
• Small memory overhead
  ‒ Shares geometric information (vertex, normal, etc.)
  ‒ Shares BVH
  ‒ Stores object transform
[Figure: Top BVH with nodes 0 to 6 and leaves (Mesh0, xform0), (Mesh0, xform1), (Mesh0, xform2), (Mesh1, xform3); memory layout Top, Bottom A, Bottom B]
LAYERED MATERIAL
• Binary tree of BRDFs
• Leaf node
  ‒ BRDF
• Internal node
  ‒ Blend function
  ‒ Fresnel blend, Linear blend
• Evaluate one BRDF at a time
  ‒ Traverse binary tree
  ‒ Random sampling at internal node
[Figure: BRDF tree with leaves Reflect, Diffuse, Microfacet; each internal node has branch weights 0.5 and 0.5, giving leaf selection pdfs of 0.25, 0.5, and 0.25]
LAYERED MATERIAL
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SelectBRDFKernel );
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
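Picking one BRDF by random sampling at each internal node could look like the sketch below. It is an illustration under assumptions, not the contents of SelectBRDFKernel: the BrdfNode layout and selectBrdf are hypothetical, a fixed probability per internal node stands in for the blend function (a Fresnel blend would depend on the viewing angle), and the caller supplies one random number in [0,1) per tree level.

```c
/* Hypothetical flat-array node for a BRDF tree. A leaf has m_child0 < 0
   and carries a BRDF id; an internal node carries the probability of
   descending into m_child0. */
typedef struct {
    int   m_child0;
    int   m_child1;
    float m_weight0;  /* probability of picking child0 (internal nodes) */
    int   m_brdfId;   /* valid for leaves only */
} BrdfNode;

/* Walk from the root to a leaf, consuming xi[0], xi[1], ... at successive
   internal nodes. Returns the chosen BRDF id; *pdf receives the product of
   the branch probabilities along the path. */
int selectBrdf(const BrdfNode* nodes, int root, const float* xi, float* pdf)
{
    int idx = root;
    int depth = 0;
    *pdf = 1.0f;
    while (nodes[idx].m_child0 >= 0) {
        const BrdfNode* n = &nodes[idx];
        if (xi[depth++] < n->m_weight0) {
            *pdf *= n->m_weight0;
            idx = n->m_child0;
        } else {
            *pdf *= 1.0f - n->m_weight0;
            idx = n->m_child1;
        }
    }
    return nodes[idx].m_brdfId;
}
```

With 0.5/0.5 weights at both internal nodes, a leaf two levels down is chosen with pdf 0.25 and a leaf directly under the root with pdf 0.5, matching the numbers in the figure.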
MORE ADVANCED TOPICS
VR
• Latency is important
• To improve the frame rendering time,
  ‒ Used multiple GPUs
  ‒ Foveated rendering
• More than 60fps on 4 Hawaii GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample
    {
        launch( VRPrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
        launch( FillPixelKernel );
    }
DISPLACEMENT MAPPING
[Figure: Base mesh, Vector displacement map, With vector displacement. Fig. from http://support.nextlimit.com/display/mxdocsv3/Displacement+component]
DISPLACEMENT MAPPING
• Powerful technique to increase geometric complexity
• Pre-tessellation
  ‒ Required memory is too large
  ‒ GPU memory is too small
• Direct ray tracing
  ‒ When a ray hits a patch, tessellate and displace
• To amortize the tessellation and displacement cost, batch ray intersections
• Need to change TraceKernel
DISPLACEMENT MAPPING
• TraceKernel
  ‒ If a ray hits a quad with a displacement map, save the (ray, primitive) pair to a buffer
  ‒ Sort (ray, primitive) pairs by primitive index
  ‒ Process primitives in the list in parallel
• For each patch
  ‒ Build quad BVH in parallel
  ‒ Cast rays in parallel
• Key is work buffer memory allocation
Fig.: work items labeled "BVH Comt" and "Ray Cast" at Level 0 (1 node), Level 1 (4 nodes), and Level 2 (16 nodes)
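The "save, sort, process" steps listed for TraceKernel can be sketched on the CPU as follows. This is plain C with the standard `qsort` swapped in for whatever parallel sort the GPU implementation uses, and the `RayPrimPair` and `PrimBatch` structs are hypothetical. The point is only that after sorting, all rays that reached the same displaced primitive sit next to each other, so the primitive can be tessellated and its BVH built once for the whole batch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One deferred intersection: ray `ray` reached displaced primitive `prim`. */
typedef struct {
    int ray;
    int prim;
} RayPrimPair;

/* A contiguous run of pairs in the sorted buffer that share one primitive. */
typedef struct {
    int prim;
    size_t first;
    size_t count;
} PrimBatch;

static int compare_by_prim(const void *a, const void *b) {
    const RayPrimPair *pa = (const RayPrimPair *)a;
    const RayPrimPair *pb = (const RayPrimPair *)b;
    if (pa->prim != pb->prim)
        return (pa->prim > pb->prim) - (pa->prim < pb->prim);
    /* Tie-break on the ray index so the order is deterministic. */
    return (pa->ray > pb->ray) - (pa->ray < pb->ray);
}

/* Sorts pairs in place by primitive index and writes one batch per
   distinct primitive. batches must hold at least n entries.
   Returns the number of batches. */
static size_t build_batches(RayPrimPair *pairs, size_t n, PrimBatch *batches) {
    assert(pairs && batches);
    qsort(pairs, n, sizeof pairs[0], compare_by_prim);
    size_t num_batches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (num_batches == 0 ||
            batches[num_batches - 1].prim != pairs[i].prim) {
            batches[num_batches].prim = pairs[i].prim;
            batches[num_batches].first = i;
            batches[num_batches].count = 0;
            ++num_batches;
        }
        ++batches[num_batches - 1].count;
    }
    return num_batches;
}
```

For five pairs that touch primitives 7, 2, 7, 2, and 5, this yields three batches (primitives 2, 5, and 7). The number of batches and their sizes are only known after the sort, which is one reason the work buffer allocation mentioned on the slide needs care.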
VECTOR DISPLACEMENT IN ACTION
Figure panels: Base mesh | Vector displacement
Pre-tessellation would need 52 GB of memory
OPEN SHADING LANGUAGE
• OSL itself has nothing to do with OpenCL
• Many use cases
• Using OSL in OCL renderer
  ‒ Translate OSL to
    ‒ OCL kernel
    ‒ SPIR
  ‒ Feed those to OCL runtime
    ‒ clBuildProgram
    ‒ clCreateKernel
y OSL 
example 
surface 
maXe 
[[ 
string 
descrip8on 
= 
"Lamber8an 
diffuse 
material" 
]] 
(float 
Kd 
= 
1 
[[float 
UImin 
= 
0, 
float 
UIsozmax 
= 
1 
]], 
color 
Cs 
= 
1 
[[float 
UImin 
= 
0, 
float 
UImax 
= 
1 
]], 
string 
texname 
= 
“diffuse.tex” 
[[int 
texture_slot 
= 
1]] ) 
{ 
    Ci 
= 
Kd 
* 
Cs 
* 
noise(5.0 
* 
P) 
* 
diffuse 
(N); 
}
SPIR
• Standard Portable Intermediate Representation
• Based on LLVM IR (32-bit, 64-bit)
• Useful to ship OpenCL apps
• Device independent
• OpenCL did not have a usable binary code representation
  ‒ Binary for each device × driver
  ‒ Combinations explode
  ‒ Embed kernel as string
  ‒ Load source, clCreateProgramWithSource
  ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES
  ‒ Load binary, clCreateProgramWithBinary
• OpenCL implementation has to support the cl_khr_spir extension
  ‒ Works on AMD, Intel (OpenCL 1.2)
  ‒ SPIR 2.0 is coming with OpenCL 2.0
SPIR
CREATE SPIR BINARY
• Offline compiler
  ‒ clang-spir* -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-spir-compile-options "-x spir" -include <opencl_spir.h> -o <output> <input>
  ‒ clBuildProgram with "-x spir -spir-std=CL1.2"
• Use host OpenCL API
  ‒ clCompileProgram + Option
  ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
* https://github.com/KhronosGroup/SPIR

More Related Content

What's hot

Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunFive Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunElectronic Arts / DICE
 
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...Codemotion
 
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingSIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingElectronic Arts / DICE
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologyTiago Sousa
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility bufferWolfgang Engel
 
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)Introduction to Monte Carlo Ray Tracing (CEDEC 2013)
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)Takahiro Harada
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3guest11b095
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect AndromedaElectronic Arts / DICE
 
Moving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based RenderingMoving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based RenderingElectronic Arts / DICE
 
Star Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processingStar Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processingumsl snfrzb
 
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open Problems
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open ProblemsHPG 2018 - Game Ray Tracing: State-of-the-Art and Open Problems
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open ProblemsElectronic Arts / DICE
 
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3Electronic Arts / DICE
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The SurgeMichele Giacalone
 
Hable John Uncharted2 Hdr Lighting
Hable John Uncharted2 Hdr LightingHable John Uncharted2 Hdr Lighting
Hable John Uncharted2 Hdr Lightingozlael ozlael
 
Rendering Tech of Space Marine
Rendering Tech of Space MarineRendering Tech of Space Marine
Rendering Tech of Space MarinePope Kim
 
Lighting of Killzone: Shadow Fall
Lighting of Killzone: Shadow FallLighting of Killzone: Shadow Fall
Lighting of Killzone: Shadow FallGuerrilla
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnGuerrilla
 

What's hot (20)

Lighting the City of Glass
Lighting the City of GlassLighting the City of Glass
Lighting the City of Glass
 
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunFive Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
 
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...
An introduction to Realistic Ocean Rendering through FFT - Fabio Suriano - Co...
 
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingSIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
 
mssao presentation
mssao presentationmssao presentation
mssao presentation
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
 
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)Introduction to Monte Carlo Ray Tracing (CEDEC 2013)
Introduction to Monte Carlo Ray Tracing (CEDEC 2013)
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
 
Light prepass
Light prepassLight prepass
Light prepass
 
Moving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based RenderingMoving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based Rendering
 
Star Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processingStar Ocean 4 - Flexible Shader Managment and Post-processing
Star Ocean 4 - Flexible Shader Managment and Post-processing
 
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open Problems
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open ProblemsHPG 2018 - Game Ray Tracing: State-of-the-Art and Open Problems
HPG 2018 - Game Ray Tracing: State-of-the-Art and Open Problems
 
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The Surge
 
Hable John Uncharted2 Hdr Lighting
Hable John Uncharted2 Hdr LightingHable John Uncharted2 Hdr Lighting
Hable John Uncharted2 Hdr Lighting
 
Rendering Tech of Space Marine
Rendering Tech of Space MarineRendering Tech of Space Marine
Rendering Tech of Space Marine
 
Lighting of Killzone: Shadow Fall
Lighting of Killzone: Shadow FallLighting of Killzone: Shadow Fall
Lighting of Killzone: Shadow Fall
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
 

Similar to Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...RISC-V International
 
Plan_design and FPGA implement of MIMO OFDM SDM systems
Plan_design and FPGA implement of MIMO OFDM SDM systemsPlan_design and FPGA implement of MIMO OFDM SDM systems
Plan_design and FPGA implement of MIMO OFDM SDM systemsTan Vo
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...Edge AI and Vision Alliance
 
IRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing AlgorithmsIRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing AlgorithmsIRJET Journal
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processingAcad
 
Transport SDN & OpenDaylight Use Cases in Korea
Transport SDN & OpenDaylight Use Cases in KoreaTransport SDN & OpenDaylight Use Cases in Korea
Transport SDN & OpenDaylight Use Cases in KoreaJustin Park
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...RISC-V International
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
IRJET- Implementation of Reversible Radix-2 FFT VLSI Architecture using P...
IRJET-  	  Implementation of Reversible Radix-2 FFT VLSI Architecture using P...IRJET-  	  Implementation of Reversible Radix-2 FFT VLSI Architecture using P...
IRJET- Implementation of Reversible Radix-2 FFT VLSI Architecture using P...IRJET Journal
 
Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...IRJET Journal
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
Accelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsVivek Venugopalan
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...aaajjj4
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...Positive Hack Days
 
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...IRJET Journal
 
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...IRJET Journal
 

Similar to Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014) (20)

Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
 
Cadancesimulation
CadancesimulationCadancesimulation
Cadancesimulation
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
Plan_design and FPGA implement of MIMO OFDM SDM systems
Plan_design and FPGA implement of MIMO OFDM SDM systemsPlan_design and FPGA implement of MIMO OFDM SDM systems
Plan_design and FPGA implement of MIMO OFDM SDM systems
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
IRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing AlgorithmsIRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing Algorithms
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processing
 
Transport SDN & OpenDaylight Use Cases in Korea
Transport SDN & OpenDaylight Use Cases in KoreaTransport SDN & OpenDaylight Use Cases in Korea
Transport SDN & OpenDaylight Use Cases in Korea
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
IRJET- Implementation of Reversible Radix-2 FFT VLSI Architecture using P...
IRJET-  	  Implementation of Reversible Radix-2 FFT VLSI Architecture using P...IRJET-  	  Implementation of Reversible Radix-2 FFT VLSI Architecture using P...
IRJET- Implementation of Reversible Radix-2 FFT VLSI Architecture using P...
 
Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Accelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUs
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
 
4g lte matlab
4g lte matlab4g lte matlab
4g lte matlab
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
 
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
 
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...
Improvement in Computational Complexity of the MIMO ML Decoder in High Mobili...
 

More from Takahiro Harada

201907 Radeon ProRender2.0@Siggraph2019
201907 Radeon ProRender2.0@Siggraph2019201907 Radeon ProRender2.0@Siggraph2019
201907 Radeon ProRender2.0@Siggraph2019Takahiro Harada
 
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...Takahiro Harada
 
Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)Takahiro Harada
 
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering WorkflowTakahiro Harada
 
確率的ライトカリング 理論と実装 (CEDEC2016)
確率的ライトカリング 理論と実装 (CEDEC2016)確率的ライトカリング 理論と実装 (CEDEC2016)
確率的ライトカリング 理論と実装 (CEDEC2016)Takahiro Harada
 
Introducing Firerender for 3DS Max
Introducing Firerender for 3DS MaxIntroducing Firerender for 3DS Max
Introducing Firerender for 3DS MaxTakahiro Harada
 
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRaysTakahiro Harada
 
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...Takahiro Harada
 
Foveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsFoveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsTakahiro Harada
 
Physics Tutorial, GPU Physics (GDC2010)
Physics Tutorial, GPU Physics (GDC2010)Physics Tutorial, GPU Physics (GDC2010)
Physics Tutorial, GPU Physics (GDC2010)Takahiro Harada
 
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)Takahiro Harada
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Takahiro Harada
 
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)Takahiro Harada
 
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)Takahiro Harada
 
Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Takahiro Harada
 

More from Takahiro Harada (15)

201907 Radeon ProRender2.0@Siggraph2019
201907 Radeon ProRender2.0@Siggraph2019201907 Radeon ProRender2.0@Siggraph2019
201907 Radeon ProRender2.0@Siggraph2019
 
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...
[2018 GDC] Real-Time Ray-Tracing Techniques for Integration into Existing Ren...
 
Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)
 
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow
[2017 GDC] Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow
 
確率的ライトカリング 理論と実装 (CEDEC2016)
確率的ライトカリング 理論と実装 (CEDEC2016)確率的ライトカリング 理論と実装 (CEDEC2016)
確率的ライトカリング 理論と実装 (CEDEC2016)
 
Introducing Firerender for 3DS Max
Introducing Firerender for 3DS MaxIntroducing Firerender for 3DS Max
Introducing Firerender for 3DS Max
 
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays
[2016 GDC] Multiplatform GPU Ray-Tracing Solutions With FireRender and FireRays
 
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...
Introduction to Bidirectional Path Tracing (BDPT) & Implementation using Open...
 
Foveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUsFoveated Ray Tracing for VR on Multiple GPUs
Foveated Ray Tracing for VR on Multiple GPUs
 
Physics Tutorial, GPU Physics (GDC2010)
Physics Tutorial, GPU Physics (GDC2010)Physics Tutorial, GPU Physics (GDC2010)
Physics Tutorial, GPU Physics (GDC2010)
 
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
 
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
 
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)
A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)
 
Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

  • 1. INTRODUCTION TO MONTE CARLO RAY TRACING OPENCL IMPLEMENTATION TAKAHIRO HARADA 9/2014
  • 2. RECAP OF LAST SESSION · Talked about theory · BRDFs ‒ Reflection, Refraction, Diffuse, Microfacet · Fresnel is everywhere · Monte Carlo Ray Tracing ‒ Intuitive understanding of Monte Carlo Integration ‒ Simple sampling (Random sampling) ‒ Better sampling (Importance sampling) ‒ Layered material http://www.slideshare.net/takahiroharada/introduction-to-monte-carlo-ray-tracing-cedec2013 2 | Introduction to Monte Carlo Ray Tracing OpenCL implementation | SEPT 3, 2014
  • 3. REVIEW SIMPLE CPU MC RAY TRACER Direct illumination: for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } }
  • 4. REVIEW SIMPLE CPU MC RAY TRACER Indirect illumination: for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
  • 5. REVIEW SIMPLE CPU MC RAY TRACER Direct illumination: for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } } Indirect illumination: for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
  • 6. COMPARISON (images: direct illumination, indirect illumination)
  • 7. WHY OPENCL? · Speed! ‒ GPU can accelerate it ‒ Why? Faster is better · OpenCL is an API for GPU compute · OpenCL is not only for graphics programmers · OpenCL does not always require a GPU ‒ Runs on CPU too ‒ Runs if there is a CPU (everywhere) · If the renderer is written in OpenCL, it runs on Windows, Linux, Mac OS X
  • 8. PORTING TO OPENCL (FIRST ATTEMPT)
  • 9. THINGS TO BE DONE: DATA STRUCTURE · No pointers in OpenCL* · Change pointers to indices · Stored in flat memory · Not suited for partial updates *Shared Virtual Memory (OpenCL 2.0)
  • 10. THINGS TO BE DONE: DATA STRUCTURE · Node data for a binary tree ‒ Spatial acceleration structure (BVH) ‒ Shading network · Buffer<NodeData> nodeData; · Each Node Data entry: m_max.x, m_max.y, m_max.z, m_min.x, m_min.y, m_min.z, m_child0, m_child1
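The pointer-to-index change on slides 9 and 10 can be sketched in plain C. The field names come from the slide; the leaf convention (`m_child0 < 0`), the fixed stack size, and the point query are assumptions made only for illustration (the deck traverses with rays, not points).

```c
/* Flat node layout; field names follow the slide.
   Assumption: a leaf is marked by m_child0 < 0. */
typedef struct {
    float m_max[3];
    float m_min[3];
    int   m_child0;   /* index into the node array, not a pointer */
    int   m_child1;
} NodeData;

enum { STACK_MAX = 64 };   /* hypothetical traversal stack depth */

static int point_in_box(const NodeData *n, const float p[3]) {
    for (int k = 0; k < 3; ++k)
        if (p[k] < n->m_min[k] || p[k] > n->m_max[k]) return 0;
    return 1;
}

/* Counts the leaves whose box contains p, walking the tree with an
   explicit stack of node indices instead of recursion or pointers. */
int count_leaves_containing(const NodeData *nodes, int root, const float p[3]) {
    int stack[STACK_MAX];
    int top = 0, count = 0;
    stack[top++] = root;
    while (top > 0) {
        const NodeData *n = &nodes[stack[--top]];
        if (!point_in_box(n, p)) continue;
        if (n->m_child0 < 0) { ++count; continue; }   /* leaf */
        if (top + 2 > STACK_MAX) break;               /* stack guard */
        stack[top++] = n->m_child0;
        stack[top++] = n->m_child1;
    }
    return count;
}
```

Because the children are indices, the whole array can be copied to a device buffer unchanged, which is the point of storing the tree in flat memory.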
  • 11. THINGS TO BE DONE: DATA STRUCTURE · Material ‒ Texture entry · Buffer<Material> material; · Buffer<char> texData; · Buffer<uint> texTable; · Material (Material0, Material1, ...): m_kd, m_ior, m_..., m_kdTex, m_iorTex, m_bumpTex · TextureTable: tex0, tex1, tex2, tex3, tex4, ... · Texture (Texture0, Texture1, Texture2, Texture3): m_header, m_data
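One way to read the material, texture table, and texture data buffers on slide 11 is as a chain of indices and byte offsets. The sketch below assumes that `texTable` stores byte offsets into `texData`, that each texture starts with a small header, and that a slot of -1 means "no texture". None of these details are stated on the slide, and `TexHeader` and `tex_lookup` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the slide only names m_header and m_data. */
typedef struct { uint32_t width, height; } TexHeader;

typedef struct {
    float m_kd;
    float m_ior;
    int   m_kdTex;     /* slot in texTable, or -1 (assumed) for no texture */
    int   m_iorTex;
    int   m_bumpTex;
} Material;

/* Resolves a texture slot to its texel data and copies the header out.
   Returns NULL when the material has no texture in that slot. */
const unsigned char *tex_lookup(const unsigned char *texData,
                                const uint32_t *texTable,
                                int slot, TexHeader *hdr) {
    if (slot < 0) return NULL;
    uint32_t offset = texTable[slot];             /* byte offset into texData */
    memcpy(hdr, texData + offset, sizeof *hdr);   /* avoids unaligned reads */
    return texData + offset + sizeof *hdr;
}
```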
  • 12. THINGS TO BE DONE: WRITING OPENCL KERNEL CPU code: for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } OpenCL kernel: __kernel void PtKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) return; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
  • 13. IT WORKS BUT… · This approach is simple · But it has a lot of issues
  • 14. DRAWBACKS: PERFORMANCE · Likely does not utilize the hardware efficiently ‒ SIMD divergence ‒ GPU occupancy (latency) · Maintainability · Extendibility, Portability
  • 16. OPENCL ON CPU · Processing element executes Work item (thread) ‒ A SIMD lane (4*) · Compute unit executes Work group (thread group) ‒ A core (8*) ‒ # of processing elements != # of work items · Compute device executes Kernel (shader) ‒ A CPU ‒ # of compute units != # of work groups * On AMD FX-8350
  • 17. GPU VS CPU · Processing element executes Work item ‒ A SIMD lane (64*) · Compute unit executes Work group ‒ A SIMD engine (44x4*) ‒ # of processing elements != # of work items · Compute device executes Kernel ‒ A GPU ‒ # of compute units != # of work groups * On AMD Radeon R9 290X
  • 18. HIGH LEVEL DESCRIPTION · Today's GPU is similar to a CPU (if you look at a very high level) ‒ GPU is an extremely wide CPU ‒ Many cores ‒ Wide SIMD · AMD Radeon R9 290X GPU ‒ 176 = 44x4 SIMD engines (cores) ‒ 64 wide SIMD · But different in ‒ SIMD width (very wide) ‒ Limited local resources ‒ Strategy to hide latency · Knowing these differences is the key to exploiting the performance
  • 19. SIMD DIVERGENCE · SIMD execution = program counter is shared among SIMD lanes · If it diverges in branches, HW utilization decreases a lot (it gets easier to diverge on wide SIMD) int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } (diagram: Lane0 to Lane7)
  • 26. WIDE SIMD EXECUTION · SIMD execution = program counter is shared among SIMD lanes · If it diverges in branches, HW utilization decreases a lot (it gets easier to diverge on wide SIMD) int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } (diagram: Lane0 to Lane7)
  • 27. SIMD DIVERGENCE · SIMD execution = program counter is shared among SIMD lanes · If it diverges in branches, HW utilization decreases a lot (it gets easier to diverge on wide SIMD) int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; } (diagram: Lane0 to Lane7)
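The cost of divergence in `funcA` can be estimated with a toy model. The sketch below assumes that every branch body costs the same and that a SIMD group steps through each branch taken by at least one of its lanes while the other lanes are masked off. This is a simplification for illustration, not a description of any particular GPU, and `simd_utilization` is a hypothetical helper.

```c
/* branch[i] is the value of `a` computed by lane i (0 .. numBranches-1).
   Returns the fraction of lane-slots that do useful work. */
double simd_utilization(const int *branch, int lanes, int numBranches) {
    int passes = 0;   /* branch bodies the group has to step through */
    int useful = 0;   /* lane-slots that execute real work */
    for (int b = 0; b < numBranches; ++b) {
        int active = 0;
        for (int i = 0; i < lanes; ++i)
            if (branch[i] == b) ++active;
        if (active > 0) { ++passes; useful += active; }
    }
    if (passes == 0) return 0.0;
    return (double)useful / ((double)lanes * passes);
}
```

Under this model, eight lanes that all take the same branch run at full utilization, while eight lanes spread over all four branches spend three quarters of their lane-slots idle.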
  • 28. LATENCY · The highest latency comes from memory access · CPUs avoid it by having larger caches ‒ Latency of cache access is small (fast) · Most memory accesses do not go to memory · A CPU can run at full speed until a cache miss · The number of concurrent executions on the GPU is far larger than on the CPU ‒ More than 11k (= 44x4x64) work items · The GPU cache is not large enough to absorb memory requests from those if they all request different parts of memory · Strategy ‒ Keep memory access as local as possible (not realistic for practical apps) ‒ Use the GPU mechanism for latency hiding
  • 29. GPU LATENCY HIDING · GPU can execute at full speed if there are only ALU instructions (Inst. 0 - 2) * · Stalls on a memory access instruction (Inst. 3) * Can hide latency using logical vector (diagram: Inst. 0 to Inst. 4 across Lane0 to Lane3, with Inst. 3 marked MemAccess)
  • 30. GPU LATENCY HIDING · When stalled, switch to another work group · Could fill the stall with instructions from WG1 · A SIMD of the GPU needs to process multiple WGs at the same time to hide latency (or maximize its throughput) (diagram: WG0 Inst. 0 to Inst. 4 interleaved with WG1 Inst. 0 to Inst. 3 across Lane0 to Lane3)
  • 31. HOW MANY WGS CAN WE EXECUTE PER SIMD · 10 wavefronts (64 WIs) per SIMD is the max · It depends on local resource usage of the kernel · VGPR usage is often the problem · Share 256 VGPRs among n work groups ‒ 1 wavefront, 256 VGPRs ‒ 2 wavefronts, 128 VGPRs ‒ 4 wavefronts, 64 VGPRs ‒ 10 wavefronts, 25 VGPRs · Share 16KB LDS among n work groups ‒ 1 work group, 16KB ‒ 2 work groups, 8KB ‒ 4 work groups, 4KB · VGPRs ‒ Registers used by vector ALUs ‒ 64KB/SIMD ‒ 256 VGPRs/SIMD lane (= 64KB/64/4) · LDS (Local data share) ‒ 64KB/CU (CU == 4 SIMDs) ‒ 32KB/SIMD
  • 32. ADVICE TO REDUCE VGPR PRESSURE: GET MORE PERFORMANCE FROM GPU · Don't write a large kernel · If the program can be split into several pieces, split it into several kernels ‒ Single kernel approach ‒ VGPR usage of the kernel is 200 = max(60, 200, 10) ‒ 1 wavefront per SIMD ‒ Bad for latency hiding ‒ Multiple kernel approach ‒ FuncA: 4 wavefronts per SIMD ‒ FuncB: 1 wavefront per SIMD ‒ FuncC: 10 wavefronts per SIMD ‒ FuncB is bad, but FuncA and FuncC run fast · Helps the compiler too (diagram: FuncA (60 VGPRs), FuncB (200 VGPRs), FuncC (10 VGPRs), as a single kernel and as multiple kernels)
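The wavefront counts quoted on slides 31 and 32 follow from dividing the 256 VGPRs available per SIMD lane by the kernel's VGPR usage and capping the result at 10. A minimal sketch of that arithmetic is below. The function names are hypothetical, and real hardware allocates registers in blocks, so actual counts can differ slightly.

```c
enum { VGPRS_PER_LANE = 256, MAX_WAVES_PER_SIMD = 10 };

/* Wavefronts that fit on one SIMD for a kernel using `vgprs` registers.
   Returns 0 if the kernel asks for more registers than exist. */
int waves_per_simd(int vgprs) {
    if (vgprs <= 0) return MAX_WAVES_PER_SIMD;
    int n = VGPRS_PER_LANE / vgprs;
    return n > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : n;
}

/* A single fused kernel needs as many registers as its hungriest part. */
int fused_vgprs(const int *parts, int count) {
    int worst = 0;
    for (int i = 0; i < count; ++i)
        if (parts[i] > worst) worst = parts[i];
    return worst;
}
```

With the slide's numbers, the fused kernel needs max(60, 200, 10) = 200 VGPRs and gets one wavefront per SIMD, while the split kernels get 4, 1, and 10 wavefronts respectively.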
  • 33. WHAT IS THE SCALAR ARCHITECTURE? VLIW4 (NI), VLIW5 (EG): · Best for vector computation · Low efficiency on scalar computation · Physical concurrent execution ‒ 16 work items ‒ 4 ALU operations each ‒ Total: 16x4 ALU operations · 1 SIMD is running in a CU · Difficult to fill xyzw · If not filled, we waste HW cycles. Scalar Architecture (SI, CI): · Good for both vector and scalar computation · Need more work groups to fill GPU · Physical concurrent execution ‒ 16x4 work items ‒ 1 ALU operation each ‒ Total: 16x4 ALU operations · 4 SIMDs are running in a CU · 4x more work groups are necessary to fill HW (diagram: VLIW SIMD0 with XYZW slots for Lane 0 to Lane 15; scalar SIMD0 to SIMD3)
  • 34. ANOTHER SOLUTION · Splitting computation into multiple kernels ‒ Primary ray gen kernel ‒ Trace kernel ‒ Evaluate DI kernel ‒ Etc. · Better HW utilization ‒ Less divergence ‒ Higher HW occupancy
  • 35. OTHER BENEFITS? (images: Ray Generation, First Hit, First Hit (Normal), Direct Illumination, Indirect Illumination) · Maintainability ‒ Debugging is not as easy as in C or C++ ‒ Not all compilers are mature ‒ Can hit a compiler bug, which is hard to debug ‒ Helps the compiler ‒ By splitting kernels, we can isolate the issue ‒ If the code is developed by many people, this is important · Extendibility, Portability ‒ Easy to extend features ‒ Primary Ray Gen Kernel ‒ Add another camera projection ‒ Ray Casting Kernel ‒ Easy to add other primitives (e.g., vector displacement) ‒ Take it out for physics ray casting queries
  • 36. PORTING TO OPENCL (SECOND ATTEMPT)
  • 37. SPLITTING KERNELS: TRANSFORMING CPU CODE Naïve CPU implementation: forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } Preparing for OpenCL implementation: { forAll() PrimaryRayGenKernel(); while(1) { forAll() TraceKernel(); if( !any( hits ) ) break; forAll() SampleLightKernel(); forAll() TraceKernel(); forAll() AccumulateDIKernel(); forAll() SampleNextRayKernel(); } } Each for loop => a kernel execution
  • 38. SPLITTING KERNELS CPU implementation: forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } } Host code: { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
  • 39. SPLITTING KERNELS Host code: { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } } Device code: __kernel void PrimaryRayGenKernel(); · Generate rays for all pixels in parallel __kernel void TraceKernel(); · Compute intersection for all rays in parallel __kernel void SampleLightKernel(); · Sample light for all hit points in parallel __kernel void AccumulateDIKernel(); · Accumulate DI for all hit points in parallel __kernel void SampleNextRayKernel(); · Generate bounced rays for all hit points in parallel
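The host-side control flow of the split design can be mimicked on the CPU. In the sketch below, `launch` is a stand-in that simply loops over all work items, and the three toy kernels replace real intersection and shading with a per-ray bounce counter. The `Context` layout, the kernel bodies, and `render` are all hypothetical; only the loop structure mirrors the slide.

```c
typedef struct {
    int  n;             /* number of rays / pixels */
    int *bouncesLeft;   /* toy stand-in for scene intersection */
    int *hit;           /* result of the trace step */
    int *accum;         /* toy framebuffer */
} Context;

typedef void (*Kernel)(int gid, Context *ctx);

/* Stand-in for enqueueing a kernel: run it once per work item. */
static void launch(Kernel k, Context *ctx) {
    for (int gid = 0; gid < ctx->n; ++gid) k(gid, ctx);
}

static void TraceKernel(int gid, Context *c)         { c->hit[gid] = c->bouncesLeft[gid] > 0; }
static void AccumulateDIKernel(int gid, Context *c)  { if (c->hit[gid]) c->accum[gid] += 1; }
static void SampleNextRayKernel(int gid, Context *c) { if (c->hit[gid]) c->bouncesLeft[gid] -= 1; }

static int any(const int *flags, int n) {
    for (int i = 0; i < n; ++i) if (flags[i]) return 1;
    return 0;
}

/* Host loop: every step is a full pass over all rays, and the host
   decides between passes whether another bounce is needed. */
int render(Context *ctx) {
    int iterations = 0;
    for (;;) {
        launch(TraceKernel, ctx);
        if (!any(ctx->hit, ctx->n)) break;
        launch(AccumulateDIKernel, ctx);
        launch(SampleNextRayKernel, ctx);
        ++iterations;
    }
    return iterations;
}
```

Note that the loop keeps running until the longest path ends, so rays that terminated early sit idle in later passes unless the ray buffer is compacted.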
  • 40. DESIGN: LOCALIZE BRANCH (annotations: Camera Type, Brdf Type) { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
  • 41. DESIGN: RAY STATE · Cannot keep state between kernels · Ray state needs to be saved to/restored from global memory · Example ‒ Ray generation + ray direction visualization __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); } struct RayState { float4 m_throughput; int2 m_randomNumber; int m_pixelIdx; };
  • 42. DESIGN: RAY STATE · Cannot keep state between kernels · Ray state needs to be saved to/restored from global memory · Example ‒ Ray generation + ray direction visualization __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); } (diagram: PrimaryRayGenKernel work items, In: pixelIdx, Out: Ray; Global memory (state); VisualizeKernel work items, In: Ray, Out: Pixel color)
  • 43. RAY COMPACTION · Sparse data lowers SIMD utilization · Without compaction ‒ 3 SIMD executions ‒ Occupancy (3/8, 1/8, 4/8) (diagram: Primary, Secondary)
  • 44. RAY COMPACTION · Sparse data lowers SIMD utilization · Without compaction ‒ 3 SIMD executions ‒ Occupancy (3/8, 1/8, 4/8) · With compaction ‒ 1 SIMD execution ‒ Occupancy (7/8) *When rays are created for all pixels, this is not necessary (diagram: Primary, Secondary)
  • 45. RAY COMPACTION · Sparse data lowers SIMD utilization · Without compaction ‒ 3 SIMD executions ‒ Occupancy (3/8, 1/8, 4/8) · With compaction ‒ 1 SIMD execution ‒ Occupancy (7/8) *When rays are created for all pixels, this is not necessary · No need to write a compaction kernel · Can compact using global atomics ‒ Prepare a counter (gRayCount) ‒ Perform an atomic increment to reserve memory ‒ Better to do atomics in the WG first, then do one atomic add per WG __kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; } (diagram: Primary, Secondary)
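Both compaction variants mentioned on slide 45 can be sketched with C11 atomics standing in for OpenCL's global atomics. `emit_one` reserves one slot per surviving ray, as the slide's kernel does; `compact_group` counts survivors inside a group first and reserves a whole block with a single atomic add. The function names and the use of plain ray indices as payload are assumptions for illustration; on a GPU the in-group count would typically use local memory.

```c
#include <stdatomic.h>

/* One atomic per ray: reserve a single output slot and write to it. */
int emit_one(atomic_int *gRayCount, int *gOut, int srcIdx) {
    int dst = atomic_fetch_add(gRayCount, 1);
    gOut[dst] = srcIdx;
    return dst;
}

/* One atomic per work group: count the live rays in
   [first, first + groupSize), reserve that many slots at once,
   then write the survivors into the reserved block. */
int compact_group(atomic_int *gRayCount, const int *alive,
                  int first, int groupSize, int *gOut) {
    int local = 0;
    for (int i = 0; i < groupSize; ++i)
        if (alive[first + i]) ++local;
    int base = atomic_fetch_add(gRayCount, local);
    int k = 0;
    for (int i = 0; i < groupSize; ++i)
        if (alive[first + i]) gOut[base + k++] = first + i;
    return local;
}
```

The per-group variant issues far fewer operations on the shared counter, which is why the slide recommends it.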
  • 46. DIRECT ILLUMINATION COMPUTATION · SampleLightKernel ‒ Want to keep the work uniform ‒ A different # of light samples per ray isn't good ‒ Compute the contribution from one point on a light · Simple approach ‒ Select a light ‒ Select a point on the light ‒ Compute DI without the occlusion term · More sophisticated light sampling ‒ Using potential contribution for the PDF ‒ Forward+ style light culling __kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; }
  • 47. DIRECT ILLUMINATION COMPUTATION · SampleLightKernel · TraceRayKernel ‒ Check if the point on the light is visible or not ‒ Reuse code · AccumulateDIKernel ‒ If the ray is not blocked, accumulate the result __kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; } __kernel void AccumulateDIKernel(__global ...) { Hit shadowHit = gShadowHit[GIDX]; float4 di = gDi[GIDX]; if( !shadowHit ) gFb[GIDX] += di; }
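The accumulate step can be written as a plain loop over rays. One detail is an assumption on my part rather than something the slide shows: the sketch writes to the framebuffer through a per-ray pixel index, which would be needed once rays have been compacted and the ray index no longer equals the pixel index. The slide's kernel writes to `gFb[GIDX]` directly. `Color` and `accumulate_di` are hypothetical names.

```c
typedef struct { float r, g, b; } Color;

/* For every ray whose shadow ray reached the light unobstructed,
   add the precomputed direct-illumination term to its pixel. */
void accumulate_di(int numRays, const int *shadowHit, const Color *di,
                   const int *pixelIdx, Color *fb) {
    for (int gid = 0; gid < numRays; ++gid) {
        if (shadowHit[gid]) continue;        /* light sample is occluded */
        Color *dst = &fb[pixelIdx[gid]];     /* assumed indirection */
        dst->r += di[gid].r;
        dst->g += di[gid].g;
        dst->b += di[gid].b;
    }
}
```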
  • 48. SAMPLE NEXT RAY · Compute next ray by sampling BRDF · Store ray and ray state __kernel void SampleNextRayKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; Hit hit = gHit[GIDX]; if( !hit ) return; nextRay, s = Brdf_Sample( ray, s ); int dst = atom_inc( &gRayCount ); gRayNext[dst] = nextRay; gRayStateNext[dst] = s; }
  • 49. TRACE KERNEL · BVH is used for acceleration structure ‒ Index is used to describe hierarchy structure (no pointer) (diagram: node array 0 to 9; tree nodes 0 to 6 with leaves Mesh0 xform0, Mesh1 xform1, Mesh2 xform2, Mesh3 xform3)
  • 50. TRACE KERNEL · BVH is used for acceleration structure ‒ Index is used to describe hierarchy structure (no pointer) · 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ Bottom: stores a primitive (triangle, quad) in a leaf (diagram: node array 0 to 9; Top BVH nodes 0 to 6 with leaves Mesh0 xform0, Mesh1 xform1, Mesh2 xform2, Mesh3 xform3; Bottom BVH)
  • 51. TRACE KERNEL · BVH is used for acceleration structure ‒ Index is used to describe hierarchy structure (no pointer) · 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ Bottom: stores a primitive (triangle, quad) in a leaf · Store those BVHs in a single memory ‒ Traverse top tree ‒ Hit a leaf, transform the ray into object space ‒ Traverse bottom tree ‒ On exit, transform the ray back to world space (diagram: single memory holding root idx, Top, Bottom A, Bottom B, Bottom C, Bottom D; Top BVH nodes 0 to 6 with leaves Mesh0 xform0, Mesh1 xform1, Mesh2 xform2, Mesh3 xform3; Bottom BVH)
  • 52. SO FAR · Explained an OpenCL implementation of a simple path tracer · Easy to extend from here · Extension can be done by swapping one or two kernels ‒ Material system, Shader ‒ Light sampling ‒ Support for different types of primitives ‒ Ray caster + spatial acceleration structure
  • 54. INSTANCING · Powerful technique to increase geometric complexity · Small memory overhead ‒ Shares geometric information (vertex, normal, etc.) ‒ Shares BVH ‒ Stores object transform (diagram: Top BVH nodes 0 to 6 with leaves Mesh0 xform0, Mesh0 xform1, Mesh0 xform2, Mesh1 xform3)
  • 55. INSTANCING · Powerful technique to increase geometric complexity · Small memory overhead ‒ Shares geometric information (vertex, normal, etc.) ‒ Shares BVH ‒ Stores object transform (diagram: single memory holding Top, Bottom A, Bottom B; Top BVH nodes 0 to 6 with leaves Mesh0 xform0, Mesh0 xform1, Mesh0 xform2, Mesh1 xform3; Bottom BVH)
  • 56. LAYERED MATERIAL · Binary tree of BRDFs · Leaf node ‒ BRDF · Internal node ‒ Blend function ‒ Fresnel blend, Linear blend · Evaluate one BRDF at a time ‒ Traverse binary tree ‒ Random sampling at internal node (diagram: tree with leaves Reflect, Diffuse, Microfacet; weights 0.5 / 0.5 at each internal node; leaf pdfs 0.25, 0.5, 0.25)
  • 57. LAYERED MATERIAL · Binary tree of BRDFs · Leaf node ‒ BRDF · Internal node ‒ Blend function ‒ Fresnel blend, Linear blend · Evaluate one BRDF at a time ‒ Traverse binary tree ‒ Random sampling at internal node { launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SelectBRDFKernel ); launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
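Selecting a single BRDF by random sampling at each internal node can be sketched as a walk down an index-based binary tree. The node layout, the leaf convention (`child0 < 0`), and the fixed blend weights are assumptions. A Fresnel blend would compute the weight from the viewing angle instead of storing it. Which leaf on the slide carries which pdf is not recoverable from the transcript, so the tree used in the test is only one plausible reading.

```c
enum { BRDF_REFLECT = 0, BRDF_DIFFUSE = 1, BRDF_MICROFACET = 2 };

typedef struct {
    int   child0, child1;   /* node indices; child0 < 0 marks a leaf (assumed) */
    float weight0;          /* probability of descending into child0 */
    int   brdfId;           /* meaningful for leaves only */
} BrdfNode;

/* Walks from the root, consuming one uniform number in [0,1) per
   internal node. Returns the chosen BRDF id and stores the probability
   of having chosen it in *pdf. Returns -1 if u[] runs out. */
int select_brdf(const BrdfNode *nodes, int root,
                const float *u, int maxDepth, float *pdf) {
    int idx = root;
    *pdf = 1.0f;
    for (int depth = 0; nodes[idx].child0 >= 0; ++depth) {
        if (depth >= maxDepth) return -1;
        const BrdfNode *n = &nodes[idx];
        if (u[depth] < n->weight0) { *pdf *= n->weight0;        idx = n->child0; }
        else                       { *pdf *= 1.0f - n->weight0; idx = n->child1; }
    }
    return nodes[idx].brdfId;
}
```

The returned pdf is the product of the branch probabilities along the path, so the contribution of the chosen BRDF can be divided by it to keep the estimate unbiased.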
  • 59. VR · Latency is super important · To improve frame rendering time, ‒ Used multiple GPUs ‒ Foveated rendering · More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample
  • 60. VR · Latency is important · To improve frame rendering time, ‒ Used multiple GPUs ‒ Foveated rendering · More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample { launch( VRPrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } launch( FillPixelKernel ); }
  • 61. DISPLACEMENT MAPPING y Powerful technique to increase geometric complexity y Pre tessella8on ‒ Required memory is too large ‒ GPU memory is too small y Direct ray tracing ‒ When hit a patch, tessellate and displace Base mesh Vector displacement map With vector displacement 61 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014 Fig. from hXp://support.nextlimit.com/display/mxdocsv3/Displacement+component
  • 62. DISPLACEMENT MAPPING
      • To amortize tessellation and displacement cost, batch ray intersections
      • Need to change TraceKernel
  • 63. DISPLACEMENT MAPPING
      • TraceKernel
          ‒ If a ray hits a quad with a displacement map, save the (ray, primitive) pair to a buffer
          ‒ Sort (ray, primitive) pairs by primitive index
          ‒ Process primitives in the list in parallel
      • For each patch
          ‒ Build quad BVH in parallel
          ‒ Cast rays in parallel
      • Key is work buffer memory allocation
      [Figure: per-level pipeline with stages labeled "BVH Comt" and "Ray Cast": Level 0 (1 node), Level 1 (4 nodes), Level 2 (16 nodes)]
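The sort-and-batch step can be sketched on the CPU as below. This is an illustration under stated assumptions rather than the modified TraceKernel: `RayPrimPair`, `Batch`, `build_batches`, and `quad_bvh_node_count` are hypothetical names, and the standard library `qsort` stands in for whatever parallel sort a GPU implementation would use. After sorting, all rays that hit the same displaced quad are contiguous, so each quad can be tessellated once and intersected against its whole batch of rays. The last helper counts the nodes of a complete quad BVH (1 + 4 + 16 + ...), which is the kind of quantity a per-patch work buffer would have to be sized for.

```c
#include <assert.h>
#include <stdlib.h>

/* One deferred intersection: ray index and the displaced quad it hit. */
typedef struct { int ray; int prim; } RayPrimPair;

/* A contiguous run of pairs that share the same primitive. */
typedef struct { int prim; int first; int count; } Batch;

/* Order by primitive index, then by ray index for a deterministic result. */
static int compare_by_prim(const void *a, const void *b)
{
    const RayPrimPair *pa = (const RayPrimPair *)a;
    const RayPrimPair *pb = (const RayPrimPair *)b;
    if (pa->prim != pb->prim)
        return (pa->prim > pb->prim) - (pa->prim < pb->prim);
    return (pa->ray > pb->ray) - (pa->ray < pb->ray);
}

/* Sorts pairs in place and writes one Batch per distinct primitive.
   batches must have room for n entries. Returns the number of batches. */
static int build_batches(RayPrimPair *pairs, int n, Batch *batches)
{
    assert(n >= 0);
    if (n == 0) return 0;
    assert(pairs != NULL && batches != NULL);

    qsort(pairs, (size_t)n, sizeof pairs[0], compare_by_prim);

    int nb = 0;
    for (int i = 0; i < n; ++i) {
        if (nb == 0 || batches[nb - 1].prim != pairs[i].prim) {
            batches[nb].prim = pairs[i].prim;
            batches[nb].first = i;
            batches[nb].count = 0;
            ++nb;
        }
        ++batches[nb - 1].count;
    }
    return nb;
}

/* Total nodes in a complete quad BVH with the given number of levels:
   level 0 has 1 node, level 1 has 4, level 2 has 16, and so on. */
static size_t quad_bvh_node_count(int levels)
{
    size_t total = 0;
    size_t level_nodes = 1;
    for (int l = 0; l < levels; ++l) {
        total += level_nodes;
        level_nodes *= 4;
    }
    return total;
}
```

Each resulting `Batch` corresponds to one patch: its BVH is built once and the rays in `pairs[first .. first + count - 1]` are cast against it.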
  • 64. VECTOR DISPLACEMENT IN ACTION
      [Figure: base mesh and the result with vector displacement]
      52 GB of memory would be needed if pre-tessellation were used
  • 65. OPEN SHADING LANGUAGE
      • OSL itself has nothing to do with OpenCL
      • Many use cases
      • Using OSL in OCL renderer
          ‒ Translate OSL to
              ‒ OCL kernel
              ‒ SPIR
          ‒ Feed those to OCL runtime
              ‒ clBuildProgram
              ‒ clCreateKernel
      • OSL example
          surface matte
              [[ string description = "Lambertian diffuse material" ]]
          (
              float Kd = 1 [[ float UImin = 0, float UIsoftmax = 1 ]],
              color Cs = 1 [[ float UImin = 0, float UImax = 1 ]],
              string texname = "diffuse.tex" [[ int texture_slot = 1 ]]
          )
          {
              Ci = Kd * Cs * noise(5.0 * P) * diffuse(N);
          }
  • 66. SPIR y Standard Portable Intermediate Representa8on y Based on LLVM IR (32, 64) y Useful to ship OpenCL Apps y Device independent y OpenCL did not have usable binary code representa8on ‒ Binary for each device x driver ‒ Combina8on explode ‒ Embed kernel as string ‒ Load source, clCreateProgramWithSource ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES ‒ Load binary, clCreateProgramWithBinary y OpenCL implementa8on has to support cl_khr_spir extension ‒ Works on AMD, Intel (OpenCL 1.2) ‒ SPIR 2.0 is coming with OpenCL 2.0 66 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
  • 67. SPIR: CREATE SPIR BINARY
      • Offline compiler
          ‒ clang-spir* -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-spir-compile-options "-x spir" -include <opencl_spir.h> -o <output> <input>
          ‒ clBuildProgram with "-x spir -spir-std=CL1.2"
      • Use host OpenCL API
          ‒ clCompileProgram + Option
          ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
      *https://github.com/KhronosGroup/SPIR