Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)

INTRODUCTION TO MONTE CARLO RAY TRACING: OPENCL IMPLEMENTATION
TAKAHIRO HARADA
9/2014

RECAP OF LAST SESSION
• Talked about theory
• BRDFs
  ‒ Reflection, Refraction, Diffuse, Microfacet
• Fresnel is everywhere
• Monte Carlo Ray Tracing
  ‒ Intuitive understanding of Monte Carlo Integration
  ‒ Simple sampling (Random sampling)
  ‒ Better sampling (Importance sampling)
  ‒ Layered material
http://www.slideshare.net/takahiroharada/introduction-to-monte-carlo-ray-tracing-cedec2013
2 | Introduction to Monte Carlo Ray Tracing OpenCL implementation | SEPT 3, 2014

REVIEW: SIMPLE CPU MC RAY TRACER
Direct illumination
    for( i, j )
    {
        ray = PrimaryRayGen( camera, pixelLoc );
        {
            hit = Trace( ray );
            if( hit )
                d( pixelLoc ) += EvaluateDI( ray, hit );
        }
    }

REVIEW: SIMPLE CPU MC RAY TRACER
Indirect illumination
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }


COMPARISON
[Figure: rendered images comparing direct illumination and indirect illumination]

WHY OPENCL?
• Speed!
  ‒ GPU can accelerate it
  ‒ Why? The faster, the better
• OpenCL is an API for GPU compute
• OpenCL is not only for graphics programmers
• OpenCL does not always require a GPU
  ‒ Runs on CPU too
  ‒ Runs if there is a CPU (everywhere)
• If the renderer is written in OpenCL, it runs on Windows, Linux, and Mac OS X

PORTING TO OPENCL (FIRST ATTEMPT)

THINGS TO BE DONE: DATA STRUCTURE
• No pointers in OpenCL*
• Change pointers to indices
• Data is stored in flat memory
• Not suited for partial updates
*Shared Virtual Memory (OpenCL 2.0)

THINGS TO BE DONE: DATA STRUCTURE
• Node data for a binary tree
  ‒ Spatial acceleration structure (BVH)
  ‒ Shading network
• Buffer<NodeData> nodeData;
[Figure: nodeData drawn as a flat array of NodeData records, each holding m_max.x, m_max.y, m_max.z, m_min.x, m_min.y, m_min.z, m_child0, m_child1]
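The index-based node layout above can be sketched in plain C. This is only an illustration under stated assumptions, not the talk's code: the field names follow the slide, while the leaf convention (a negative m_child0) and the helper count_leaves are hypothetical and introduced for this example.

```c
/* One BVH node. Children are indices into the same flat array, not pointers. */
typedef struct {
    float m_max[3];   /* AABB max (m_max.x, m_max.y, m_max.z on the slide) */
    float m_min[3];   /* AABB min */
    int   m_child0;   /* index of first child, or -1 for a leaf (assumed convention) */
    int   m_child1;   /* index of second child, or -1 for a leaf */
} NodeData;

/* Count the leaves below `root` with an explicit stack, the way a kernel
   would have to (OpenCL C has no recursion). Assumes depth < 64. */
int count_leaves(const NodeData *nodes, int root)
{
    int stack[64];
    int top = 0, leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        const NodeData *n = &nodes[stack[--top]];
        if (n->m_child0 < 0) { leaves++; continue; }  /* leaf */
        stack[top++] = n->m_child0;                    /* follow indices */
        stack[top++] = n->m_child1;
    }
    return leaves;
}
```

Because the array contains no addresses, the same bytes can be copied to a device buffer unchanged, which is the point of replacing pointers with indices.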

THINGS TO BE DONE: DATA STRUCTURE
• Material
  ‒ Texture entry
• Buffer<Material> material;
• Buffer<char> texData;
• Buffer<uint> texTable;
[Figure: Material0, Material1, ... (m_kd, m_ior, m_..., m_kdTex, m_iorTex, m_bumpTex); TextureTable (tex0, tex1, tex2, tex3, tex4, ...); Texture0 to Texture3 (m_header, m_data)]
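One possible reading of the three buffers above, sketched in C: a material stores a texture index, texTable maps that index to a byte offset, and the texture itself (header plus data) lives in texData. The talk does not spell out these details, so the TexHeader fields, the "offset in bytes" meaning of texTable, the one-byte texel format and the function name are all assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical header stored at the start of each texture inside texData. */
typedef struct { int m_width; int m_height; } TexHeader;

/* A material refers to its textures by table index instead of by pointer. */
typedef struct {
    float m_kd[4];
    float m_ior;
    int   m_kdTex;   /* index into texTable, -1 if there is no texture (assumed) */
} Material;

/* Fetch texel (x, y) of the kd texture as one byte, or -1 without a texture.
   Assumes texTable[i] is the byte offset of texture i in texData and that
   texel data follows the header, one byte per texel. */
int Material_fetchKdTexel(const Material *m, const unsigned *texTable,
                          const char *texData, int x, int y)
{
    if (m->m_kdTex < 0) return -1;
    const char *tex = texData + texTable[m->m_kdTex];
    TexHeader h;
    memcpy(&h, tex, sizeof h);   /* copy out the header, avoids unaligned reads */
    const unsigned char *data = (const unsigned char *)tex + sizeof h;
    return data[y * h.m_width + x];
}
```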

THINGS TO BE DONE: WRITING OPENCL KERNEL
CPU code
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
OpenCL kernel
    __kernel void PtKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                return;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }

IT WORKS, BUT…
• This approach is simple
• But it has a lot of issues

DRAWBACKS: PERFORMANCE
• Likely does not utilize the hardware efficiently
  ‒ SIMD divergence
  ‒ GPU occupancy (latency)
• Maintainability
• Extendibility, Portability

OPENCL ON CPU
• Processing element executes Work item (thread)
  ‒ A SIMD lane (4*)
• Compute unit executes Work group (thread group)
  ‒ A core (8*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel (shader)
  ‒ A CPU
  ‒ # of compute units != # of work groups
* On AMD FX-8350
[Figure: a Kernel is made of Work groups, each made of Work items; Work groups run on Compute Units, Work items on Processing elements]

GPU VS CPU
• Processing element executes Work item
  ‒ A SIMD lane (64*)
• Compute unit executes Work group
  ‒ A SIMD engine (44x4*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel
  ‒ A GPU
  ‒ # of compute units != # of work groups
* On AMD Radeon R9 290X
[Figure: the same Kernel / Work group / Work item hierarchy mapped onto a CPU Compute Unit (4 Processing elements) and a GPU Compute Unit (64 Processing elements)]

HIGH LEVEL DESCRIPTION
• Today's GPU is similar to a CPU (if you look at it at a very high level)
  ‒ A GPU is an extremely wide CPU
  ‒ Many cores
  ‒ Wide SIMD
• AMD Radeon R9 290X GPU
  ‒ 176 = 44x4 SIMD engines (cores)
  ‒ 64-wide SIMD
• But different in
  ‒ SIMD width (very wide)
  ‒ Limited local resources
  ‒ Strategy to hide latency
• Knowing these differences is the key to exploiting the performance

SIMD DIVERGENCE
• SIMD execution = the program counter is shared among SIMD lanes
• If execution diverges at branches, HW utilization decreases a lot
  (divergence becomes more likely on wide SIMD)
    int funcA()
    {
        int value = 0;
        int a = computeA();
        if( a == 0 )
            value = compute0();
        else if( a == 1 )
            value = compute1();
        else if( a == 2 )
            value = compute2();
        else if( a == 3 )
            value = compute3();
        return value;
    }
[Figure: Lane0 to Lane7 stepping through funcA; lanes whose branch is not being executed sit idle]
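The cost of the if/else chain above can be estimated with a small C model. It is a rough sketch under simplifying assumptions that are not from the talk: every computeN() costs the same, and the SIMD executes each branch taken by at least one lane as a separate pass over all lanes. The function name simd_utilization is hypothetical.

```c
/* Estimate SIMD utilization of funcA's branch chain.
   a[i] is the value computeA() returned on lane i, assumed to be in 0..3,
   and width (> 0) is the number of SIMD lanes. */
double simd_utilization(const int *a, int width)
{
    int taken[4] = {0, 0, 0, 0};
    for (int i = 0; i < width; i++)
        taken[a[i]] = 1;                    /* some lane needs this branch */
    int passes = taken[0] + taken[1] + taken[2] + taken[3];
    /* Each pass occupies all `width` lanes, but across all passes every lane
       does useful work exactly once, so `width` lane-slots are useful. */
    return (double)width / (double)(passes * width);
}
```

With all lanes agreeing the model gives 1.0; with four different values spread over the lanes it gives 0.25, regardless of SIMD width. A wider SIMD simply makes it more likely that at least one lane takes each branch.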


LATENCY
• The highest latency comes from memory access
• A CPU mitigates it by having a larger cache
  ‒ Latency of a cache access is small (fast)
• Most memory accesses do not go to memory
• The CPU can run at full speed until a cache miss
• The # of concurrent executions on the GPU is far larger than on the CPU
  ‒ More than 11k (= 44x4x64) work items
• The GPU cache is not large enough to absorb the memory requests from those work items if they all request different parts of memory
• Strategy
  ‒ Keep memory access as local as possible (not realistic for practical apps)
  ‒ Use the GPU mechanism for latency hiding

GPU LATENCY HIDING
• GPU can execute at full speed if there are only ALU instructions (Inst. 0 - 2)*
• Stalls on a memory access instruction (Inst. 3)
* Can hide latency using logical vector
[Figure: Lane0 to Lane3 execute Inst. 0, Inst. 1, Inst. 2, Inst. 3 (MemAccess), then stall before Inst. 4]

GPU LATENCY HIDING
• When stalled, switch to another work group
• Could fill the stall with instructions from WG1
• A SIMD of the GPU needs to process multiple WGs at the same time to hide latency (or maximize its throughput)
[Figure: Lane0 to Lane3 run WG0: Inst. 0 to Inst. 3; during the memory stall they run WG1: Inst. 0 to Inst. 3; WG0: Inst. 4 follows]

HOW MANY WGS CAN WE EXECUTE PER SIMD
• 10 wavefronts (64 WIs) per SIMD is the max
• It depends on the local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n work groups
  ‒ 1 wavefront, 256 VGPRs (bad)
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs (good)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB (bad)
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB (good)
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMD)
  ‒ 32KB/SIMD
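The VGPR table above amounts to an integer division with a cap, which can be written as a short C helper. The constants 256 and 10 are the figures quoted on the slide; the function name is hypothetical, and the sketch ignores details such as register allocation granularity and LDS limits, which can lower the real number.

```c
/* Wavefronts that can be resident on one SIMD when limited only by VGPRs.
   256 VGPRs per lane and the cap of 10 wavefronts are taken from the slide. */
int wavefronts_per_simd(int vgprs_per_work_item)
{
    const int kVgprBudget = 256;
    const int kMaxWavefronts = 10;
    if (vgprs_per_work_item <= 0)
        return kMaxWavefronts;
    int n = kVgprBudget / vgprs_per_work_item;   /* rounds down */
    return n > kMaxWavefronts ? kMaxWavefronts : n;
}
```

For example, a kernel using 60 VGPRs gets 4 wavefronts per SIMD, one using 200 gets 1, and one using 10 hits the cap of 10.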

ADVICE TO REDUCE VGPR PRESSURE: GET MORE PERFORMANCE FROM GPU
• Don't write a large kernel
• If the program can be split into several pieces, split it into several kernels
  ‒ Single kernel approach
    ‒ VGPR usage of the kernel is 200 = max(60, 200, 10)
    ‒ 1 wavefront per SIMD
    ‒ Bad for latency hiding
  ‒ Multiple kernel approach
    ‒ FuncA: 4 wavefronts per SIMD
    ‒ FuncB: 1 wavefront per SIMD
    ‒ FuncC: 10 wavefronts per SIMD
    ‒ FuncB is bad, but FuncA and FuncC run fast
• Helps the compiler too
[Figure: a Single Kernel containing FuncA (60 VGPRs), FuncB (200 VGPRs) and FuncC (10 VGPRs), versus Multiple Kernels with one kernel each for FuncA (60 VGPRs), FuncB (200 VGPRs) and FuncC (10 VGPRs)]

WHAT IS THE SCALAR ARCHITECTURE?
VLIW4 (NI), VLIW5 (EG)
• Best for vector computation
• Low efficiency on scalar computation
• Physical concurrent execution
  ‒ 16 work items
  ‒ 4 ALU operations each
  ‒ Total: 16x4 ALU operations
• 1 SIMD is running in a CU
• Difficult to fill xyzw
• If not filled, we waste HW cycles
Scalar Architecture (SI, CI)
• Good for both vector and scalar computation
• Need more work groups to fill the GPU
• Physical concurrent execution
  ‒ 16x4 work items
  ‒ 1 ALU operation each
  ‒ Total: 16x4 ALU operations
• 4 SIMDs are running in a CU
• 4x more work groups are necessary to fill the HW
[Figure: VLIW: Lane 0 to Lane 15, each with X, Y, Z, W slots; scalar architecture: four SIMDs per CU, each lane executing one operation]

ANOTHER SOLUTION
• Splitting computation into multiple kernels
  ‒ Primary ray gen kernel
  ‒ Trace kernel
  ‒ Evaluate DI kernel
  ‒ Etc.
• Better HW utilization
  ‒ Less divergence
  ‒ Higher HW occupancy

OTHER BENEFITS?
[Figure: intermediate images labeled Ray Generation, First Hit, First Hit (Normal), Direct Illumination, Indirect Illumination]
• Maintainability
  ‒ Debugging is not as easy as it is in C or C++
  ‒ Not all compilers are mature
    ‒ Can hit a compiler bug, which is hard to debug
    ‒ Helps the compiler
  ‒ By splitting kernels, we can isolate the issue
  ‒ If the code is developed by many people, this is important
• Extendibility, Portability
  ‒ Easy to extend features
  ‒ Primary Ray Gen Kernel
    ‒ Add another camera projection
  ‒ Ray Casting Kernel
    ‒ Easy to add other primitives (e.g., vector displacement)
    ‒ Take it out for physics ray casting queries

PORTING TO OPENCL (SECOND ATTEMPT)

SPLITTING KERNELS: TRANSFORMING CPU CODE
Naïve CPU implementation
    forAll()
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
Preparing for OpenCL implementation
    {
        forAll() PrimaryRayGenKernel();
        while(1)
        {
            forAll() TraceKernel();
            if( !any( hits ) )
                break;
            forAll() SampleLightKernel();
            forAll() TraceKernel();
            forAll() AccumulateDIKernel();
            forAll() SampleNextRayKernel();
        }
    }
Each for loop => a kernel execution

SPLITTING KERNELS
CPU implementation: the naïve per-pixel loop shown on the previous slide
Host code
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
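To make the control flow of the host loop concrete, here is a CPU-only emulation in C. It is a toy, not a renderer, and nothing in it comes from the talk beyond the kernel names: a "ray" is reduced to a count of remaining bounces, Trace reports a hit while bounces remain, and the light sampling and shadow-ray steps are left out. launch() stands in for enqueueing a kernel over all work items.

```c
#define N 4   /* number of paths (work items) in this toy */

/* Per-path state kept in "global memory" arrays between kernels. */
int gBouncesLeft[N];   /* stand-in for ray + scene: hits still to come */
int gHit[N];
int gFb[N];            /* accumulates one unit per hit */

static void TraceKernel(int i)         { gHit[i] = gBouncesLeft[i] > 0; }
static void AccumulateDIKernel(int i)  { if (gHit[i]) gFb[i] += 1; }
static void SampleNextRayKernel(int i) { if (gHit[i]) gBouncesLeft[i] -= 1; }

/* forAll(): on a device this would be one kernel launch over N work items. */
static void launch(void (*kernel)(int))
{
    for (int i = 0; i < N; i++) kernel(i);
}

static int any_hit(void)
{
    for (int i = 0; i < N; i++) if (gHit[i]) return 1;
    return 0;
}

/* Host loop. Returns the number of iterations that found at least one hit. */
int render(const int *bounces)
{
    int iterations = 0;
    for (int i = 0; i < N; i++) { gBouncesLeft[i] = bounces[i]; gFb[i] = 0; }
    for (;;) {
        launch(TraceKernel);
        if (!any_hit()) break;           /* if( !any( hits ) ) break; */
        launch(AccumulateDIKernel);
        launch(SampleNextRayKernel);
        iterations++;
    }
    return iterations;
}
```

Note that the loop keeps running until the longest path has terminated, so every launch also covers paths that are already finished.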

SPLITTING KERNELS
Host code: the launch loop shown on the previous slide
Device code
    __kernel void PrimaryRayGenKernel();
• Generate rays for all pixels in parallel
    __kernel void TraceKernel();
• Compute intersections for all rays in parallel
    __kernel void SampleLightKernel();
• Sample a light for all hit points in parallel
    __kernel void AccumulateDIKernel();
• Accumulate DI for all hit points in parallel
    __kernel void SampleNextRayKernel();
• Generate bounced rays for all hit points in parallel

DESIGN: LOCALIZE BRANCH
[Figure: the host launch loop from the previous slides, annotated with the labels Camera Type and Brdf Type]

DESIGN: RAY STATE
• Cannot keep state between kernels
• Ray state needs to be saved to/restored from global memory
• Example
  ‒ Ray generation + ray direction visualization
    __kernel void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState; // save
    }
    __kernel void VisualizeRayKernel(__global ...)
    {
        RayState s = gRayState[get_global_id(0)]; // restore
        Ray ray = gRay[get_global_id(0)];
        gFb[s.m_pixelIdx] = Ray_getDir( ray );
    }
    struct RayState
    {
        float4 m_throughput;
        int2 m_randomNumber;
        int m_pixelIdx;
    };

DESIGN: RAY STATE
[Figure: PrimaryRayGenKernel work items (In: pixelIdx, Out: Ray) write to global memory (state); VisualizeKernel work items (In: Ray, Out: Pixel color) read from it]


RAY COMPACTION
• Sparse data lowers SIMD utilization
• Without compaction
  ‒ 3 SIMD executions
  ‒ Occupancy (3/8, 1/8, 4/8)
• With compaction
  ‒ 1 SIMD execution
  ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
• No need to write a compaction kernel
• Can compact using global atomics
  ‒ Prepare a counter (gRayCount)
  ‒ Perform an atomic increment to reserve memory
  ‒ Better to do atomics within the WG first, then do one atomic add per WG
    __kernel void ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState;
    }
[Figure: Primary and Secondary ray buffers, without and with compaction]
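The atomic-counter compaction described above can be sketched on the CPU with C11 atomics. This is an illustration under assumptions, not the talk's kernel: rays are plain ints, the loops run serially where a GPU would run one work item per iteration, and the names compact and compact_grouped are hypothetical. atomic_fetch_add plays the role of atom_inc on gRayCount.

```c
#include <stdatomic.h>

/* Append every live ray to `out`; returns the compacted count.
   One atomic increment per live ray reserves its output slot. */
int compact(const int *alive, const int *rays, int n, int *out)
{
    atomic_int counter = 0;                       /* gRayCount */
    for (int i = 0; i < n; i++) {
        if (!alive[i]) continue;
        int dst = atomic_fetch_add(&counter, 1);  /* reserve one slot */
        out[dst] = rays[i];
    }
    return atomic_load(&counter);
}

/* Variant following the slide's advice: count live rays per work group
   first, then reserve the whole range with one atomic add per group. */
int compact_grouped(const int *alive, const int *rays, int n,
                    int group_size, int *out)
{
    atomic_int counter = 0;
    for (int base = 0; base < n; base += group_size) {
        int end = base + group_size < n ? base + group_size : n;
        int local = 0;
        for (int i = base; i < end; i++) local += alive[i] != 0;
        int dst = atomic_fetch_add(&counter, local);  /* one atomic per group */
        for (int i = base; i < end; i++)
            if (alive[i]) out[dst++] = rays[i];
    }
    return atomic_load(&counter);
}
```

In this serial emulation the output keeps the input order. On a GPU the order in which work items obtain slots is not defined, which is why the ray state carries its own pixel index.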

DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
  ‒ Want to keep the work uniform
  ‒ A different # of light samples per ray isn't good
  ‒ Compute the contribution from one point on a light
• Simple approach
  ‒ Select a light
  ‒ Select a point on the light
  ‒ Compute DI without the occlusion term
• More sophisticated light sampling
  ‒ Using potential contribution for the PDF
  ‒ Forward+ style light culling
    __kernel void SampleLightKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        shadowRay, lfnDotV = Light_Sample( ray, s );
        gShadowRay[GIDX] = shadowRay;
        gDi[GIDX] = lfnDotV;
        gRayState[GIDX] = s;
    }

DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
• TraceRayKernel
  ‒ Check if the point on the light is visible or not
  ‒ Reuse code
• AccumulateDIKernel
  ‒ If the ray is not blocked, accumulate the result
    __kernel void AccumulateDIKernel(__global ...)
    {
        Hit shadowHit = gShadowHit[GIDX];
        float4 di = gDi[GIDX];
        if( !shadowHit )
            gFb[GIDX] += di;
    }

SAMPLE NEXT RAY
• Compute the next ray by sampling the BRDF
• Store ray and ray state
    __kernel void SampleNextRayKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        Hit hit = gHit[GIDX];
        if( !hit )
            return;
        nextRay, s = Brdf_Sample( ray, s );
        int dst = atom_inc( &gRayCount );
        gRayNext[dst] = nextRay;
        gRayStateNext[dst] = s;
    }


TRACE KERNEL
• A BVH is used as the acceleration structure
  ‒ Indices are used to describe the hierarchy (no pointers)
• 2-level BVH
  ‒ Top: stores an object in a leaf (object index, transform)
  ‒ Bottom: stores a primitive (triangle, quad) in a leaf
• Store those BVHs in a single memory
  ‒ Traverse the top tree
  ‒ On hitting a leaf, transform the ray into object space
  ‒ Traverse the bottom tree
  ‒ On exit, transform the ray back to world space
[Figure: a node array indexed 0 to 9; a Top BVH with nodes 0 to 6 whose leaves hold (Mesh0, xform0), (Mesh1, xform1), (Mesh2, xform2), (Mesh3, xform3) and refer to Bottom BVHs; in memory, a root idx followed by Top, Bottom A, Bottom B, Bottom C, Bottom D]
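The "transform the ray into object space" step of the two-level traversal can be sketched in C as follows. The Ray and Xform layouts and the function name are assumptions made for this example; the talk does not show how transforms are stored. The sketch assumes each top-level leaf provides a world-to-object matrix and its inverse.

```c
typedef struct { float o[3]; float d[3]; } Ray;   /* origin, direction */

/* Row-major 3x4 affine matrix: linear part in m[r][0..2], translation in m[r][3]. */
typedef struct { float m[3][4]; } Xform;

/* Apply `x` to a ray: the origin is a point and receives the translation,
   the direction is a vector and does not. */
Ray Ray_transform(Ray r, const Xform *x)
{
    Ray out;
    for (int i = 0; i < 3; i++) {
        out.o[i] = x->m[i][0] * r.o[0] + x->m[i][1] * r.o[1]
                 + x->m[i][2] * r.o[2] + x->m[i][3];
        out.d[i] = x->m[i][0] * r.d[0] + x->m[i][1] * r.d[1]
                 + x->m[i][2] * r.d[2];
    }
    return out;
}
```

On reaching a top-level leaf the traversal would apply the leaf's world-to-object transform before descending into the bottom tree, and on exit either apply the inverse or restore a saved copy of the world-space ray.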

SO FAR
• Explained an OpenCL implementation of a simple path tracer
• Easy to extend from here
• Extension can be done by swapping one or two kernels
  ‒ Material system, Shader
  ‒ Light sampling
  ‒ Support for different types of primitives
  ‒ Ray caster + spatial acceleration structure


INSTANCING
• Powerful technique to increase geometric complexity
• Small memory overhead
  ‒ Shares geometric information (vertex, normal, etc.)
  ‒ Shares BVH
  ‒ Stores object transform
[Figure: a Top BVH with nodes 0 to 6 whose leaves hold (Mesh0, xform0), (Mesh0, xform1), (Mesh0, xform2), (Mesh1, xform3); in memory, Top followed by only two Bottom BVHs, Bottom A and Bottom B]

LAYERED MATERIAL
• Binary tree of BRDFs
• Leaf node
  ‒ BRDF
• Internal node
  ‒ Blend function
  ‒ Fresnel blend, Linear blend
• Evaluate one BRDF at a time
  ‒ Traverse the binary tree
  ‒ Random sampling at each internal node
[Figure: a binary tree with leaves Reflect, Diffuse and Microfacet; each internal node splits 0.5 / 0.5; the leaves are labeled pdf=0.25, pdf=0.5, pdf=0.25]
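The random traversal described above can be sketched in C. The node layout, the leaf convention (negative child0) and the function name selectBrdf are assumptions for this example; the talk only states that one BRDF is chosen by random sampling at each internal node. Random numbers are passed in so that the behavior is reproducible.

```c
typedef struct {
    int   child0, child1;  /* node indices; child0 < 0 marks a leaf (assumed) */
    float weight0;         /* probability of descending into child0 */
    int   brdf;            /* BRDF id, meaningful for leaves only */
} LayerNode;

/* Walk from `root` to one leaf. u[] supplies one uniform random number in
   [0,1) per internal node visited. Returns the leaf's BRDF id and writes
   the probability of having selected that leaf to *pdf. */
int selectBrdf(const LayerNode *nodes, int root, const float *u, float *pdf)
{
    int i = root, depth = 0;
    *pdf = 1.0f;
    while (nodes[i].child0 >= 0) {
        if (u[depth++] < nodes[i].weight0) {
            *pdf *= nodes[i].weight0;
            i = nodes[i].child0;
        } else {
            *pdf *= 1.0f - nodes[i].weight0;
            i = nodes[i].child1;
        }
    }
    return nodes[i].brdf;
}
```

With 0.5 / 0.5 weights at both internal nodes, a leaf directly under the root is chosen with probability 0.5 and the two leaves one level down with probability 0.25 each, matching the pdf labels in the figure. The sampled BRDF's contribution then has to be divided by this selection probability.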

LAYERED MATERIAL
Host code with a BRDF selection kernel added
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SelectBRDFKernel );
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }

VR
• Latency is super important
• To improve frame rendering time
  ‒ Used multiple GPUs
  ‒ Foveated rendering
• More than 60fps on 4 Hawaii GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample

VR
Host code with a VR primary ray generation kernel and a pixel fill kernel
    {
        launch( VRPrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) )
                break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
        launch( FillPixelKernel );
    }

DISPLACEMENT MAPPING
• Powerful technique to increase geometric complexity
• Pre-tessellation
  ‒ Required memory is too large
  ‒ GPU memory is too small
• Direct ray tracing
  ‒ When a ray hits a patch, tessellate and displace
[Figure: base mesh, vector displacement map, and the result with vector displacement]
Fig. from http://support.nextlimit.com/display/mxdocsv3/Displacement+component

DISPLACEMENT MAPPING
• Pre-tessellation needs too much memory, so patches are tessellated and displaced when a ray hits them (direct ray tracing)
• To amortize the tessellation and displacement cost, batch ray intersections
• Need to change TraceKernel

DISPLACEMENT MAPPING
• TraceKernel
  ‒ If a ray hits a quad with a displacement map, save (ray, primitive) to a buffer
  ‒ Sort (ray, primitive) pairs by primitive index
  ‒ Process the primitives in the list in parallel
• For each patch
  ‒ Build the quad BVH in parallel
  ‒ Cast rays in parallel
• The key is work buffer memory allocation
[Figure: per-level processing for Level 0 (1 node), Level 1 (4 nodes) and Level 2 (16 nodes), each level with stages labeled "BVH Comt" and "Ray Cast" running side by side]
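The "sort (ray, primitive) pairs by primitive index" step can be sketched on the CPU in C. The struct and function names are hypothetical, and qsort is only a serial stand-in: on the GPU this would be a parallel sort, which the talk does not detail. After sorting, all rays that hit the same displaced quad are contiguous, so one tessellation can serve the whole run.

```c
#include <stdlib.h>

typedef struct { int ray; int prim; } RayPrimPair;

/* Order pairs by primitive index. */
static int cmpByPrim(const void *a, const void *b)
{
    const RayPrimPair *pa = a, *pb = b;
    return (pa->prim > pb->prim) - (pa->prim < pb->prim);
}

/* Sort the pairs by primitive index, then record where each primitive's
   run of rays begins. batchStart must have room for n entries.
   Returns the number of distinct primitives (batches). */
int buildBatches(RayPrimPair *pairs, int n, int *batchStart)
{
    int nBatches = 0;
    qsort(pairs, (size_t)n, sizeof *pairs, cmpByPrim);
    for (int i = 0; i < n; i++)
        if (i == 0 || pairs[i].prim != pairs[i - 1].prim)
            batchStart[nBatches++] = i;
    return nBatches;
}
```

Each batch can then be handled as one unit of work: build the patch's quad BVH once and cast every ray of the run against it.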

VECTOR DISPLACEMENT IN ACTION
[Figure: the base mesh and the result with vector displacement]
52GB of memory would be needed if pre-tessellation were used

OPEN SHADING LANGUAGE
• OSL itself has nothing to do with OpenCL
• Many use cases
• Using OSL in an OCL renderer
  ‒ Translate OSL to
    ‒ OCL kernel
    ‒ SPIR
  ‒ Feed those to the OCL runtime
    ‒ clBuildProgram
    ‒ clCreateKernel
• OSL example
    surface matte
        [[ string description = "Lambertian diffuse material" ]]
    (
        float Kd = 1
            [[ float UImin = 0, float UIsoftmax = 1 ]],
        color Cs = 1
            [[ float UImin = 0, float UImax = 1 ]],
        string texname = "diffuse.tex"
            [[ int texture_slot = 1 ]]
    )
    {
        Ci = Kd * Cs * noise(5.0 * P) * diffuse (N);
    }

SPIR
• Standard Portable Intermediate Representation
• Based on LLVM IR (32, 64)
• Useful to ship OpenCL apps
• Device independent
• OpenCL did not have a usable binary code representation
  ‒ Binary for each device x driver
    ‒ The combinations explode
  ‒ Embed kernel as string
    ‒ Load source, clCreateProgramWithSource
    ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES
    ‒ Load binary, clCreateProgramWithBinary
• The OpenCL implementation has to support the cl_khr_spir extension
  ‒ Works on AMD, Intel (OpenCL 1.2)
  ‒ SPIR 2.0 is coming with OpenCL 2.0

SPIR: CREATE SPIR BINARY
• Offline compiler
  ‒ clang-spir* -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-spir-compile-options "-x spir" -include <opencl_spir.h> -o <output> <input>
  ‒ clBuildProgram with "-x spir -spir-std=CL1.2"
• Use host OpenCL API
  ‒ clCompileProgram + Option
  ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
*https://github.com/KhronosGroup/SPIR
