2. RECAP OF LAST SESSION
• Talked about theory
• BRDFs
  ‒ Reflection, Refraction, Diffuse, Microfacet
• Fresnel is everywhere
• Monte Carlo Ray Tracing
  ‒ Intuitive understanding of Monte Carlo Integration
  ‒ Simple sampling (Random sampling)
  ‒ Better sampling (Importance sampling)
  ‒ Layered material
http://www.slideshare.net/takahiroharada/introduction-to-monte-carlo-ray-tracing-cedec2013
2 | Introduction to Monte Carlo Ray Tracing OpenCL implementation | SEPT 3, 2014
3. REVIEW SIMPLE CPU MC RAY TRACER
Direct illumination
    for( i, j )
    {
        ray = PrimaryRayGen( camera, pixelLoc );
        {
            hit = Trace( ray );
            if( hit )
                d( pixelLoc ) += EvaluateDI( ray, hit );
        }
    }
4. REVIEW SIMPLE CPU MC RAY TRACER
Indirect illumination
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
5. REVIEW SIMPLE CPU MC RAY TRACER
Direct illumination
    for( i, j )
    {
        ray = PrimaryRayGen( camera, pixelLoc );
        {
            hit = Trace( ray );
            if( hit )
                d( pixelLoc ) += EvaluateDI( ray, hit );
        }
    }
Indirect illumination
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
6. COMPARISON
Direct illumination | Indirect illumination
7. WHY OPENCL?
• Speed!
  ‒ GPU can accelerate it
  ‒ Why? The faster, the better
• OpenCL is an API for GPU compute
• OpenCL is not only for graphics programmers
• OpenCL does not always require a GPU
  ‒ Runs on CPU too
  ‒ Runs anywhere there is a CPU (given an OpenCL CPU runtime)
• If the renderer is written in OpenCL, it runs on Windows, Linux, Mac OS X
9. THINGS TO BE DONE
DATA STRUCTURE
• No pointers in OpenCL*
• Change pointers to indices
• Data is stored in flat memory
• Not suited for partial updates
*Shared Virtual Memory (OpenCL 2.0)
10. THINGS TO BE DONE
DATA STRUCTURE
• Node data for a binary tree
  ‒ Spatial acceleration structure (BVH)
  ‒ Shading network
• Buffer<NodeData> nodeData;
[Figure: nodeData as a flat array of Node Data entries, each holding m_max.x, m_max.y, m_max.z, m_min.x, m_min.y, m_min.z, m_child0, m_child1]
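The pointer-to-index change can be sketched in plain C as below. Field names follow the NodeData figure; the leaf convention (a negative child index), the fixed stack size, the helper `countLeaves`, and the three-node example tree are assumptions made for illustration only, not the renderer's actual code.

```c
#include <assert.h>

/* Node layout following the slide's NodeData figure. */
typedef struct {
    float m_max[3];
    float m_min[3];
    int   m_child0;   /* index into the node array; -1 means "no child" (assumed) */
    int   m_child1;
} NodeData;

/* Counts the leaves reachable from `root`. Child links are array indices and
   the traversal uses an explicit index stack instead of recursion.
   The stack of 64 entries is an assumed bound on tree depth. */
int countLeaves(const NodeData *nodes, int root)
{
    int stack[64];
    int top = 0, leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        const NodeData *n = &nodes[stack[--top]];
        if (n->m_child0 < 0 && n->m_child1 < 0) { leaves++; continue; }
        if (n->m_child0 >= 0) stack[top++] = n->m_child0;
        if (n->m_child1 >= 0) stack[top++] = n->m_child1;
    }
    return leaves;
}

/* Hypothetical tree: node 0 is the root, nodes 1 and 2 are leaves. */
static const NodeData kExampleNodes[3] = {
    { {1, 1, 1}, {-1, -1, -1},  1,  2 },
    { {0, 1, 1}, {-1, -1, -1}, -1, -1 },
    { {1, 1, 1}, { 0, -1, -1}, -1, -1 },
};
```

Because every link is an index, the whole array can be copied to the device as one buffer without fixing up addresses.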
11. THINGS TO BE DONE
DATA STRUCTURE
• Material
  ‒ Texture entry
• Buffer<Material> material;
• Buffer<char> texData;
• Buffer<uint> texTable;
[Figure: Material0, Material1, ... (m_kd, m_ior, m_..., m_kdTex, m_iorTex, m_bumpTex); TextureTable entries tex0, tex1, tex2, tex3, tex4, ...; Texture0, Texture1, Texture2, Texture3, ... (m_header, m_data)]
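One way the three buffers could fit together is sketched below in C. It assumes that a material's `m_kdTex` is an entry number in `texTable`, that `texTable[i]` is the byte offset of texture i inside `texData`, and that each texture starts with a small header; the `TexHeader` fields and both helper functions are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical header stored at the start of each texture in texData. */
typedef struct { unsigned int m_width, m_height; } TexHeader;

typedef struct {
    float m_kd[3];
    float m_ior;
    int   m_kdTex;   /* entry in texTable, or -1 if unused (assumed convention) */
} Material;

/* Reads the header of texture `texIdx`.
   texTable[texIdx] is assumed to be a byte offset into texData. */
TexHeader loadTexHeader(const char *texData, const unsigned int *texTable, int texIdx)
{
    TexHeader h;
    memcpy(&h, texData + texTable[texIdx], sizeof h);  /* memcpy avoids unaligned reads */
    return h;
}

/* Returns the address of the first texel byte, which follows the header. */
const char *texPixels(const char *texData, const unsigned int *texTable, int texIdx)
{
    return texData + texTable[texIdx] + sizeof(TexHeader);
}
```

The extra indirection through the table lets textures of different sizes live in one `char` buffer while materials store only small integers.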
12. THINGS TO BE DONE
WRITING OPENCL KERNEL
CPU code
    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
OpenCL kernel
    __kernel void PtKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit )
                return;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
13. IT WORKS BUT…
• This approach is simple
• But it has a lot of issues
14. DRAWBACKS
PERFORMANCE
• Unlikely to utilize hardware efficiently
  ‒ SIMD divergence
  ‒ GPU occupancy (latency)
• Maintainability
• Extendibility, Portability
16. OPENCL ON CPU
• Processing element executes Work item (thread)
  ‒ A SIMD lane (4*)
• Compute unit executes Work group (thread group)
  ‒ A core (8*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel (shader)
  ‒ A CPU
  ‒ # of compute units != # of work groups
* On AMD FX-8350
[Figure: a Kernel split into Work groups, each mapped to a Compute Unit; each Work item mapped to a Processing element]
17. GPU VS CPU
• Processing element executes Work item
  ‒ A SIMD lane (64*)
• Compute unit executes Work group
  ‒ A SIMD engine (44x4*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel
  ‒ A GPU
  ‒ # of compute units != # of work groups
* On AMD Radeon R9 290X
[Figure: the same Kernel, Work groups, and Work items mapped onto a CPU (Compute Units with 4 Processing elements) and a GPU (Compute Units with 64 Processing elements)]
18. HIGH LEVEL DESCRIPTION
• Today's GPU is similar to a CPU (if you look at it at a very high level)
  ‒ A GPU is an extremely wide CPU
  ‒ Many cores
  ‒ Wide SIMD
• AMD Radeon R9 290X GPU
  ‒ 176 = 44x4 SIMD engines (cores)
  ‒ 64-wide SIMD
• But it differs in
  ‒ SIMD width (very wide)
  ‒ Limited local resources
  ‒ Strategy to hide latency
• Knowing these differences is the key to exploiting the performance
19. SIMD DIVERGENCE
• SIMD execution = Program counter is shared among SIMD lanes
• If it diverges in branches, HW utilization decreases a lot (it gets easier to diverge on wide SIMD)
    int funcA()
    {
        int value = 0;
        int a = computeA();
        if( a == 0 )
            value = compute0();
        else if( a == 1 )
            value = compute1();
        else if( a == 2 )
            value = compute2();
        else if( a == 3 )
            value = compute3();
        return value;
    }
[Figure: execution of funcA across Lane0 to Lane7]
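The cost of divergence in funcA can be estimated with a small model, sketched here in C. It assumes the branch value `a` is in 0..3, that every computeN() costs the same, and that the SIMD runs one serialized pass per distinct branch taken by any lane; the function names are illustrative.

```c
#include <assert.h>

/* Number of serialized passes for one SIMD: one per distinct branch value
   taken by at least one lane (branch values assumed to be in 0..3). */
int divergentPasses(const int *a, int lanes)
{
    int taken[4] = {0, 0, 0, 0};
    int passes = 0;
    for (int i = 0; i < lanes; i++) taken[a[i]] = 1;
    for (int b = 0; b < 4; b++) passes += taken[b];
    return passes;
}

/* Fraction of lane slots doing useful work under the equal-cost assumption:
   each lane is active in exactly one of the passes. */
float simdUtilization(const int *a, int lanes)
{
    return 1.0f / (float)divergentPasses(a, lanes);
}
```

With 8 lanes all taking the same branch the model gives full utilization; if the lanes spread over all four branches it drops to one quarter. A 64-wide SIMD has more lanes that can disagree, which is why divergence gets easier as the SIMD gets wider.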
28. LATENCY
• The highest latency comes from memory access
• CPUs mitigate it by having larger caches
  ‒ Latency of cache access is small (fast)
• Most memory accesses do not go to memory
• A CPU can run at full speed until a cache miss
• The # of concurrent executions on the GPU is far larger than on the CPU
  ‒ More than 11k (= 44x4x64) work items
• The GPU cache is not large enough to absorb memory requests from those if they all request different parts of memory
• Strategy
  ‒ Keep memory access as local as possible (not realistic for practical apps)
  ‒ Use the GPU mechanism for latency hiding
29. GPU LATENCY HIDING
• GPU can execute at full speed if there are only ALU instructions (Inst. 0 - 2) *
• Stalls on a memory access instruction (Inst. 3)
* Can hide latency using logical vector
[Figure: Inst. 0, Inst. 1, Inst. 2, Inst. 3 (MemAccess), Inst. 4 issued over time on Lane0 to Lane3]
30. GPU LATENCY HIDING
• When stalled, switch to another work group
• Could fill the stall with instructions from WG1
• A SIMD of the GPU needs to process multiple WGs at the same time to hide latency (or maximize its throughput)
[Figure: WG0: Inst. 0 to Inst. 4 and WG1: Inst. 0 to Inst. 3 interleaved over time on Lane0 to Lane3]
31. HOW MANY WGS CAN WE EXECUTE PER SIMD
• 10 wavefronts (64 WIs) per SIMD is the max
• It depends on the local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n work groups
  ‒ 1 wavefront, 256 VGPRs ☹☹
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs ☺
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB ☹☹
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB ☺
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 32KB/SIMD
32. ADVICE TO REDUCE VGPR PRESSURE
GET MORE PERFORMANCE FROM GPU
• Don't write a large kernel
• If the program can be split into several pieces, split it into several kernels
  ‒ Single kernel approach
    ‒ VGPR usage of the kernel is 200 = max(60, 200, 10)
    ‒ 1 wavefront per SIMD
    ‒ Bad for latency hiding
  ‒ Multiple kernel approach
    ‒ FuncA: 4 wavefronts per SIMD
    ‒ FuncB: 1 wavefront per SIMD
    ‒ FuncC: 10 wavefronts per SIMD
    ‒ FuncB is bad, but FuncA and FuncC run fast
• Helps the compiler too
[Figure: Single Kernel containing FuncA (60 VGPRs), FuncB (200 VGPRs), FuncC (10 VGPRs) vs. Multiple Kernels with FuncA (60 VGPRs), FuncB (200 VGPRs), FuncC (10 VGPRs) as separate kernels]
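The wavefront counts quoted for FuncA, FuncB, and FuncC follow from the two limits given earlier (256 VGPRs per SIMD lane, at most 10 wavefronts per SIMD). A minimal C sketch of that arithmetic, assuming VGPRs are the only limiting resource and ignoring register allocation granularity, LDS, and scalar registers; the function names are illustrative:

```c
#include <assert.h>

enum { VGPRS_PER_LANE = 256, MAX_WAVES_PER_SIMD = 10 };

/* Wavefronts that fit on one SIMD for a kernel using `vgprs` (> 0)
   vector registers per work item. */
int wavesPerSimd(int vgprs)
{
    int n = VGPRS_PER_LANE / vgprs;
    return n < MAX_WAVES_PER_SIMD ? n : MAX_WAVES_PER_SIMD;
}

/* A single kernel needs as many registers as its hungriest part. */
int singleKernelVgprs(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}
```

For the slide's numbers, `wavesPerSimd(singleKernelVgprs(60, 200, 10))` gives 1, while the split kernels get 4, 1, and 10 wavefronts respectively.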
33. WHAT IS THE SCALAR ARCHITECTURE?
VLIW4 (NI), VLIW5 (EG)
• Best for vector computation
• Low efficiency on scalar computation
• Physical concurrent execution
  ‒ 16 work items
  ‒ 4 ALU operations each
  ‒ Total: 16x4 ALU operations
• 1 SIMD is running in a CU
• Difficult to fill xyzw
• If not filled, we waste HW cycles
Scalar Architecture (SI, CI)
• Good for both vector and scalar computation
• Needs more work groups to fill the GPU
• Physical concurrent execution
  ‒ 16x4 work items
  ‒ 1 ALU operation each
  ‒ Total: 16x4 ALU operations
• 4 SIMDs are running in a CU
• 4x more work groups are necessary to fill the HW
[Figure: VLIW: SIMD0 with Lane 0 to Lane 15, each lane issuing X, Y, Z, W; Scalar: SIMD0 to SIMD3]
34. ANOTHER SOLUTION
• Splitting computation into multiple kernels
  ‒ Primary ray gen kernel
  ‒ Trace kernel
  ‒ Evaluate DI kernel
  ‒ Etc.
• Better HW utilization
  ‒ Less divergence
  ‒ Higher HW occupancy
35. OTHER BENEFITS?
[Figure panels: Ray Generation, First Hit, First Hit (Normal), Direct Illumination, Indirect Illumination]
• Maintainability
  ‒ Debugging is not as easy as in C, C++
  ‒ Not all compilers are mature
    ‒ Can hit a compiler bug, which is hard to debug
    ‒ Helps the compiler
  ‒ By splitting kernels, we can isolate the issue
  ‒ If the code is developed by many people, this is important
• Extendibility, Portability
  ‒ Easy to extend features
  ‒ Primary Ray Gen Kernel
    ‒ Add another camera projection
  ‒ Ray Casting Kernel
    ‒ Easy to add other primitives (e.g., vector displacement)
    ‒ Take it out for physics ray casting queries
39. SPLITTING KERNELS
Host Code
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
Device Code
    __kernel void PrimaryRayGenKernel();
• Generate rays for all pixels in parallel
    __kernel void TraceKernel();
• Compute intersections for all rays in parallel
    __kernel void SampleLightKernel();
• Sample a light for all hit points in parallel
    __kernel void AccumulateDIKernel();
• Accumulate DI for all hit points in parallel
    __kernel void SampleNextRayKernel();
• Generate bounced rays for all hit points in parallel
41. DESIGN RAY STATE
• Cannot keep state between kernels
• Ray state needs to be saved to/restored from global memory
• Example
  ‒ Ray generation + ray direction visualization
    struct RayState
    {
        float4 m_throughput;
        int2 m_randomNumber;
        int m_pixelIdx;
    };
    __kernel void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState; // save
    }
    __kernel void VisualizeRayKernel(__global ...)
    {
        RayState s = gRayState[get_global_id(0)]; // restore
        Ray ray = gRay[get_global_id(0)];
        gFb[s.m_pixelIdx] = Ray_getDir( ray );
    }
42. DESIGN RAY STATE
[Figure: PrimaryRayGenKernel work items (In: pixelIdx, Out: Ray) and VisualizeKernel work items (In: Ray, Out: Pixel color), with the state held in Global memory between them]
45. RAY COMPACTION
• Sparse data lowers SIMD utilization
• Without compaction
  ‒ 3 SIMD executions
  ‒ Occupancy (3/8, 1/8, 4/8)
• With compaction
  ‒ 1 SIMD execution
  ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
• No need to write a compaction kernel
• Can compact using global atomics
  ‒ Prepare a counter (gRayCount)
  ‒ Perform an atomic increment to reserve memory
  ‒ Better to do atomics within the WG first, then do one atomic add per WG
    __kernel void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState;
    }
[Figure: Primary and Secondary ray buffers, without and with compaction]
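The "atomics in the WG first, then one atomic add per WG" idea can be emulated on the host with C11 atomics, as sketched below. One call stands in for one work group and runs sequentially; on a GPU the counting step would use local-memory atomics or a scan. `WG_SIZE`, the `alive` flags, and the function name are assumptions for illustration, and rays are reduced to plain integers.

```c
#include <assert.h>
#include <stdatomic.h>

enum { WG_SIZE = 8 };

/* Compacts the live rays of one work group into `dst`.
   The group reserves its output range with a single global atomic add
   instead of one atomic increment per ray. Returns the number of rays written. */
int compactWorkGroup(const int *rays, const int *alive, int wgStart,
                     int *dst, atomic_int *gRayCount)
{
    int localOffset[WG_SIZE];
    int localCount = 0;

    /* Step 1: count live rays inside the work group and assign local slots. */
    for (int i = 0; i < WG_SIZE; i++)
        localOffset[i] = alive[wgStart + i] ? localCount++ : -1;

    /* Step 2: one global atomic reserves a contiguous range for the group. */
    int base = atomic_fetch_add(gRayCount, localCount);

    /* Step 3: each live ray writes to its reserved slot. */
    for (int i = 0; i < WG_SIZE; i++)
        if (localOffset[i] >= 0)
            dst[base + localOffset[i]] = rays[wgStart + i];

    return localCount;
}
```

Compared with one `atom_inc` per ray, this reduces contention on the global counter to one operation per work group.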
46. DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
  ‒ Want to keep the work uniform
  ‒ A different # of light samples per ray isn't good
  ‒ Compute the contribution from one point on a light
• Simple approach
  ‒ Select a light
  ‒ Select a point on the light
  ‒ Compute DI without the occlusion term
• More sophisticated light sampling
  ‒ Using potential contribution for PDF
  ‒ Forward+ style light culling
    __kernel void SampleLightKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        shadowRay, lfnDotV = Light_Sample( ray, s );
        gShadowRay[GIDX] = shadowRay;
        gDi[GIDX] = lfnDotV;
        gRayState[GIDX] = s;
    }
47. DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
• TraceRayKernel
  ‒ Check if the point on the light is visible or not
  ‒ Reuse code
• AccumulateDIKernel
  ‒ If the ray is not blocked, accumulate the result
    __kernel void AccumulateDIKernel(__global ...)
    {
        Hit shadowHit = gShadowHit[GIDX];
        float4 di = gDi[GIDX];
        if( !shadowHit )
            gFb[GIDX] += di;
    }
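The "simple approach" (select one light, compute the unoccluded contribution, accumulate only if the shadow ray is unblocked) can be sketched in scalar C as below. Uniform light selection and dividing by the selection probability are assumptions of this sketch; the deck does not say how Light_Sample weights its result, and both function names are hypothetical.

```c
#include <assert.h>

/* Picks one of numLights lights uniformly from u in [0,1).
   *pdf receives the selection probability. */
int selectLight(float u, int numLights, float *pdf)
{
    int idx = (int)(u * (float)numLights);
    if (idx >= numLights) idx = numLights - 1;   /* guard against u == 1 */
    *pdf = 1.0f / (float)numLights;
    return idx;
}

/* Adds the unoccluded contribution, divided by the selection probability,
   to the pixel only when the shadow ray was not blocked. */
float accumulateDI(float pixel, float unoccludedDi, float pdf, int shadowHit)
{
    return shadowHit ? pixel : pixel + unoccludedDi / pdf;
}
```

Every ray does the same amount of work (one light, one shadow ray), which keeps the kernel uniform across SIMD lanes; averaging many such samples converges to the sum over all visible lights.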
48. SAMPLE NEXT RAY
• Compute the next ray by sampling the BRDF
• Store the ray and ray state
    __kernel void SampleNextRayKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        Hit hit = gHit[GIDX];
        if( !hit )
            return;
        nextRay, s = Brdf_Sample( ray, s );
        int dst = atom_inc( &gRayCount );
        gRayNext[dst] = nextRay;
        gRayStateNext[dst] = s;
    }
51. TRACE KERNEL
• BVH is used as the acceleration structure
  ‒ Indices are used to describe the hierarchy (no pointers)
• 2-level BVH
  ‒ Top: stores an object in a leaf (object index, transform)
  ‒ Bottom: stores a primitive (triangle, quad) in a leaf
• Store those BVHs in a single memory buffer
  ‒ Traverse the top tree
  ‒ On hitting a leaf, transform the ray into object space
  ‒ Traverse the bottom tree
  ‒ On exit, transform the ray back to world space
[Figure: Top BVH with nodes 0 to 6 whose leaves hold (Mesh0, xform0), (Mesh1, xform1), (Mesh2, xform2), (Mesh3, xform3), each referring to a Bottom BVH; memory layout indexed 0 to 9 with root idx: Top, Bottom A, Bottom B, Bottom C, Bottom D]
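Storing the top tree and all bottom trees in one buffer means child indices must be valid within the shared array. One possible way to do that, sketched in C: append each tree and rebase its child indices by the insertion offset. The reduced `Node` type (bounds omitted), the negative-index leaf convention, and `appendTree` are assumptions for illustration.

```c
#include <assert.h>

/* Reduced node: only the child links; bounds are omitted for brevity. */
typedef struct { int m_child0, m_child1; } Node;

/* Appends `tree` (n nodes, child indices relative to its own array) to the
   shared array `all`, rebasing child indices by the insertion offset.
   Negative indices are assumed to mark leaves and are left untouched.
   Returns the index of the appended tree's root (its node 0). */
int appendTree(Node *all, int *count, const Node *tree, int n)
{
    int offset = *count;
    for (int i = 0; i < n; i++) {
        Node node = tree[i];
        if (node.m_child0 >= 0) node.m_child0 += offset;
        if (node.m_child1 >= 0) node.m_child1 += offset;
        all[offset + i] = node;
    }
    *count += n;
    return offset;
}
```

The returned root index is what a top-level leaf would store next to its transform, so the traversal can jump from the top tree into a bottom tree without leaving the buffer.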
52. SO FAR
• Explained an OpenCL implementation of a simple path tracer
• Easy to extend from here
• Extension can be done by swapping one or two kernels
  ‒ Material system, Shader
  ‒ Light sampling
  ‒ Support for different types of primitives
  ‒ Ray caster + spatial acceleration structure
55. INSTANCING
• Powerful technique to increase geometric complexity
• Small memory overhead
  ‒ Shares geometric information (vertex, normal, etc.)
  ‒ Shares BVH
  ‒ Stores object transform
[Figure: Top BVH with nodes 0 to 6 whose leaves hold (Mesh0, xform0), (Mesh0, xform1), (Mesh0, xform2), (Mesh1, xform3); only two Bottom BVHs are stored; memory layout: Top, Bottom A, Bottom B]
56. LAYERED MATERIAL
• Binary tree of BRDFs
• Leaf node
  ‒ BRDF
• Internal node
  ‒ Blend function
  ‒ Fresnel blend, Linear blend
• Evaluate one BRDF at a time
  ‒ Traverse binary tree
  ‒ Random sampling at internal node
[Figure: a tree whose root picks each child with 0.5; one child is Microfacet (pdf=0.5), the other is an internal node that picks Reflect (pdf=0.25) or Diffuse (pdf=0.25) with 0.5 each]
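The "evaluate one BRDF at a time" traversal can be sketched in C as below. The node layout, the BRDF ids, and the use of a constant probability per internal node are assumptions (a Fresnel blend would make the weight depend on the view angle); the example in the usage note mirrors the 0.5/0.5 tree with pdf 0.25, 0.25, 0.5.

```c
#include <assert.h>

enum { BRDF_REFLECT, BRDF_DIFFUSE, BRDF_MICROFACET };  /* hypothetical ids */

typedef struct {
    int   m_child0, m_child1;  /* child indices; -1 on leaves (assumed) */
    float m_weight0;           /* probability of descending into child0 */
    int   m_brdf;              /* BRDF id, used on leaves only */
} LayerNode;

/* Walks from the root to one leaf, choosing a child at each internal node
   with one random number per level (u must hold at least as many numbers
   as the tree is deep). *pdf receives the product of the choice probabilities. */
int selectBrdf(const LayerNode *nodes, int root, const float *u, float *pdf)
{
    int idx = root, level = 0;
    *pdf = 1.0f;
    while (nodes[idx].m_child0 >= 0) {
        float w = nodes[idx].m_weight0;
        if (u[level++] < w) { *pdf *= w;        idx = nodes[idx].m_child0; }
        else                { *pdf *= 1.0f - w; idx = nodes[idx].m_child1; }
    }
    return nodes[idx].m_brdf;
}
```

With a root that splits 0.5/0.5 between an internal node (Reflect/Diffuse at 0.5 each) and a Microfacet leaf, the random numbers (0.1, 0.1) select Reflect with pdf 0.25 and (0.9, anything) select Microfacet with pdf 0.5. The sampled BRDF's result is then divided by this pdf so the estimate stays unbiased.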
57. LAYERED MATERIAL
    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SelectBRDFKernel );
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
59. VR
• Latency is super important
• To improve the frame rendering time,
  ‒ Used multiple GPUs
  ‒ Foveated rendering
• More than 60fps on 4 Hawaii GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample
60. VR
    {
        launch( VRPrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
        launch( FillPixelKernel );
    }
61. DISPLACEMENT MAPPING
• Powerful technique to increase geometric complexity
• Pre-tessellation
  ‒ Required memory is too large
  ‒ GPU memory is too small
• Direct ray tracing
  ‒ When a ray hits a patch, tessellate and displace
[Figure: Base mesh; Vector displacement map; With vector displacement]
Fig. from http://support.nextlimit.com/display/mxdocsv3/Displacement+component
62. DISPLACEMENT MAPPING
• To amortize the tessellation and displacement cost, batch ray intersections
• Need to change TraceKernel
63. DISPLACEMENT MAPPING
• TraceKernel
  ‒ If a ray hits a quad with a displacement map, save (ray, primitive) to a buffer
  ‒ Sort (ray, primitive) pairs by primitive index
  ‒ Process primitives in the list in parallel
• For each patch
  ‒ Build quad BVH in parallel
  ‒ Cast rays in parallel
• Key is work buffer memory allocation
[Figure: for Level 0 (1 node), Level 1 (4 nodes), and Level 2 (16 nodes), BVH computation steps followed by Ray Cast steps]
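The sort step can be sketched on the host in C as below: pairs are sorted by primitive index so that all rays hitting the same displaced quad form one contiguous batch, and the patch needs to be tessellated only once per batch. The pair layout, the use of `qsort` (a GPU implementation would use a parallel sort), and `buildBatches` are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int m_rayIdx; int m_primIdx; } RayPrimPair;

/* Orders by primitive index, then by ray index to make the result deterministic. */
static int comparePairs(const void *a, const void *b)
{
    const RayPrimPair *pa = a, *pb = b;
    if (pa->m_primIdx != pb->m_primIdx)
        return pa->m_primIdx < pb->m_primIdx ? -1 : 1;
    return (pa->m_rayIdx > pb->m_rayIdx) - (pa->m_rayIdx < pb->m_rayIdx);
}

/* Sorts the pairs and writes the start offset of each batch (run of equal
   m_primIdx) into batchStart, which must hold up to n entries.
   Returns the number of batches. */
int buildBatches(RayPrimPair *pairs, int n, int *batchStart)
{
    int numBatches = 0;
    qsort(pairs, (size_t)n, sizeof pairs[0], comparePairs);
    for (int i = 0; i < n; i++)
        if (i == 0 || pairs[i].m_primIdx != pairs[i - 1].m_primIdx)
            batchStart[numBatches++] = i;
    return numBatches;
}
```

Each batch can then be handed to one work group, which builds the quad BVH for its patch and casts the batch's rays against it.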
64. VECTOR DISPLACEMENT IN ACTION
[Figure: Base mesh; Vector displacement]
52GB memory if pre-tessellation is used
65. OPEN SHADING LANGUAGE
• OSL itself has nothing to do with OpenCL
• Many use cases
• Using OSL in an OCL renderer
  ‒ Translate OSL to
    ‒ OCL kernel
    ‒ SPIR
  ‒ Feed those to the OCL runtime
    ‒ clBuildProgram
    ‒ clCreateKernel
• OSL example
    surface matte
        [[ string description = "Lambertian diffuse material" ]]
    (
        float Kd = 1
            [[ float UImin = 0, float UIsoftmax = 1 ]],
        color Cs = 1
            [[ float UImin = 0, float UImax = 1 ]],
        string texname = "diffuse.tex"
            [[ int texture_slot = 1 ]]
    )
    {
        Ci = Kd * Cs * noise(5.0 * P) * diffuse (N);
    }
66. SPIR
• Standard Portable Intermediate Representation
• Based on LLVM IR (32, 64)
• Useful to ship OpenCL apps
• Device independent
• OpenCL did not have a usable binary code representation
  ‒ Binary for each device x driver
    ‒ Combinations explode
  ‒ Embed kernel as string
  ‒ Load source, clCreateProgramWithSource
  ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES
  ‒ Load binary, clCreateProgramWithBinary
• The OpenCL implementation has to support the cl_khr_spir extension
  ‒ Works on AMD, Intel (OpenCL 1.2)
  ‒ SPIR 2.0 is coming with OpenCL 2.0
67. SPIR
CREATE SPIR BINARY
• Offline compiler
  ‒ clang-spir* -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-spir-compile-options "-x spir" -include <opencl_spir.h> -o <output> <input>
  ‒ clBuildProgram with "-x spir -spir-std=CL1.2"
• Use host OpenCL API
  ‒ clCompileProgram + Option
  ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
*https://github.com/KhronosGroup/SPIR