Foveated Ray Tracing for VR on Multiple GPUs

FOVEATED RAY TRACING FOR VR ON MULTIPLE GPUS

TAKAHIRO HARADA, AMD

12/2014

2 | DEC 3, 2014

INTRO

• Ray tracing + foveated rendering + VR + multiple GPUs == a lot of GPU compute!!
• Compute fills a texture
• Use GL/CL interop to display


GPU RAY TRACING

• Everything is written in compute
• Our renderer is 100% OpenCL
  ‒ Win, Linux, OSX
  ‒ GPU, CPU
• High quality rendering compared to raster graphics


GPU RAY TRACING

• A single big kernel
  ‒ Easy to port
  ‒ Works
• Do you write only 1 pixel shader??
• Drawbacks
  ‒ Performance: suffers from SIMD divergence and low GPU occupancy (uses too many VGPRs)
  ‒ Maintainability
  ‒ Extensibility
  ‒ Portability
  ‒ Debugging
• Multiple kernel implementation

IMPLEMENTATION CHOICES


HOW MANY WGS CAN WE EXECUTE PER SIMD (AMD GPU)

• 10 wavefronts (64 WIs each) per SIMD is the max
• It depends on local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n wavefronts
  ‒ 1 wavefront, 256 VGPRs :( :(
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs :)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB :( :(
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB :)

• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB / 64 lanes / 4 bytes)
• LDS (local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 16KB/SIMD (= 64KB / 4)
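The occupancy arithmetic on this slide can be sketched as a small calculator. This is a rough estimate only: the function name and constants are illustrative, it uses the 256 VGPR and 16KB LDS budgets from the bullet list, it assumes one wavefront per work group, and it ignores the register allocation granularity of real hardware.

```python
# Budgets taken from the slide; the names are hypothetical.
MAX_WAVES_PER_SIMD = 10
VGPR_BUDGET = 256              # VGPRs per SIMD lane
LDS_BUDGET_BYTES = 16 * 1024   # LDS share per SIMD, as listed on the slide


def waves_per_simd(vgprs_per_work_item, lds_bytes_per_group=0):
    """Estimate resident wavefronts per SIMD (assumes 1 wavefront per work group)."""
    if vgprs_per_work_item <= 0:
        raise ValueError("vgprs_per_work_item must be positive")
    # Registers are divided among resident wavefronts, capped by the hardware max.
    waves = min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // vgprs_per_work_item)
    if lds_bytes_per_group > 0:
        # LDS is divided among resident work groups in the same way.
        waves = min(waves, LDS_BUDGET_BYTES // lds_bytes_per_group)
    return waves
```

For example, a kernel using 64 VGPRs and 8KB of LDS per work group would be limited to 2 resident wavefronts by LDS, not by registers.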


GPU RAY TRACING

Single kernel implementation

  Host Code
    launch( RayTraceKernel );

  Device Code
    __kernel void RayTraceKernel();

Multiple kernel implementation

  Host Code
    launch( PrimaryRayGenKernel );
    while(1)
    {
      launch( TraceKernel );          // trace the current rays
      if( !any( hits ) )
        break;                        // no ray hit anything: done
      launch( SampleLightKernel );    // generate shadow rays
      launch( TraceKernel );          // trace the shadow rays
      launch( AccumulateDIKernel );   // accumulate direct illumination
      launch( SampleNextRayKernel );  // generate the next bounce
    }

  Device Code
    __kernel void PrimaryRayGenKernel()
    __kernel void TraceKernel()
    __kernel void SampleLightKernel()
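The control flow of the multiple kernel host loop can be mimicked on the CPU. The sketch below is not the renderer's code: each launch is replaced by plain Python, `trace` is a hypothetical callback that says whether a ray hits anything at a given bounce, and the ray compaction and the bounce cap are added assumptions.

```python
def render(num_rays, trace, max_bounces=8):
    """Toy host loop: returns per-ray accumulated light and the launch order."""
    rays = list(range(num_rays))            # stand-in for PrimaryRayGenKernel
    radiance = [0.0] * num_rays
    launches = ["PrimaryRayGenKernel"]
    for bounce in range(max_bounces):       # the slide uses while(1); the cap is an assumption
        launches.append("TraceKernel")
        hits = [trace(ray, bounce) for ray in rays]
        if not any(hits):
            break                           # no ray hit anything, so the frame is done
        launches += ["SampleLightKernel", "TraceKernel",
                     "AccumulateDIKernel", "SampleNextRayKernel"]
        survivors = []
        for ray, hit in zip(rays, hits):
            if hit:
                radiance[ray] += 1.0        # placeholder for direct illumination
                survivors.append(ray)
        rays = survivors                    # keep only rays that continue (assumed)
    return radiance, launches
```

Each kernel in such a loop stays small, which is the point of the design: register pressure and divergence are bounded per kernel instead of by the worst path through one big kernel.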


RAY TRACING + VR

• Ray tracing is flexible
• Raster graphics: a single projection matrix
• Can cast rays in arbitrary directions
• Easy to set up VR
• But performance isn't good enough
• Computation cost
  ‒ Scene complexity
  ‒ # of samples (rays)

Fully ray traced but using baked textures :)


RAY TRACING + VR

• To speed it up,
  ‒ Reduce # of pixels to be shaded
• Pixel shading (sample) reduction
  ‒ Sample reuse (left & right)
  ‒ Foveated rendering


SAMPLE REUSE


FOVEATED RENDERING

• We can only see clearly where we are looking
• Shading at full rate everywhere is a waste of computation
• Steps
  ‒ Create a density map
  ‒ Ray trace 1 sample for each area
  ‒ Reconstruct full resolution image
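The first step, creating a density map, might look like the sketch below. The talk does not say how its pattern is built, so everything here is an assumption: the exponential falloff, the `falloff` value, and the rejection sampling are only one simple way to get samples that are dense at the gaze point and sparse in the periphery.

```python
import math
import random


def foveated_samples(num_candidates, gaze=(0.5, 0.5), falloff=8.0, seed=0):
    """Return normalized (x, y) samples whose density decays away from the gaze point."""
    rng = random.Random(seed)               # fixed seed keeps the pattern reproducible
    samples = []
    for _ in range(num_candidates):
        x, y = rng.random(), rng.random()
        distance = math.hypot(x - gaze[0], y - gaze[1])
        # Keep the candidate with probability equal to the assumed density.
        if rng.random() < math.exp(-falloff * distance):
            samples.append((x, y))
    return samples
```

Because the pattern depends only on the gaze point and not on the scene, it can be precomputed once, which matches the precomputed data structures described in the next step.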


1. DENSITY MAP DATA STRUCTURE

• 2 data structures are precomputed
• Array<float2> samples( M )
  ‒ Sample position
  ‒ Normalized coordinate (x, y)
• Array<NeighborInfo> neighborInfo( N )
  ‒ For frame reconstruction
  ‒ Sample id[k]
  ‒ Sample weight[k]
• # of pixels: N
• # of samples: M
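A possible way to precompute the two arrays is sketched below. The field names mirror the slide, but the brute-force nearest-sample search and the inverse-distance weights are assumptions; the talk only states that each pixel stores k sample ids and k weights.

```python
import math
from dataclasses import dataclass


@dataclass
class NeighborInfo:
    sample_id: list   # ids of the k nearest samples
    weight: list      # k blend weights that sum to 1


def build_neighbor_info(samples, width, height, k=2):
    """Return one NeighborInfo per pixel (N = width * height entries, row-major)."""
    infos = []
    for py in range(height):
        for px in range(width):
            cx, cy = (px + 0.5) / width, (py + 0.5) / height  # normalized pixel center
            nearest = sorted(
                (math.hypot(sx - cx, sy - cy), i) for i, (sx, sy) in enumerate(samples)
            )[:k]
            raw = [1.0 / (d + 1e-6) for d, _ in nearest]       # inverse-distance weights (assumed)
            total = sum(raw)
            infos.append(NeighborInfo([i for _, i in nearest], [w / total for w in raw]))
    return infos
```

The search is O(N * M) and only suitable as an offline precomputation; a spatial grid would be the obvious replacement for large patterns.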


2. ASSIGN A UNIQUE INDEX FOR EACH SAMPLE

• Execute work item for each sample in the pattern
• Check which sample is in the rendered area
• Use atomic inc to get a unique index
  ‒ Count: # of samples
  ‒ Unique indices

As multiple samples are taken for a render(), unique indices are necessary to identify storage locations.

[Figure: Samples buffer holding IDs 0, 5, 7, 2, 10, 23, 22 from the Rendering Area, with matching Ray and Color buffers; Count = 7]
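The compaction in this step can be simulated sequentially. In the sketch below the loop body plays the role of one work item and `count` plays the role of the shared counter; the function name and the rectangle format are hypothetical. On a GPU the order of the resulting indices is not deterministic, only their uniqueness is guaranteed.

```python
def assign_indices(samples, area):
    """Compact the ids of samples inside area = (x0, y0, x1, y1).

    Returns (sample_ids, count), where sample_ids[slot] is the pattern id
    stored at that slot of the Samples, Ray and Color buffers.
    """
    x0, y0, x1, y1 = area
    sample_ids = [None] * len(samples)      # output buffer sized for the worst case
    count = 0                               # shared counter, incremented atomically on the GPU
    for sid, (x, y) in enumerate(samples):  # one "work item" per sample in the pattern
        if x0 <= x < x1 and y0 <= y < y1:
            slot = count                    # an atomic increment returns the old value
            count += 1
            sample_ids[slot] = sid
    return sample_ids[:count], count
```

The final counter value doubles as the number of work items to launch in the following steps.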


3. GENERATE PRIMARY RAYS

• Execute work item for each sample in the range
• Read sampleID
• Read sample coordinates
• Generate a primary ray
• Store to ray buffer
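Ray generation for one compacted sample could look like this. The camera model is an assumption (a simple pinhole camera looking down the negative z axis); the talk does not describe its camera, and a VR setup would use a separate eye position and projection for each eye.

```python
import math


def primary_ray(sample_xy, fov_y_deg=90.0, aspect=1.0, eye=(0.0, 0.0, 0.0)):
    """Map a normalized sample coordinate in [0, 1]^2 to (origin, unit direction)."""
    x, y = sample_xy
    half = math.tan(math.radians(fov_y_deg) / 2.0)
    dx = (2.0 * x - 1.0) * half * aspect    # left to right
    dy = (1.0 - 2.0 * y) * half             # y = 0 is the top of the image
    dz = -1.0
    inv_len = 1.0 / math.sqrt(dx * dx + dy * dy + dz * dz)
    return eye, (dx * inv_len, dy * inv_len, dz * inv_len)


def generate_rays(sample_ids, samples):
    """One ray per compacted slot: read the sample id, then its coordinates."""
    return [primary_ray(samples[sid]) for sid in sample_ids]
```

Since rays are generated from arbitrary sample positions instead of a pixel grid, an irregular foveated pattern costs nothing extra at this stage.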


4. RAY TRACE

• Execute work item for each generated ray
• Trace ray + shade


5. RECONSTRUCT FRAME BUFFER

• Execute work item for each pixel
• Weighted blend of k neighbors
• Go through list of neighbors and fetch computed pixel color

[Figure: Input and Output]
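The weighted blend can be written in a few lines. In this sketch the neighbor ids and weights are passed as plain lists of lists (a hypothetical layout); each iteration of the outer loop corresponds to one work item.

```python
def reconstruct(neighbor_ids, neighbor_weights, sample_colors):
    """Blend the k neighbor sample colors of every pixel into an RGB frame buffer."""
    frame = []
    for ids, weights in zip(neighbor_ids, neighbor_weights):  # one "work item" per pixel
        r = g = b = 0.0
        for sid, w in zip(ids, weights):
            cr, cg, cb = sample_colors[sid]                   # fetch the shaded sample color
            r += w * cr
            g += w * cg
            b += w * cb
        frame.append((r, g, b))
    return frame
```

If the weights of each pixel sum to 1, the blend preserves overall brightness.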


6. APPLY DISTORTION AND RENDER LR

• Render to LR (left and right eye)
• Execute work item for each pixel in the frame buffer
• Check if it is L or R
• Look up pixel value
• Chromatic separation
• Barrel distortion
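The per-pixel work in this step can be sketched as follows. The radial polynomial is a common model for HMD lens correction, but the coefficients and the per-channel scales below are illustrative values, not taken from the talk, and the side-by-side eye layout is an assumption.

```python
def eye_and_lens_coords(px, py, width, height):
    """Return (eye, u, v): eye 0 = left half, 1 = right half; u, v in [-1, 1] per eye."""
    half = width // 2
    eye = 0 if px < half else 1
    u = ((px - eye * half) + 0.5) / half * 2.0 - 1.0
    v = (py + 0.5) / height * 2.0 - 1.0
    return eye, u, v


def distort(u, v, k1=0.22, k2=0.24, scale=1.0):
    """Radially scale a lens-centered coordinate: r' = r * scale * (1 + k1 r^2 + k2 r^4)."""
    r2 = u * u + v * v
    f = scale * (1.0 + k1 * r2 + k2 * r2 * r2)
    return u * f, v * f


def lookup_coords(u, v, chroma=(0.994, 1.0, 1.014)):
    """One lookup coordinate per color channel (R, G, B) for chromatic separation."""
    return [distort(u, v, scale=s) for s in chroma]
```

Each output pixel then fetches red, green and blue from three slightly different positions of the reconstructed image for its eye.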


RESULT

• # of samples is reduced to 5% compared to full rate shading
• Could make it faster (10~30fps)
• Still not fast enough for VR
• Reduction of more samples?

USING MULTIPLE GPUS FOR LATENCY CRITICAL APPLICATION


HOW TO USE MULTIPLE GPUS

• Alternate frame rendering
  ‒ Assign each frame to one GPU
  ‒ Time to finish a frame doesn't change
• Frame split
  ‒ Split a frame and all GPUs work on the frame
  ‒ Can reduce the time to finish a frame
• Frame split is better for our purpose
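The difference between the two schemes can be put into numbers with an idealized model. The sketch assumes perfect scaling, perfect load balance, and no cost for merging or transferring results, so real figures will be worse; the function names are hypothetical.

```python
def alternate_frame_rendering(frame_ms, num_gpus):
    """Each GPU renders whole frames: throughput scales, latency does not."""
    return {"latency_ms": frame_ms, "fps": 1000.0 * num_gpus / frame_ms}


def frame_split(frame_ms, num_gpus):
    """All GPUs share one frame: both latency and throughput improve (ideal case)."""
    latency = frame_ms / num_gpus
    return {"latency_ms": latency, "fps": 1000.0 / latency}
```

With a 40 ms frame and 4 GPUs, both schemes reach 100 fps in this model, but alternate frame rendering still takes 40 ms to finish each frame while frame split takes 10 ms, which is what matters for a latency critical application such as VR.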


CHALLENGE OF FRAME SPLIT

• Load balancing issue
• One GPU might finish immediately while another keeps running forever
• Workload of each pixel can be different
• Foveated rendering makes it worse
  ‒ Shading point density is not uniform on the screen


SEMI STATIC LOAD BALANCING

• Load balancing once for each frame rendering step
• Use statistics from previous frame to load balance
• Start from even split
• At each frame
  ‒ Render the assigned area
  ‒ Each GPU reports # of samples processed and time to complete the work
  ‒ Compute processing speed for GPU i,
      p_i = n_i / t_i
  ‒ If we use the perfect load balancing, time to finish the work is
      t = sum_i n_i / sum_i p_i
  ‒ The work GPU i can process in time t is
      n_i = t p_i   (the new n_i, used for the next frame)
  ‒ Compute next frame split from the CDF of sample distribution

[Figure: CDF of the sample distribution, axes Area and # of Samples; sample counts n0 to n3 are mapped to area boundaries A0 to A3]
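The per-frame update follows directly from the three formulas. The sketch below computes the target sample counts and then picks split positions from a running sum (the CDF) of samples per screen column. Splitting the screen into vertical strips by column is an assumption made for illustration; the talk only says that the split comes from the CDF of the sample distribution.

```python
def rebalance(n, t):
    """n[i]: samples GPU i processed last frame, t[i]: its time. Returns new targets."""
    p = [ni / ti for ni, ti in zip(n, t)]   # processing speed p_i = n_i / t_i
    t_ideal = sum(n) / sum(p)               # t = sum n_i / sum p_i
    return [t_ideal * pi for pi in p]       # new n_i = t * p_i


def split_from_cdf(column_counts, targets):
    """Return the column index where each GPU's strip ends (the last GPU takes the rest)."""
    boundaries = []
    accumulated = 0.0
    gpu = 0
    goal = targets[0]
    for column, count in enumerate(column_counts):
        accumulated += count                # running CDF of the sample distribution
        while gpu < len(targets) - 1 and accumulated >= goal:
            boundaries.append(column + 1)
            gpu += 1
            goal += targets[gpu]
    return boundaries
```

For two GPUs that each processed 100 samples in 1.0 and 2.0 time units, the targets become roughly 133 and 67 samples, so the faster GPU receives the larger strip in the next frame.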


APPLYING TO FOVEATED RENDERING

• Samples in the assigned area of the frame buffer are not enough
• Samples in the other areas are not in the GPU memory
• We need to reconstruct frame buffer from neighbor samples
• Gather samples which have at least 1 neighbor in the assigned area
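One reading of the last bullet is that a GPU must trace every sample referenced by the neighbor list of at least one pixel it owns, including samples that lie just outside its area. Under that interpretation, and with a hypothetical list-of-lists layout for the neighbor ids, the gather step is a set union:

```python
def samples_for_gpu(pixel_neighbors, assigned_pixels):
    """Return the sorted ids of all samples needed to reconstruct the assigned pixels."""
    needed = set()
    for pixel in assigned_pixels:
        needed.update(pixel_neighbors[pixel])   # k neighbor sample ids of this pixel
    return sorted(needed)
```

Samples near a split boundary end up in the sets of both adjacent GPUs and are traced twice, which is the price for reconstructing each area without exchanging sample colors between GPUs.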


RESULT

• More than 60fps on 4 GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample

Crytek Sponza (0.26M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs
Rungholt (6.7M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs


CLOSING THE TALK

• Showed an example of rendering pipeline 100% written in GPU compute
• Showed how to extend a ray tracer for VR
• Showed a fully manual usage of multiple GPUs
  ‒ ⇔ Fully automatic by driver (Crossfire)


CLOSING THE TALK

• Foveated Real-Time Ray Tracing for Virtual Reality Headset
• Ray Tracing Irregularly Distributed Samples on Multiple GPUs
• http://research.lighttransport.com/foveated-real-time-ray-tracing-for-virtual-reality-headset/index.html
• Thanks to Masahiro Fujita @ Light Transport Entertainment Inc.
