MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos

REALTIME 4K HDR DECODING WITH GPU ACES

GARY DEMOS

IMAGE ESSENCE LLC

4k Realtime (24fps 2D) Image Bandwidth

• EXR half-float (e.g. ACES/OCES) or 16-bit unsigned short integers:
  - 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 24fps = 1.27 GBytes/sec = 10.2 Gbps
• 32-bit floats (used inside OpenCL in the GPU and within most CPU decoding steps):
  - 4 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 24fps = 2.55 GBytes/sec = 20.4 Gbps
• 10-bit DPX-packed pixels:
  - 4 Bytes/3 cols x 3 cols (RGB) x 4096 x 2160 x 24fps = 0.85 GBytes/sec = 6.8 Gbps
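The arithmetic behind these figures can be reproduced in a few lines. This is only an illustrative sketch: the function name `stream_rate` and its parameters are hypothetical, and 1 GByte is taken as 10^9 bytes, which is what the slide's figures imply.

```python
def stream_rate(bytes_per_pixel, width=4096, height=2160, fps=24):
    """Return (GBytes/sec, Gbps) for an uncompressed RGB stream.

    Assumes 1 GByte = 1e9 bytes and 8 bits per byte.
    """
    bytes_per_sec = bytes_per_pixel * width * height * fps
    return bytes_per_sec / 1e9, bytes_per_sec * 8 / 1e9


# Bytes per RGB pixel for the three formats on the slide.
formats = {
    "half-float / 16-bit integer": 2 * 3,  # 2 bytes x 3 colors
    "32-bit float": 4 * 3,                 # 4 bytes x 3 colors
    "10-bit DPX packed": 4,                # 3 colors share one 32-bit word
}

for name, bpp in formats.items():
    gbytes, gbits = stream_rate(bpp)
    print(f"{name}: {gbytes:.2f} GBytes/sec = {gbits:.1f} Gbps")
```

Changing `width`, `height`, or `fps` gives the rate for other formats, for example `stream_rate(6, fps=60)` for 4k half-float at 60fps.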

Future Frontiers

• 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 60fps = 3.19 GBytes/sec = 25.5 Gbps
• 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 120fps = 6.37 GBytes/sec = 51.0 Gbps
• 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 24fps = 5.10 GBytes/sec = 40.8 Gbps
• 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 120fps = 25.48 GBytes/sec = 203.8 Gbps
• 3D: any of the above x2
• DisplayPort 1.2 goes up to 20 Gbps
• A W9000 has six DisplayPort 1.2 outputs
• The demonstration system has four W9000s
• That's 24 DisplayPort 1.2 outputs!
• Total available pixel output is 24 x 20 Gbps = 480 Gbps
• That's more than:
  - 2x (3D) 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 120fps = 51.0 GBytes/sec = 407.7 Gbps!
  - Could work up to this in an array of displays
• Still a few issues (at least for this author):
  - Locking playback speed with pixels from CL
  - Synchronizing audio
Realtime Floating Point ACES Decoding
Including Realtime Interactive Adjustment and RRT/ODT in the GPU

[Block diagram; labels: 2x Intel E5-2690 CPUs; Compressed Bitfiles (SATA FlashRAM); Floating Point Decoding; ACES; 4x FirePro W9000s; Packed Pixels Ready for Display; Fifo of Frames For Smooth Playout; DVS Atomix; 4k Realtime 10/12-bits RGB]

GPU Processing in OpenCL

• Sharpen/soften spatial filter
• Transform to P3 Colorspace
• ASC CDL adjustments
• Transform back to ACES
• RRT and ODT in 3D LUT
• Fix and pack pixels
CPU Partitioning

• Running Scientific Linux 6.4
• Relying on a fifo-of-frames in the DVS Atomix using the FIFO-API to smooth out the non-realtime attributes of Linux
• Multiple decoder processes forked at startup
• Compressed bitfiles are retrieved by each process from SATA FlashRAM/SSD
• The number of decoder processes is selected at runtime startup (tuned for performance and available memory)

CPU Partitioning (cont.)

• Parent process becomes display process
• Display process creates shared memory and sends semaphores to decoder processes that buffers are available
• Each decoder process creates a frame or range of frames
• A display process manages shared memory and DMA to/from GPUs and DVS Atomix
• Display process tells decoder processes when buffers again become available
GPU Partitioning:

• numDevices OpenCL call provides the number of GPUs available
• Vertical screen height partitioned into numDevices
• Four FirePro W9000 GPUs in this demonstration system
• All GPUs share a common “context” and associated “kernels” (one CL interpret)
• Each of the four GPUs given a “command_queue” and separate “cl_mem” buffers
GPU Partitioning (cont.)

• Kernel args for each cl_mem are updated for each of the four GPUs before invoking the kernel with that GPU's command_queue
• Each GPU given 1/4 of screen height via EnqueuedWrites of half-float ACES
• Each GPU's packed pixels retrieved into appropriate quarter of screen height via EnqueuedReads of packed pixels
• Double-buffered DMA (getbuffer/putbuffer) to DVS Atomix using FIFO API (fifo of frames helps smooth Linux non-realtime aspects, yielding realtime)
OpenCL Code:

• Macros are used for all math
• For CPU code, “.h” files are included and macros invoked
• For GPU code, cl includes the same “.h” files, and macros invoked with each cl kernel
• Macros separated into various types:
  - Interaction processing, ACES to/from P3 and ASC_CDL applied in P3
  - RRT (Reference Rendering Transform) processing, using LUT (faster but less accurate, realtime at 4k) or direct computation (slower but highly accurate, realtime at 2k)
  - ODT (Output Device Transform) processing, for the type of ODT selected
OpenCL Code (cont.)

• Final step in cl is 32-bit floats to fix, and RGB packing (either 10 bits or 16 bits), adding +-1/2 lsb noise dither
• OpenCL does not include a random number intrinsic, so random numbers for dithering are DMA'd up to the GPU for use in noise dither, using a randomizing function of frame number and scanline
Reasons for liking OpenCL:

• Support for DEVICE_TYPE_CPU as well as DEVICE_TYPE_GPU
• Vendor independence
• Portability
• Easily extended to automatically utilize multiple GPUs by setting up multiple command queues based upon number of devices detected at runtime
• Runtime interpret is often convenient
• Excellent description of expected precision for math intrinsic functions
• Strong support for both 32-bit and 64-bit floating point
Reasons for liking OpenCL (cont.)

• Well-thought-out device and system query capabilities
• get_global_id provides an excellent mechanism for parallelism without requiring further consideration of lower level hardware organization
• Easy specification of global, constant, and local datatypes
• Pipelining control via blocking and non-blocking read and write queues and via clFinish and kernel barriers
• First-class support of half-float using vload_half and vstore_half
Weaknesses of OpenCL (aka “wish list”):

• Difficult to obtain visibility during debugging (although print statements available on some systems with DEVICE_TYPE_CPU)
• No detail provided by “out of resources” error (e.g. what resources are we out of?)

Weaknesses of OpenCL (aka “wish list”, cont.):

• Lack of visibility during performance tuning
  - How much time is being spent in read/write queues to/from CPU?
  - How full are global and constant memory?
  - How much global memory bandwidth is being utilized?
  - How full are registers?
  - If caches are present, how effective are they on a given kernel?
  - Are there unnecessary waits that could be async overlapped?
• The 4, 8, 16 CL SIMD types are not mirrored in CPU SSE/AVX/F16 intrinsics.
  - Were they to be identical, they could be used in macros that are included in common between CL kernels and CPU threads
System Performance:

• Limited by memory and bus bandwidth issues
• DirectGMA will improve this
• Plenty of GPU power still available for realtime 4k processing when using 3D LUT RRT/ODT
• CPU power sufficient for wavelet-only floating point decoding at 4k
• CPU power sufficient for motion-compensated flowfield sinc-and-wavelet full configuration at 2k. Speed is about 1/3 realtime at 4k.
• With threads and forked processes, will be able to take advantage of anticipated major increase in computational cores
CL/GL Interop Exploration:

• Using X11 on Linux (no glut support)
• Get 10-bit depth at setup from X11 (as configuration using glXChooseFBConfig)
• Uses GL, GLX, and CL/GL context (some of this is recent, as of CL 1.2)
• Improves (reduces) memory transfer amount required by direct output from GPU
• Can take over the screen (using X11 XChangeProperty)
• Relies on “FrameBufferObject” and “Acquire” and “Release” by CL (Release by CL implies re-acquire by GL, must clFinish and glFinish correspondingly)
• Can support 4k at 10 bits via DisplayPort 1.2 (and HDMI 1.4a via DP to HDMI dongle)
• Reportedly can be used with MacOSX and Windows (with X11-style constructs)

CL/GL Interop Weaknesses:

• Limited to single GPU for CL when using a CL/GL FBO
  - Would be nice to have separate FBO quadrant output from each of the four GPUs
• Not smooth timing if GPU running near capacity
• No locked audio sync
• No “fifo-of-frames” to smooth out the non-smooth-non-realtime Linux behavior
  - Working on using cl_gl event sync to simulate this
Attributes of the floating point codec

• Layered with 5 layers up to base layer at 1k using wavelets
• Two more layers from 1k to 2k and 2k to 4k built with sinc filters, using wavelet stacks to code the up-res deltas
• Base and up-res layers can be motion compensated (sinc filter is phase-neutral and sub-pixel displacement precision to 1/100 pixel)
Floating Point Codec (cont.)

• Flowfield is used at low resolution for motion displacement, coded also as wavelet stack. Upsized for each layer when applied.
• Floating point coding is automatically adaptive to gamma, since a floating point quantization scale is used for each image region using the average and minimum brightness
• YUV encoding takes advantage of codec's unlimited range and negative number reproduction to support full ACES gamut and dynamic range
Frontiers for using the available GPU power:

• Spectral color processing to improve upon CIE 1931 limitations
• More ODTs to take advantage of new HD and UHD displays and new projectors and projection light sources as they increase dynamic range and gamut
• More processing in the pipeline
  - more elaborate sharpening
  - dynamic range regional contrast adaptation
  - additional interactive controls
  - adaptation to viewing surround (if not dark surround)
• Additional work on the RRT, and on existing ODT types (in conjunction with the RRT algorithmic modifications)
Many Thanks to the AMD/ATI FirePro Professional Graphics Group For Their Support

Many Thanks to AMD/APU team for providing 4k Display

Thanks also to R&S/DVS

ACES Overview:

http://www.oscars.org/science-technology/council/projects/pdf/ACESOverview.pdf

Reference papers for Gary Demos:

• The Unfolding Merger of Television and Movie Technology, SMPTE Conference, Oct 2012
• File and Folder Interactive Decoding, SMPTE Conference, Oct 2011, including YouTube Video: http://www.youtube.com/watch?v=Ggt_8qseGtw
• Layered Motion Compensation, SMPTE Journal, Jan 2009

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

NOVEMBER 19, 2013
