MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
1. REALTIME 4K HDR DECODING WITH GPU ACES
GARY DEMOS, IMAGE ESSENCE LLC
2. 4k Realtime (24fps 2D) Image Bandwidth
• EXR half-float (e.g. ACES/OCES) or 16-bit unsigned short integers:
  - 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 24fps = 1.27 GBytes/sec = 10.2 Gbps
• 32-bit floats (used inside OpenCL in the GPU and within most CPU decoding steps):
  - 4 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 24fps = 2.55 GBytes/sec = 20.4 Gbps
• 10-bit DPX-packed pixels:
  - 4 Bytes/3 cols x 3 cols (RGB) x 4096 x 2160 x 24fps = 0.85 GBytes/sec = 6.8 Gbps
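The figures on this slide follow from width x height x frame rate x bytes per pixel. A minimal sketch of that arithmetic is given here, assuming decimal units (1 GByte = 10^9 bytes), which is what the slide's numbers imply; the function and dictionary names are illustrative only.

```python
def image_bandwidth(width, height, fps, bytes_per_pixel):
    """Return (GBytes/sec, Gbps) for an uncompressed image stream."""
    bytes_per_sec = width * height * fps * bytes_per_pixel
    return bytes_per_sec / 1e9, bytes_per_sec * 8 / 1e9

# Bytes per RGB pixel for the three formats listed on the slide.
FORMATS = {
    "half-float / 16-bit": 2 * 3,   # 2 bytes per color component
    "32-bit float": 4 * 3,          # 4 bytes per color component
    "10-bit DPX packed": 4,         # three 10-bit components in one 32-bit word
}

def table_4k_24fps():
    """Bandwidth of each format at 4096 x 2160, 24 fps."""
    return {name: image_bandwidth(4096, 2160, 24, bpp)
            for name, bpp in FORMATS.items()}

for name, (gbytes, gbits) in table_4k_24fps().items():
    print(f"{name}: {gbytes:.2f} GBytes/sec = {gbits:.1f} Gbps")
```

The same function covers higher frame rates and 8k by changing the width, height, and fps arguments.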
3. Future Frontiers
• 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 60fps = 3.19 GBytes/sec = 25.5 Gbps
• 2 Bytes/col x 3 cols (RGB) x 4096 x 2160 x 120fps = 6.37 GBytes/sec = 51.0 Gbps
• 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 24fps = 5.10 GBytes/sec = 40.8 Gbps
• 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 120fps = 25.48 GBytes/sec = 203.8 Gbps
• 3D: any of the above x2
4.
• DisplayPort 1.2 goes up to 20 Gbps
• A W9000 has six DisplayPort 1.2 outputs
• The demonstration system has four W9000s
• That's 24 DisplayPort 1.2 outputs!
• Total available pixel output is 24 x 20 Gbps = 480 Gbps
5.
• That's more than:
  - 2x (3D) 2 Bytes/col x 3 cols (RGB) x 8192 x 4320 x 120fps = 51.0 GBytes/sec = 407.7 Gbps!
  - Could work up to this in an array of displays
• Still a few issues (at least for this author):
  - Locking playback speed with pixels from CL
  - Synchronizing audio
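The comparison between stream bandwidth and display output can be sketched as a link count. This is only a rough budget: it takes the 20 Gbps per DisplayPort 1.2 output quoted on the slides at face value and ignores how pixels would actually be tiled across an array of displays. All names are illustrative.

```python
import math

DP12_GBPS = 20.0          # per-output figure quoted on the slides
OUTPUTS_AVAILABLE = 4 * 6  # four W9000s, six DisplayPort 1.2 outputs each

def stream_gbps(width, height, fps, bytes_per_component=2, components=3, views=1):
    """Uncompressed stream rate in Gbps; views=2 models stereoscopic 3D."""
    bytes_per_sec = bytes_per_component * components * width * height * fps * views
    return bytes_per_sec * 8 / 1e9

def links_needed(gbps, link_gbps=DP12_GBPS):
    """Smallest whole number of links whose combined rate covers the stream."""
    return math.ceil(gbps / link_gbps)

rate = stream_gbps(8192, 4320, 120, views=2)
print(f"8k 120fps 3D: {rate:.1f} Gbps, {links_needed(rate)} of {OUTPUTS_AVAILABLE} outputs")
```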
6. Realtime Floating Point ACES Decoding, Including Realtime Interactive Adjustment and RRT/ODT in the GPU
[Block diagram; labels recovered from the slide:]
• Compressed bitfiles (SATA FlashRAM)
• 2x Intel E5-2690 CPUs: floating point decoding to ACES
• 4x FirePro W9000s: GPU processing in OpenCL
  - Sharpen/soften spatial filter
  - Transform to P3 colorspace
  - ASC CDL adjustments
  - Transform back to ACES
  - RRT and ODT in 3D LUT
  - Fix and pack pixels
• Packed pixels ready for display
• DVS Atomix: fifo of frames for smooth playout, 4k realtime 10/12-bit RGB
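The ASC CDL step in the GPU list above can be illustrated with the published slope/offset/power and saturation formulas. The sketch below is not the presenter's OpenCL macro. It assumes Rec. 709 luma weights for saturation, as in the ASC CDL definition, and clamps only negative values before the power function, an assumption chosen here so that values above 1.0 survive.

```python
REC709_LUMA = (0.2126, 0.7152, 0.0722)

def asc_cdl(rgb, slope, offset, power, saturation=1.0):
    """Apply ASC CDL slope, offset, power, then saturation to one RGB pixel."""
    sop = []
    for value, s, o, p in zip(rgb, slope, offset, power):
        x = value * s + o
        x = max(x, 0.0)          # assumption: clamp negatives only, keep highlights
        sop.append(x ** p)
    luma = sum(w * c for w, c in zip(REC709_LUMA, sop))
    # saturation = 1.0 leaves the pixel unchanged, 0.0 collapses it to luma
    return tuple(luma + saturation * (c - luma) for c in sop)

identity = asc_cdl((0.18, 0.5, 0.9), (1, 1, 1), (0, 0, 0), (1, 1, 1))
print(identity)
```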
7. CPU Partitioning
• Running Scientific Linux 6.4
• Relying on a fifo-of-frames in the DVS Atomix, using the FIFO API, to smooth out the non-realtime attributes of Linux
• Multiple decoder processes forked at startup
• Compressed bitfiles are retrieved by each process from SATA FlashRAM/SSD
• The number of decoder processes is selected at runtime startup (tuned for performance and available memory)
8. CPU Partitioning (cont.)
• Parent process becomes display process
• Display process creates shared memory and signals decoder processes, via semaphores, that buffers are available
• Each decoder process creates a frame or range of frames
• A display process manages shared memory and DMA to/from GPUs and DVS Atomix
• Display process tells decoder processes when buffers again become available
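The buffer handshake between the display process and the decoder processes can be sketched as follows. This is a simplified model, not the presenter's code: threads and `queue.Queue` stand in for forked processes, shared memory, and semaphores, and the round-robin assignment of frames to decoders is an assumption. The display side hands each free buffer to the decoder responsible for the next unassigned frame, which keeps frames in order and avoids starving the frame that is due next.

```python
import queue
import threading

def run_pipeline(num_frames, num_decoders, num_buffers):
    """Return the frames in the order the display side presented them."""
    buffers = [None] * num_buffers                       # stand-in for shared memory
    work = [queue.Queue() for _ in range(num_decoders)]  # "buffer available" signals
    ready = queue.Queue()                                # "frame decoded" notices

    def decoder(worker_id):
        while True:
            item = work[worker_id].get()
            if item is None:                 # shutdown sentinel
                return
            frame, buf = item
            buffers[buf] = f"frame-{frame}"  # stand-in for decoding a bitfile
            ready.put((frame, buf))

    threads = [threading.Thread(target=decoder, args=(i,)) for i in range(num_decoders)]
    for t in threads:
        t.start()

    next_assign = 0

    def assign(buf):
        """Give a free buffer to the decoder that owns the next unassigned frame."""
        nonlocal next_assign
        if next_assign < num_frames:
            work[next_assign % num_decoders].put((next_assign, buf))
            next_assign += 1

    for buf in range(num_buffers):
        assign(buf)

    shown, pending, next_show = [], {}, 0
    while next_show < num_frames:
        frame, buf = ready.get()
        pending[frame] = buf
        while next_show in pending:          # present strictly in frame order
            buf = pending.pop(next_show)
            shown.append(buffers[buf])
            next_show += 1
            assign(buf)                      # buffer becomes available again

    for q in work:
        q.put(None)
    for t in threads:
        t.join()
    return shown

print(run_pipeline(num_frames=6, num_decoders=3, num_buffers=4))
```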
9. GPU Partitioning:
• numDevices OpenCL call provides the number of GPUs available
• Vertical screen height partitioned into numDevices
• Four FirePro W9000 GPUs in this demonstration system
• All GPUs share a common “context” and associated “kernels” (one CL interpret)
• Each of the four GPUs given a “command_queue” and separate “cl_mem” buffers
10. GPU Partitioning (cont.)
• Kernel args for each cl_mem are updated for each of the four GPUs before invoking the kernel with that GPU’s command_queue
• Each GPU given 1/4 of screen height via EnqueuedWrites of half-float ACES
• Each GPU’s packed pixels retrieved into appropriate quarter of screen height via EnqueuedReads of packed pixels
• Double-buffered DMA (getbuffer/putbuffer) to DVS Atomix using FIFO API (fifo of frames helps smooth Linux non-realtime aspects, yielding realtime)
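Splitting the screen height across the detected devices reduces to computing a starting row and a row count per GPU. A minimal sketch, with hypothetical function names, is shown here; giving any remainder rows to the first devices is an assumption for heights that do not divide evenly (2160 rows over four GPUs does).

```python
def partition_rows(height, num_devices):
    """Return a (first_row, row_count) band for each device, top to bottom."""
    base, extra = divmod(height, num_devices)
    bands = []
    start = 0
    for device in range(num_devices):
        rows = base + (1 if device < extra else 0)  # spread any remainder rows
        bands.append((start, rows))
        start += rows
    return bands

def band_bytes(rows, width, bytes_per_component=2, components=3):
    """Size of one band of half-float RGB, i.e. what one write to a GPU would carry."""
    return rows * width * bytes_per_component * components

for first_row, rows in partition_rows(2160, 4):
    print(first_row, rows, band_bytes(rows, 4096))
```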
11. OpenCL Code:
• Macros are used for all math
• For CPU code, “.h” files are included and macros invoked
• For GPU code, cl includes the same “.h” files, and macros invoked with each cl kernel
• Macros separated into various types:
  - Interaction processing, ACES to/from P3 and ASC_CDL applied in P3
  - RRT (Reference Rendering Transform) processing, using LUT (faster but less accurate, realtime at 4k) or direct computation (slower but highly accurate, realtime at 2k)
  - ODT (Output Device Transform) processing, for the type of ODT selected
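The LUT path trades accuracy for speed because each output pixel is interpolated from a coarse 3D grid instead of being computed directly. The sketch below shows generic trilinear interpolation in a 3D LUT. It is not the presenter's implementation, and it assumes inputs already mapped to [0, 1]; real ACES values exceed that range and would need a shaper function before the lookup.

```python
def make_identity_lut(n):
    """Build an n x n x n LUT whose entry at [r][g][b] is the grid point itself."""
    step = 1.0 / (n - 1)
    return [[[(r * step, g * step, b * step) for b in range(n)]
             for g in range(n)]
            for r in range(n)]

def apply_lut_trilinear(lut, rgb):
    """Look up one RGB pixel, blending the 8 surrounding grid entries."""
    n = len(lut)
    index, frac = [], []
    for c in rgb:
        x = min(max(c, 0.0), 1.0) * (n - 1)   # assumption: clamp to the LUT domain
        i = min(int(x), n - 2)                # lower corner of the enclosing cell
        index.append(i)
        frac.append(x - i)
    (i, j, k), (fr, fg, fb) = index, frac
    out = []
    for channel in range(3):
        acc = 0.0
        for dr in (0, 1):
            wr = fr if dr else 1.0 - fr
            for dg in (0, 1):
                wg = fg if dg else 1.0 - fg
                for db in (0, 1):
                    wb = fb if db else 1.0 - fb
                    acc += wr * wg * wb * lut[i + dr][j + dg][k + db][channel]
        out.append(acc)
    return tuple(out)

lut = make_identity_lut(17)
print(apply_lut_trilinear(lut, (0.3, 0.5, 0.9)))
```

With an identity LUT the output matches the input up to rounding; a rendering LUT would store the RRT/ODT result at each grid point instead.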
12. OpenCL Code (cont.)
• Final step in cl is conversion of 32-bit floats to fixed point and RGB packing (either 10 bits or 16 bits), adding ±1/2 LSB noise dither
• OpenCL does not include a random number intrinsic, so random numbers for dithering are DMA’d up to the GPU for use in noise dither, using a randomizing function of frame number and scanline
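The fix-and-pack step can be sketched on the CPU side as follows. Several details are assumptions rather than the presenter's method: the seeding scheme that derives noise from frame number and scanline is hypothetical, and the 10-bit layout (R in the top bits, two padding bits at the bottom of a 32-bit word) is the common DPX "filled" arrangement, which the slides do not spell out.

```python
import math
import random

def dither_noise(frame, scanline, width):
    """Noise in [-0.5, 0.5) LSB per pixel, reproducible from (frame, scanline)."""
    rng = random.Random(frame * 1_000_003 + scanline)  # hypothetical seeding scheme
    return [rng.random() - 0.5 for _ in range(width)]

def to_fixed(value, noise, bits=10):
    """Quantize a float in [0, 1] to an unsigned code, adding dither before rounding."""
    max_code = (1 << bits) - 1
    code = math.floor(value * max_code + 0.5 + noise)
    return min(max(code, 0), max_code)

def pack_rgb10(r, g, b):
    """Pack three 10-bit codes into one 32-bit word (assumed DPX-style layout)."""
    return (r << 22) | (g << 12) | (b << 2)

noise = dither_noise(frame=1, scanline=0, width=4)
codes = [to_fixed(0.18, n) for n in noise]
print(codes, hex(pack_rgb10(codes[0], codes[1], codes[2])))
```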
13. Reasons for liking OpenCL:
• Support for DEVICE_TYPE_CPU as well as DEVICE_TYPE_GPU
• Vendor independence
• Portability
• Easily extended to automatically utilize multiple GPUs by setting up multiple command queues based upon number of devices detected at runtime
• Runtime interpret is often convenient
• Excellent description of expected precision for math intrinsic functions
• Strong support for both 32-bit and 64-bit floating point
14. Reasons for liking OpenCL (cont.)
• Well-thought-out device and system query capabilities
• getGlobalID (get_global_id in OpenCL C) provides an excellent mechanism for parallelism without requiring further consideration of lower level hardware organization
• Easy specification of global, constant, and local datatypes
• Pipelining control via blocking and non-blocking read and write queues and via clFinish and kernel barriers
• First-class support of half-float using vload_half and vstore_half
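In OpenCL C, vload_half and vstore_half convert between 16-bit half-float storage and 32-bit float arithmetic. The same round trip can be illustrated on the CPU with Python's `struct` module, whose `e` format is IEEE 754 half precision. This is only an analogy for what the GPU-side conversion does, not OpenCL code, and the helper names are illustrative.

```python
import struct

def to_half_and_back(x):
    """Round a float to IEEE 754 half precision, then widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def half_bits(x):
    """Return the 16-bit pattern that stores x as a half-float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

for value in (1.0, 0.18, 65504.0):
    stored = to_half_and_back(value)
    print(f"{value} -> 0x{half_bits(value):04X} -> {stored}")
```

Half-float keeps 10 mantissa bits, so the relative rounding error stays below 2^-11 for normal values, and 65504 is the largest finite value.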
15. Weaknesses of OpenCL (aka “wish list”):
• Difficult to obtain visibility during debugging (although print statements available on some systems with DEVICE_TYPE_CPU)
• No detail provided by “out of resources” error (e.g. what resources are we out of?)
16. Weaknesses of OpenCL (aka “wish list”, cont.):
• Lack of visibility during performance tuning
  - How much time is being spent in read/write queues to/from CPU?
  - How full are global and constant memory?
  - How much global memory bandwidth is being utilized?
  - How full are registers?
  - If caches are present, how effective are they on a given kernel?
  - Are there unnecessary waits that could be async overlapped?
• The 4, 8, 16 CL SIMD types are not mirrored in CPU SSE/AVX/F16 intrinsics.
  - Were they to be identical, they could be used in macros that are included in common between CL kernels and CPU threads
17. System Performance:
• Limited by memory and bus bandwidth issues
• DirectGMA will improve this
• Plenty of GPU power still available for realtime 4k processing when using 3D LUT RRT/ODT
• CPU power sufficient for wavelet-only floating point decoding at 4k
• CPU power sufficient for motion-compensated flowfield sinc-and-wavelet full configuration at 2k. Speed is about 1/3 realtime at 4k.
• With threads and forked processes, will be able to take advantage of anticipated major increase in computational cores
18. CL/GL Interop Exploration:
• Using X11 on Linux (no glut support)
• Get 10-bit depth at setup from X11 (as configuration using glXChooseFBConfig)
• Uses GL, GLX, and CL/GL context (some of this is recent, as of CL 1.2)
• Improves (reduces) memory transfer amount required by direct output from GPU
• Can take over the screen (using X11 XChangeProperty)
• Relies on “FrameBufferObject” and “Acquire” and “Release” by CL (Release by CL implies re-acquire by GL, must clFinish and glFinish correspondingly)
• Can support 4k at 10 bits via DisplayPort 1.2 (and HDMI 1.4a via DP to HDMI dongle)
• Reportedly can be used with MacOSX and Windows (with X11-style constructs)
19. CL/GL Interop Weaknesses:
• Limited to single GPU for CL when using a CL/GL FBO
  - Would be nice to have separate FBO quadrant output from each of the four GPUs
• Not smooth timing if GPU running near capacity
• No locked audio sync
• No “fifo-of-frames” to smooth out the non-smooth-non-realtime Linux behavior
  - Working on using cl_gl event sync to simulate this
20. Attributes of the floating point codec
• Layered with 5 layers up to base layer at 1k using wavelets
• Two more layers from 1k to 2k and 2k to 4k built with sinc filters, using wavelet stacks to code the up-res deltas
• Base and up-res layers can be motion compensated (sinc filter is phase-neutral and sub-pixel displacement precision to 1/100 pixel)
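The idea of coding each resolution step as a delta against an up-sized lower layer can be shown with a toy one-dimensional example. This sketch deliberately swaps in pair averaging for the down-size and sample repetition for the up-size, where the codec described above uses sinc filters and wavelet stacks. It only illustrates the layering structure, and it assumes the signal length is divisible by 2 for every level.

```python
def downsample2(signal):
    """Halve the resolution by averaging neighboring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample2(signal):
    """Double the resolution by repeating each sample."""
    out = []
    for value in signal:
        out.extend((value, value))
    return out

def encode_layers(signal, levels):
    """Return (base, deltas): the coarsest layer plus one delta per up-res step."""
    deltas = []
    current = list(signal)
    for _ in range(levels):
        lower = downsample2(current)
        predicted = upsample2(lower)
        deltas.append([c - p for c, p in zip(current, predicted)])
        current = lower
    return current, deltas[::-1]   # deltas ordered from coarse to fine

def decode_layers(base, deltas):
    """Rebuild full resolution by up-sizing and adding each delta in turn."""
    current = list(base)
    for delta in deltas:
        current = [p + d for p, d in zip(upsample2(current), delta)]
    return current

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
base, deltas = encode_layers(signal, levels=2)
print(base, decode_layers(base, deltas) == signal)
```

A decoder can stop after any layer to obtain a lower-resolution picture, which is the practical appeal of a layered structure.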
21. Floating Point Codec (cont.)
• Flowfield is used at low resolution for motion displacement, coded also as wavelet stack. Upsized for each layer when applied.
• Floating point coding is automatically adaptive to gamma, since a floating point quantization scale is used for each image region using the average and minimum brightness
• YUV encoding takes advantage of codec’s unlimited range and negative number reproduction to support full ACES gamut and dynamic range
22. Frontiers for using the available GPU power:
• Spectral color processing to improve upon CIE 1931 limitations
• More ODTs to take advantage of new HD and UHD displays and new projectors and projection light sources as they increase dynamic range and gamut
• More processing in the pipeline
  - more elaborate sharpening
  - dynamic range regional contrast adaptation
  - additional interactive controls
  - adaptation to viewing surround (if not dark surround)
• Additional work on the RRT, and on existing ODT types (in conjunction with the RRT algorithmic modifications)
23. Many Thanks to the AMD/ATI FirePro Professional Graphics Group For Their Support
Many Thanks to AMD/APU team for providing 4k Display
Thanks also to R&S/DVS
ACES Overview: http://www.oscars.org/science-technology/council/projects/pdf/ACESOverview.pdf
Reference papers for Gary Demos:
• The Unfolding Merger of Television and Movie Technology, SMPTE Conference, Oct 2012
• File and Folder Interactive Decoding, SMPTE Conference, Oct 2011, including YouTube Video: http://www.youtube.com/watch?v=Ggt_8qseGtw
• Layered Motion Compensation, SMPTE Journal, Jan 2009