This document discusses optimizing FFmpeg and HandBrake using OpenCL. It describes FFmpeg as a popular open-source multimedia software library used for recording, converting, and streaming audio and video. FFmpeg was optimized for heterogeneous computing by accelerating video decoding and encoding with hardware accelerators and by running video processing filters on the GPU. Specific filters were implemented in OpenCL for better performance than their CPU versions. Performance tests showed that the accelerated FFmpeg achieved significantly higher frame rates than the original CPU-only FFmpeg.
2. FFMPEG INTRODUCTION
• FFmpeg is a very popular open-source multimedia software library used to record, convert and stream audio and video.
• Used by popular open-source projects like HandBrake, VLC player, Chrome, etc.
• Single-stop solution for
  ‒ Decoding different codec formats (audio and video)
  ‒ Handling various container formats (mp4, wmv, avi, m2ts, m2ps, etc.)
  ‒ Encoding to popular video and audio codec formats (H.264, VC-1, MPEG-2, etc.)
  ‒ Different video filtering algorithms (Deshake, Scale, Unsharp, etc.)
  ‒ Managing different pixel formats (NV12, RGB, YV12, etc.)
  ‒ Cross-platform support (Windows and Linux)
2 | PRESENTATION TITLE | November 19, 2013 | CONFIDENTIAL
3. FFMPEG – TYPICAL USAGE SCENARIO AND PROCESSING INVOLVED
Imagine a video edit using FFmpeg:
Video Decode -> Video shake removal -> Sharp/Blur -> Scale -> Video Encode
4. FFMPEG – TYPICAL USAGE SCENARIO AND PROCESSING INVOLVED
Imagine a video edit using FFmpeg:
Video Decode -> Video shake removal -> Sharp/Blur -> Scale -> Video Encode
AMD APU heterogeneous solution: HW Decoder, GPU, CPU, HW Encoder
5. FFMPEG – SCOPE FOR ACCELERATION
Leverage heterogeneous compute
• Accelerate video decode and encode using HW accelerators
  ‒ Load on CPU to perform decode and encode is taken off
  ‒ Power savings => longer battery life
• Accelerate video processing filters using GPU
  ‒ Increased performance compared to CPU implementation
  ‒ Application runs at higher fps
  ‒ Possible to apply more filters to achieve better video quality
• Use CPU for serial processing and control
  ‒ Efficient usage of resources
6. FFMPEG – OUR WORK
• AMD and MulticoreWare Inc. worked on accelerating FFmpeg
• Enable usage of hardware decoder
  ‒ To support decoding of H.264, VC-1, MPEG-2 and MPEG-4 Part 2 codecs
  ‒ Windows
    ‒ Integration of DXVA2 API into ffmpeg.exe
    ‒ DXVA2 functionality already available in ffmpeg's libavcodec library
    ‒ Extremely difficult for application developers to make use of DXVA2 API in libavcodec
    ‒ Needs deep understanding of DXVA2 API and specific codec-level knowledge
    ‒ Coded up all the necessary steps needed to use HW decoder using DXVA2 in ffmpeg.exe app
    ‒ Created a command-line option for ffmpeg.exe to enable usage of HW-assisted decode
• Make use of DirectX® 9 to OpenCL™ interop APIs available in OpenCL™ 1.2
  ‒ This ensures the decoded frame is retained in GPU memory and passed on to OpenCL™ filter
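As a rough illustration of what a command-line switch for hardware-assisted decode can look like, the sketch below assembles an ffmpeg argument list in Python. The `-hwaccel dxva2` option is the one found in later upstream FFmpeg builds on Windows; whether it is the same option described on this slide is an assumption. The helper `build_transcode_args` and the file names are hypothetical, and nothing is executed.

```python
from typing import List, Optional


def build_transcode_args(src: str, dst: str, hwaccel: Optional[str] = None) -> List[str]:
    """Assemble an ffmpeg argument list (hypothetical helper, nothing is run)."""
    args = ["ffmpeg"]
    if hwaccel:
        # -hwaccel is an input option, so it has to come before the -i it applies to.
        args += ["-hwaccel", hwaccel]
    args += ["-i", src, dst]
    return args


# Placeholder file names; "dxva2" selects the DXVA2 decoder path on Windows builds.
print(" ".join(build_transcode_args("input.mp4", "output.mp4", hwaccel="dxva2")))
```

Passing `hwaccel=None` yields the plain software-decode command, which makes it easy to compare both paths on the same input.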
7. FFMPEG – OUR WORK
• Introduced OpenCL™ in FFmpeg
  ‒ Created OpenCL™ infrastructure in libavutil to enable usage of OpenCL™ in FFmpeg
• Acceleration of video processing filters on GPU using OpenCL™
  ‒ Added OpenCL™ implementation for the following filters in libavfilter
    ‒ Deshake - This filter helps remove camera shake from hand-holding a camera, moving on a vehicle, etc.
    ‒ Unsharp - Sharpen or blur the input video
    ‒ Scale - Scale (resize) the input video
    ‒ Denoise - High precision/quality 3D denoise filter. This filter aims to reduce image noise, producing smooth images
    ‒ Yadif - Deinterlace the input video
    ‒ Tinterlace - Temporal field interlacing
    ‒ Gradfun - Fix the banding artifacts introduced by truncation to 8-bit color depth
• Optimization of FFmpeg pipeline to run decode, filters and encode in parallel
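Running decode, filters and encode in parallel can be pictured as a three-stage pipeline joined by bounded queues. The sketch below is a minimal Python illustration of that idea, not FFmpeg's actual scheduler: `run_pipeline` and `stage` are hypothetical names, and the three stage functions are stand-ins for work that would really run on the hardware decoder, the GPU and the encoder.

```python
import queue
import threading

SENTINEL = object()  # unique end-of-stream marker


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(work(item))


def run_pipeline(packets, decode, apply_filters, encode, depth=4):
    """Push packets through three concurrently running stages, preserving order."""
    q_in, q_dec, q_flt, q_out = (queue.Queue(maxsize=depth) for _ in range(4))

    def feed():
        for packet in packets:
            q_in.put(packet)
        q_in.put(SENTINEL)

    workers = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(decode, q_in, q_dec)),
        threading.Thread(target=stage, args=(apply_filters, q_dec, q_flt)),
        threading.Thread(target=stage, args=(encode, q_flt, q_out)),
    ]
    for w in workers:
        w.start()

    results = []
    while True:  # drain the last queue while the stages keep working
        item = q_out.get()
        if item is SENTINEL:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


# Stand-in stage functions: while frame N is "encoded", frame N+1 can be "filtered".
print(run_pipeline(range(5), lambda p: p * 2, lambda f: f + 1, lambda f: str(f)))
```

The bounded queues (`depth`) limit how many frames are in flight, which matters when each frame occupies scarce device memory.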
8. FFMPEG – PERFORMANCE
• Performance numbers of transcode pipeline using FFmpeg on A10-6800K APU
[Bar chart: FPS (axis 0 to 60), Accelerated ffmpeg vs. Original ffmpeg (CPU). Data labels that survive in the extracted text: 55, 57, 29, 22, 23, 16, 1.3, 1.2; the mapping of labels to bars is not recoverable.]
9. FFMPEG – STATUS
• FFmpeg 2.0 contains OpenCL work
  ‒ OpenCL framework in libavutil
  ‒ Deshake and unsharp OpenCL implementations in libavfilter
• DXVA2 patch is under review
• Further optimizations and tuning in progress for other filters.
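For readers who want to try the OpenCL paths that landed in FFmpeg 2.0, the sketch below builds a filter string in Python. It relies on the `opencl` option that the FFmpeg 2.x documentation lists for the deshake and unsharp filters (available only when FFmpeg is configured with `--enable-opencl`); later FFmpeg releases reorganized their OpenCL filters, so treat this as version-specific. The helper `opencl_filtergraph` and the file names are hypothetical.

```python
def opencl_filtergraph(filters):
    """Join filter names into a -vf string, requesting each filter's OpenCL path."""
    return ",".join(f"{name}=opencl=1" for name in filters)


# Placeholder file names; the command is only printed, not executed.
args = ["ffmpeg", "-i", "input.mp4",
        "-vf", opencl_filtergraph(["deshake", "unsharp"]),
        "output.mp4"]
print(" ".join(args))
```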
10. FFMPEG – CHALLENGES
• Introducing OpenCL into FFmpeg
  ‒ Reviewers were not well versed with OpenCL
• Retaining data on GPU memory in the pipeline
  ‒ FFmpeg software architectural changes needed for this
• Recompilation of kernels on every run
  ‒ FFmpeg does not allow saving compiled binary files on local machine
• FFmpeg software needs pipeline-level optimizations to take benefit of heterogeneous platform
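The recompilation cost is usually avoided with an on-disk cache of compiled kernel binaries, which is exactly what the slide says FFmpeg does not permit. The sketch below shows the general shape of such a cache in Python, as an illustration of the technique rather than anything FFmpeg does. `cache_key`, `load_or_build` and `compile_fn` are hypothetical names; in a real OpenCL host program the compile step would correspond to building the program and reading it back via `clGetProgramInfo` with `CL_PROGRAM_BINARIES`, and the load step to `clCreateProgramWithBinary`.

```python
import hashlib
import os
import tempfile


def cache_key(source: str, device_name: str, build_options: str = "") -> str:
    """Hash everything that can change the compiled binary."""
    h = hashlib.sha256()
    for part in (source, device_name, build_options):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


def load_or_build(source, device_name, build_options, compile_fn, cache_dir):
    """Return (binary, was_cached); compile_fn stands in for the real compiler."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(source, device_name, build_options) + ".bin")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read(), True
    binary = compile_fn(source, build_options)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(binary)
    os.replace(tmp_path, path)  # publish the finished file in one step
    return binary, False


# Demo with a fake compiler: the second call is served from disk.
with tempfile.TemporaryDirectory() as demo_dir:
    fake_compile = lambda src, opts: b"BINARY:" + src.encode("utf-8")
    for _ in range(2):
        _, cached = load_or_build("kernel source", "some device", "", fake_compile, demo_dir)
        print("cached" if cached else "compiled")
```

Keying on the device name and build options as well as the source matters because a binary built for one GPU or driver is generally not valid for another.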
11. FFMPEG – FUTURE WORK
• Add support for HW-assisted encode (H.264)
  ‒ AMD is going to release a C++ API, called AMF, to access the HW encoder
  ‒ More details available in the talk tomorrow: Innovating with AMD Multimedia Technologies (MM-4095)
• Optimize OpenCL implementation of filters for better performance
• Explore using HSA features to boost performance
• Optimize memory transfers
  ‒ Retain buffers on device memory across Decode, Filter and Encode modules
13. WHAT IS HANDBRAKE?
• Open-source video transcoder
• Converts videos from most popular formats
• Selectable output format and bitrates
• Video resizing
• Video filters
  ‒ Deinterlacing
  ‒ Decomb
  ‒ Deblock
  ‒ Grayscale
  ‒ Cropping
14. CURRENT ENHANCEMENTS
• Hardware video decode
  ‒ Input video decoded via DXVA2
  ‒ Utilizes UVD on AMD GPUs and APUs
• OpenCL™ accelerated video resolution changes
  ‒ Video frames are resized using OpenCL kernels
  ‒ Example: 1080p converted to 720p
15. IMPROVING OPENCL SCALING
• The OpenCL scaling enhancement was under-performing
• Identified issues:
  ‒ Image format conversion
  ‒ Buffer staging
  ‒ Separable scaling using two kernels
16. OPENCL SCALING IMPROVEMENTS
Reduce memory copies:
• Modify the existing HandBrake buffer system
• Identify which buffers will contain video data (vs. audio, captions, etc.)
• Video buffers are allocated out of pinned host memory
• Non-OpenCL-aware code writes data to the correct place
• Kernels can directly read/write the buffers via zero copy
17. OPENCL SCALING IMPROVEMENTS
Switch to a single kernel:
• Eliminate the two-kernel approach
• Process blocks of data rather than lines
• Support HandBrake native image packing
• Use LDS to further reduce global memory accesses
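The difference between the two-kernel and single-kernel approaches can be sketched on the CPU. The slides do not say which interpolation filter the HandBrake kernels use, so the Python below assumes plain bilinear interpolation for brevity, and all function names are hypothetical. It only illustrates that the single-pass form produces the same pixels without the intermediate image that a separable two-pass scaler writes and re-reads; block processing and LDS usage are not modelled.

```python
def _coords(n_out, n_in):
    """For each output index, return (i0, i1, weight) along one input axis."""
    table = []
    for o in range(n_out):
        x = (o + 0.5) * n_in / n_out - 0.5      # map pixel centres to input space
        x = min(max(x, 0.0), n_in - 1.0)        # clamp at the image border
        i0 = int(x)
        i1 = min(i0 + 1, n_in - 1)
        table.append((i0, i1, x - i0))
    return table


def scale_two_pass(img, out_w, out_h):
    """Separable scaling: horizontal pass into a temporary image, then vertical pass."""
    cx, cy = _coords(out_w, len(img[0])), _coords(out_h, len(img))
    tmp = [[row[a] * (1 - t) + row[b] * t for a, b, t in cx] for row in img]
    return [[tmp[a][x] * (1 - t) + tmp[b][x] * t for x in range(out_w)]
            for a, b, t in cy]


def scale_single_pass(img, out_w, out_h):
    """Each output pixel is computed directly from its four source pixels."""
    cx, cy = _coords(out_w, len(img[0])), _coords(out_h, len(img))
    out = []
    for y0, y1, ty in cy:
        row = []
        for x0, x1, tx in cx:
            top = img[y0][x0] * (1 - tx) + img[y0][x1] * tx
            bottom = img[y1][x0] * (1 - tx) + img[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


# Tiny 6x4 gradient scaled to 4x3 (same 3:2 ratio as 1080p -> 720p).
image = [[float(x + 10 * y) for x in range(6)] for y in range(4)]
a, b = scale_two_pass(image, 4, 3), scale_single_pass(image, 4, 3)
print(all(abs(p - q) < 1e-9 for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

On a GPU, the temporary image in `scale_two_pass` corresponds to a full round trip through global memory between two kernel launches, which is the cost the single-kernel design removes.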
18. RESULTS
• The single kernel completes quickly
• No extra memory copies are required
• Kernel execution time to scale one frame (1080p -> 720p)*
  ‒ AMD A10-6800K: 2.4 ms
  ‒ AMD HD7750: 1.0 ms
• Application performance on A10-6800K

Feature          Performance (FPS)   Improvement over SW
Software         36.08               0.0
Scaling          39.64               9.9%
UVD              40.53               12.3%
Scaling + UVD    44.95               23.9%

* All times measured on a development system
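The improvement column can be recomputed from the FPS figures as (FPS / software FPS - 1) x 100. The short Python sketch below does that, assuming this is how the column was derived; the variable and function names are hypothetical. With the published FPS values, this formula reproduces 9.9% and 12.3%, while for Scaling + UVD it gives about 24.6% rather than the 23.9% shown, so that row may have been rounded or measured differently. The last loop converts the kernel times into the frame rate the scaling kernel alone could sustain.

```python
baseline_fps = 36.08  # "Software" row
results = {"Scaling": 39.64, "UVD": 40.53, "Scaling + UVD": 44.95}


def improvement_percent(fps, baseline):
    """Relative FPS gain over the baseline, in percent."""
    return (fps / baseline - 1.0) * 100.0


for feature, fps in results.items():
    print(f"{feature}: {improvement_percent(fps, baseline_fps):.1f}%")

# Kernel time per frame (ms) -> frames per second for the scaling kernel alone.
for device, ms in {"A10-6800K": 2.4, "HD7750": 1.0}.items():
    print(f"{device}: about {1000.0 / ms:.0f} frames/s for the kernel alone")
```

Since the kernel alone could process hundreds of frames per second, the roughly 40 FPS application rate is bounded by the rest of the transcode pipeline rather than by scaling.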
19. THANK YOU
Questions