Baby Demuxed’s First Assembly Language Function

Kieran Kunhya <kierank@obe.tv>

Who am I?
• Company specialising in software-based
encoders and decoders for Sport, News
and Channel contribution (B2B)
• Build everything in house:
– Hardware, firmware, software
• Not to be confused with:
• Written assembly in FFmpeg and at work

What is assembly language?
• A low-level (close to the hardware)
programming language
• Intel x86 assembly language is (somewhat)
backwards compatible with the 8008 CPU
from 1972!
• Languages like C are compiled into
Assembly Language
• Mnemonics are human readable versions
of machine code
• Jargon-heavy, will try to keep it to a minimum

Why does this matter?
• Single Instruction Multiple Data (SIMD) or
vector assembly functions are the backbone
of our industry.
• 10-20x speed improvements are a key
reason encoding/decoding is realtime
• This presentation will be followed by
Ronald’s more detailed dive into assembly
• FFmpeg/x264 assembly functions are some
of the most used functions in the cloud
• Fewer and fewer people writing assembly

Assembly Language Concepts
• CPUs don’t operate directly on memory,
data needs to be loaded from memory
into registers, operations performed and
then stored back
• Scalar registers (general purpose
registers) operate on one element at a
time, vector (SIMD) registers operate on
multiple elements with a single
instruction
• Vector instructions are well suited to
2D images: they operate on multiple pixels
at a time
Scalar addition: a + b = a + b (one result)
Vector (SIMD) addition: [a, c, e] + [b, d, f] = [a + b, c + d, e + f]
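The scalar and vector additions pictured above can be modelled in plain C. This is only a sketch: the names `add_scalar`, `add_vector` and `LANES` are hypothetical, and the loop stands in for what real SIMD hardware does to all lanes with a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: eight 16-bit lanes fit in one 16-byte register. */
#define LANES 8

/* Scalar addition: one element per operation. */
int16_t add_scalar(int16_t a, int16_t b)
{
    return (int16_t)(a + b);
}

/* Model of a vector addition: every lane is added independently.
 * Real hardware handles all lanes at once; the loop only models that. */
void add_vector(int16_t dst[LANES], const int16_t x[LANES], const int16_t y[LANES])
{
    assert(dst && x && y);
    for (size_t i = 0; i < LANES; i++)
        dst[i] = (int16_t)(x[i] + y[i]);
}
```

With eight lanes, one modelled vector add replaces eight scalar adds, which is where much of the speed gain comes from.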

Assembly Language Concepts (2)
• Shuffles (permutes) are the most important instructions in multimedia
• Most assembly in multimedia is integer-based
• x86 has a myriad of instructions, many very specialised (e.g. encryption)
• Not all CPUs are capable of every instruction. New instruction sets get added (and
even removed!). SSE2, SSSE3 and AVX2 are the names of some
pshufb example:
input: a b c d e f g h
output: c a f h a 0 d b
The pshufb index chooses the element to output (or zero)
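A scalar C model may help make the pshufb behaviour concrete. The sketch below follows the 16-byte form of the instruction: each index byte selects a source byte with its low 4 bits, or forces a zero when its top bit is set. The function name `pshufb_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 16-byte pshufb: for each output byte, the index byte
 * either selects one of the 16 source bytes (low 4 bits) or, when its top
 * bit is set, forces the output byte to zero. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t idx[16])
{
    uint8_t tmp[16];
    assert(dst && src && idx);
    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];   /* copy last so dst may alias src */
}
```

Using the indices 2, 0, 5, 7, 0, 0x80, 3, 1 on the input a b c d e f g h gives c a f h a 0 d b, matching the slide.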

• Famous zigzag, used to convert a 2D
array to a 1D array, grouping larger
coefficients towards the beginning
• A 4x4 zigzag can be implemented with a
simple shuffle (16-bit coefficients)
Square brackets = “read from memory” (*ptr in C)
Figure: Richardson, The H.264 Advanced Video Compression Standard
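Because the zigzag is a fixed permutation, it can be written as a table lookup, which is exactly the kind of operation a shuffle performs. The sketch below assumes a row-major 4x4 block (index = x + 4 * y) and the conventional 4x4 zigzag order; the names `zigzag4x4` and `zigzag_scan_4x4` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional 4x4 zigzag scan order for a row-major block
 * (index = x + 4 * y). The block layout is this sketch's assumption. */
static const uint8_t zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Reorder a 2D block of 16-bit coefficients into a 1D zigzag sequence. */
void zigzag_scan_4x4(int16_t out[16], const int16_t block[16])
{
    assert(out != block);   /* not done in place in this sketch */
    for (int i = 0; i < 16; i++)
        out[i] = block[zigzag4x4[i]];
}
```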

• Intrinsics, an abstraction of assembly, are also commonly used
– (controversial) Around 15% slower than assembly itself
– Some instructions not representable in intrinsics
• How do you implement functions?
• Calling convention (part of the Application Binary Interface, ABI):
do_something(arg1, arg2, arg3);
RDI = arg1, RSI = arg2, RDX = arg3 (System V x86-64 convention, as used on Linux and macOS).
• Agreed location of function arguments
• But you could define your own for performance improvements

Let’s write an assembly function
• A decoder needs to predict the next block when intra (de)coding
• The decoder’s prediction must match the encoder’s
Figure (definitely not to scale): Richardson, The H.264 Advanced Video Compression Standard

Let’s write an assembly function (2)
• Replicate the left-hand pixel across all pixels in a row
8x8 horizontal prediction (left pixel | predicted row):
A | A A A A A A A A
B | B B B B B B B B
C | C C C C C C C C
D | D D D D D D D D
E | E E E E E E E E
F | F F F F F F F F
G | G G G G G G G G
H | H H H H H H H H
Code annotations:
– Loop counter decrement
– Jump if greater than zero
– 2 arguments, 3 GPRs in use
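For comparison, a plain C reference for 8x8 horizontal prediction might look like the sketch below. It is not the original listing: the name `pred8x8_horizontal_ref` is hypothetical, 16-bit pixels are assumed (as for 10-bit video), and the stride is assumed to be in pixels rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for 8x8 horizontal prediction.
 * src points at the top-left pixel of the block; stride is the distance
 * between rows in pixels (an assumption; real codecs often use bytes).
 * The pixel to the left of each row, row[-1], is copied across that row. */
void pred8x8_horizontal_ref(uint16_t *src, ptrdiff_t stride)
{
    assert(src != NULL);
    for (int y = 0; y < 8; y++) {
        uint16_t *row = src + y * stride;
        const uint16_t left = row[-1];
        for (int x = 0; x < 8; x++)
            row[x] = left;
    }
}
```

The two function arguments here (pointer and stride) plus a loop counter correspond to the "2 arguments, 3 GPRs" noted on the slide.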

xmm registers (16-byte), sse2 instruction set
• movq = move quadword: load 8 bytes (4 words) from memory into the register (one word holds A)
• pshuflw = packed shuffle low words: shuffle the low words and replicate (four words hold A)
• punpcklqdq = unpack and interleave low quadword (with itself) (all eight words hold A)
• Write register data back to memory
• Increment memory location by two lines
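The three register steps can be modelled in C by treating an xmm register as eight 16-bit words. This is a sketch under assumptions: the type `xmm_model` and the helper names are hypothetical, and the pshuflw selector 0xFF (pick word 3 for every low word) assumes the load starts four pixels to the left of the block, so that the left neighbour lands in word 3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a 16-byte xmm register as eight 16-bit words; w[0] is the lowest. */
typedef struct { uint16_t w[8]; } xmm_model;

/* movq xmm, [mem]: load 8 bytes (4 words) into the low half, zero the high half. */
xmm_model movq_load(const uint16_t *mem)
{
    xmm_model r;
    assert(mem != NULL);
    memset(&r, 0, sizeof(r));
    memcpy(r.w, mem, 4 * sizeof(uint16_t));
    return r;
}

/* pshuflw: each 2-bit field of imm picks the source word for one of the
 * four low words; the four high words are copied unchanged. */
xmm_model pshuflw(xmm_model src, uint8_t imm)
{
    xmm_model r = src;
    for (int i = 0; i < 4; i++)
        r.w[i] = src.w[(imm >> (2 * i)) & 3];
    return r;
}

/* punpcklqdq dst, src: keep dst's low quadword, place src's low quadword above it. */
xmm_model punpcklqdq(xmm_model dst, xmm_model src)
{
    xmm_model r = dst;
    memcpy(&r.w[4], &src.w[0], 4 * sizeof(uint16_t));
    return r;
}

/* Build one predicted row: load the 4 pixels left of the block, replicate
 * word 3 (the left neighbour) and duplicate the low half into the high half. */
xmm_model broadcast_left_pixel(const uint16_t *row)
{
    xmm_model r = movq_load(row - 4);
    r = pshuflw(r, 0xFF);
    return punpcklqdq(r, r);
}
```

After `broadcast_left_pixel`, all eight words hold the left neighbour, ready to be written back to memory as one 16-byte row.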

The author has decided to do the operations twice per loop iteration,
known as loop unrolling
RET = exits the function and returns to the calling code
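Loop unrolling can be illustrated in C as well. The sketch below is a hypothetical unrolled variant of 8x8 horizontal prediction, assuming 16-bit pixels and a stride in pixels: it handles two rows per iteration, so the counter starts at 4 rather than 8 and the decrement and branch run half as often.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill one 8-pixel row with the pixel to its left. */
static void fill_row(uint16_t *row)
{
    const uint16_t left = row[-1];
    for (int x = 0; x < 8; x++)
        row[x] = left;
}

/* Hypothetical unrolled variant: two rows per iteration, four iterations. */
void pred8x8_horizontal_unrolled(uint16_t *src, ptrdiff_t stride)
{
    int count = 4;
    assert(src != NULL);
    do {
        fill_row(src);            /* first row of the pair     */
        fill_row(src + stride);   /* second row of the pair    */
        src += 2 * stride;        /* advance by two lines      */
        count--;                  /* loop counter decrement    */
    } while (count > 0);          /* jump if greater than zero */
}
```

The trade-off is slightly larger code in exchange for less loop overhead and more independent work for the CPU to overlap.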

Benchmarks
• Test suite (checkasm) to verify correctness and run benchmarks
• Benchmarks (decicycles):
pred8x8_horizontal_10_c: 35.5
pred8x8_horizontal_10_sse2: 17.5
About 2x faster than C (35.5 / 17.5 ≈ 2.03)!

Other Benchmarks
• Pixel packing function for custom hardware (10-bit bitpacked):
uyvy_to_sdi_c: 3672.0
uyvy_to_sdi_ssse3: 368.0
uyvy_to_sdi_avx: 181.0
uyvy_to_sdi_avx2: 129.0
uyvy_to_sdi_avx512icl: 59.0
62x faster than C (3672.0 / 59.0 ≈ 62.2, avx512icl versus C)!

Conclusion
• Assembly functions are an important part of making encoding
and decoding realtime or cost-effective
• High-schoolers have written many of these functions
• Ability to get very large speed gains
• If this expertise goes away, it goes away forever
• Only talked about x86, but there are new platforms like ARM and
RISC-V with their own assembly language
