This document introduces assembly language and walks through writing an assembly language function. It begins with definitions of assembly language concepts, then steps through an 8x8 horizontal block prediction function in x86 assembly language. Benchmarks show the assembly function is 2x faster than the C implementation. A pixel packing function shows a speedup of up to 62x over C. The conclusion emphasizes the importance of assembly language optimization for real-time encoding and decoding.
2. Who am I?
• Company specialising in software-based
encoders and decoders for Sport, News
and Channel contribution (B2B)
• Build everything in house:
– Hardware, firmware, software
• Not to be confused with:
• Written assembly in FFmpeg and at work
3. What is assembly language?
• A low-level (close to the hardware)
programming language
• Intel x86 assembly language is (somewhat)
backwards compatible with the 8008 CPU
from 1972!
• Languages like C are compiled into
Assembly Language
• Mnemonics are human readable versions
of machine code
• Jargon heavy, will try to keep to minimum
4. Why does this matter?
• Single Instruction Multiple Data (SIMD) or
vector assembly functions are the backbone
of our industry.
• 10-20x speed improvements are a key
reason encoding/decoding is realtime
• This presentation will be followed by
Ronald’s more detailed dive into assembly
• FFmpeg/x264 assembly functions are some
of the most used functions in the cloud
• Fewer and fewer people writing assembly
6. Assembly Language Concepts
• CPUs don’t operate directly on memory,
data needs to be loaded from memory
into registers, operations performed and
then stored back
• Scalar registers (general purpose
registers) operate on one element at a
time, vector (SIMD) registers operate on
multiple elements with a single
instruction
• Vector instructions are well suited to
2D images: they operate on multiple pixels
at a time
Scalar addition:
a + b = a + b
Vector (SIMD) addition:
[a, c, e] + [b, d, f] = [a + b, c + d, e + f]
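The difference between the two additions can be sketched in plain C. This is only a model: the names `add_scalar`, `add_vector` and `LANES` are made up for illustration, and the loop stands in for what a SIMD unit would do in one instruction (on x86 with SSE2, a packed 16-bit add such as `paddw`).

```c
#include <stdint.h>

#define LANES 8  /* eight 16-bit elements fill one 16-byte SIMD register */

/* Scalar addition: one pair of elements per operation. */
static uint16_t add_scalar(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Model of a vector addition: conceptually all LANES sums are produced by
 * a single instruction. The loop here only emulates that behaviour. */
static void add_vector(uint16_t dst[LANES],
                       const uint16_t x[LANES],
                       const uint16_t y[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = (uint16_t)(x[i] + y[i]);
}
```

Calling `add_vector` once does the work of eight `add_scalar` calls, which is where the speed gains for pixel data come from.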
7. Assembly Language Concepts (2)
• Shuffles (permutes) are the most important instructions in multimedia
• Most assembly in multimedia is integer based
• x86 has a myriad of instructions, many very specialist (e.g. encryption)
• Not all CPUs are capable of every instruction. New instruction sets are added (and
even removed!). SSE2, SSSE3 and AVX2 are the names of some
pshufb example:
input:  a b c d e f g h
output: c a f h a 0 d b
The pshufb index chooses the element to output (or zero)
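A scalar model of the 16-byte form of pshufb may make the index rule concrete. The function name `pshufb_model` is hypothetical. The rule it encodes is the one stated above: each index byte selects a source byte, and an index with its top bit set produces zero.

```c
#include <stdint.h>

/* Scalar model of a 16-byte pshufb. For each output byte, the index byte
 * either selects one of the 16 source bytes (low 4 bits of the index) or,
 * if its top bit is set, forces the output byte to zero. */
static void pshufb_model(uint8_t dst[16],
                         const uint8_t src[16],
                         const uint8_t idx[16])
{
    uint8_t tmp[16];  /* temporary so that dst may alias src */

    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

With the input `a b c d e f g h` in the first eight bytes, the indices `2, 0, 5, 7, 0, 0x80, 3, 1` reproduce the output row shown on the slide.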
8. Assembly Language Concepts (3)
• Famous zigzag, used to convert a 2D
array to a 1D array, grouping larger
coefficients towards the beginning
• A 4x4 zigzag can be implemented with a
simple shuffle (16-bit coefficients)
• Square brackets = “read
from memory” (*ptr in C)
Figure source: Richardson, The H.264 Advanced Video Compression Standard
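Why a zigzag reduces to a shuffle can be seen from a scalar sketch: the scan is a fixed permutation, so it can be expressed as a table of indices. The sketch assumes a row-major 4x4 block and uses what I understand to be the standard H.264 frame zigzag order. The names `zigzag4x4` and `zigzag_scan_4x4` are illustrative, not taken from any codebase.

```c
#include <stdint.h>

/* Assumed H.264 4x4 zigzag (frame) scan order: entry i is the raster index
 * (row * 4 + column) of the coefficient that is output i-th. */
static const uint8_t zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* A fixed permutation by index table, which is what a shuffle
 * instruction performs in hardware. */
static void zigzag_scan_4x4(int16_t dst[16], const int16_t src[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[zigzag4x4[i]];
}
```

Because the table never changes, an assembly version can keep it in a register as the shuffle index and permute the coefficients without a loop.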
9. Assembly Language Concepts (4)
• Intrinsics, an abstraction of assembly, are also commonly used
– (controversial) Around 15% slower than assembly itself
– Some instructions not representable in intrinsics
• How do you implement functions?
• Calling convention (a.k.a. Application Binary Interface):
do_something(arg1, arg2, arg3);
RDI = arg1, RSI = arg2, RDX = arg3 (x86-64 System V ABI, as used on Linux and macOS).
• Agreed location of function arguments
• But could define own for performance improvements
10. Let’s write an assembly function
• A decoder needs to predict the next block when intra (de)coding
• The prediction must match the encoder’s prediction
Richardson, The H.264 Advanced
Video Compression Standard
Definitely not to scale
11. Let’s write an assembly function (2)
• Replicate the left-hand pixel across all
pixels in a row
8x8 horizontal prediction (left neighbour | predicted row):
A | A A A A A A A A
B | B B B B B B B B
C | C C C C C C C C
D | D D D D D D D D
E | E E E E E E E E
F | F F F F F F F F
G | G G G G G G G G
H | H H H H H H H H
Loop counter decrement
Jump if greater than zero
2 arguments, 3 GPRs in use
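Before looking at the SIMD steps, it may help to see what the function has to compute, written as plain C. This is a hypothetical reference, not the C implementation that was benchmarked. It assumes 16-bit pixels (the benchmark name suggests 10-bit video), a stride measured in pixels, and an already decoded pixel immediately to the left of each row.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference. src points at the top-left pixel of the
 * 8x8 block, stride is the distance between rows in pixels, and src[-1]
 * on each row is the already decoded left neighbour. */
static void pred8x8_horizontal_ref(uint16_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        uint16_t left = src[-1];      /* left-hand pixel of this row */

        for (int x = 0; x < 8; x++)
            src[x] = left;            /* replicate it across the row */
        src += stride;                /* move down one line */
    }
}
```

The two arguments here correspond to the "2 arguments" annotation: a pointer to the block and the stride.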
12. Let’s write an assembly function (3)
Load 8 bytes (4 words) from memory into a register
movq = move quadword
- - - A - - - -
Shuffle low words and replicate
pshuflw = packed shuffle low words
A A A A - - - -
punpcklqdq = unpack and interleave
low quadword (with itself)
A A A A A A A A
Write register data back to memory
Increment memory location by two lines
xmm registers (16-byte)
sse2 instruction set
13. Let’s write an assembly function (4)
The author has decided to do the operations twice per loop iteration,
known as loop unrolling
RET = exits the function and goes back to the calling code
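The steps from the last two slides can be approximated with SSE2 intrinsics in C. This is a sketch under assumptions, not the assembly shown in the talk: the names `fill_row` and `pred8x8_horizontal_sketch` are made up, the stride is in 16-bit pixels, at least four readable pixels must exist to the left of the block, and an unaligned store is used to avoid assuming alignment. A scalar fallback is included so the sketch also builds on non-x86 targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Fill one row of eight 16-bit pixels with the pixel to its left. */
static void fill_row(uint16_t *row)
{
    /* movq: load 8 bytes (4 words) ending at the left neighbour, so the
     * neighbour lands in word 3 of the register. */
    __m128i v = _mm_loadl_epi64((const __m128i *)(row - 4));
    /* pshuflw: replicate word 3 across the four low words. */
    v = _mm_shufflelo_epi16(v, 0xFF);
    /* punpcklqdq with itself: copy the low quadword into the high one. */
    v = _mm_unpacklo_epi64(v, v);
    /* Write all eight pixels back to memory. */
    _mm_storeu_si128((__m128i *)row, v);
}
#else
/* Portable fallback with the same result, for targets without SSE2. */
static void fill_row(uint16_t *row)
{
    uint16_t left = row[-1];

    for (int x = 0; x < 8; x++)
        row[x] = left;
}
#endif

static void pred8x8_horizontal_sketch(uint16_t *src, ptrdiff_t stride)
{
    /* Four iterations of two rows each: the loop body is unrolled twice. */
    for (int i = 4; i > 0; i--) {
        fill_row(src);
        fill_row(src + stride);
        src += 2 * stride;            /* advance by two lines */
    }
}
```

Counting down from four and testing for greater than zero mirrors the "loop counter decrement" and "jump if greater than zero" annotations. A compiler is free to generate different instructions from these intrinsics than hand-written assembly would use.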
14. Benchmarks
• Test suite (checkasm) to verify correctness and run benchmarks
• Benchmarks (decicycles):
pred8x8_horizontal_10_c: 35.5
pred8x8_horizontal_10_sse2: 17.5
2x faster than C!
15. Other Benchmarks
• Pixel packing function for custom hardware (10-bit bitpacked):
uyvy_to_sdi_c: 3672.0
uyvy_to_sdi_ssse3: 368.0
uyvy_to_sdi_avx: 181.0
uyvy_to_sdi_avx2: 129.0
uyvy_to_sdi_avx512icl: 59.0
62x faster than C!
16. Conclusion
• Assembly functions are an important part of making encoding
and decoding realtime or cost-effective
• High-schoolers have written many of these functions
• Ability to get very large speed gains
• If this expertise goes away, it goes away forever
• Only talked about x86, but there are new platforms like ARM and
RISC-V with their own assembly languages