<Your Logo
Here>
Baby Demuxed’s First Assembly
Language Function
Kieran Kunhya <kierank@obe.tv>
Who am I?
• Company specialising in software-based
encoders and decoders for Sport, News
and Channel contribution (B2B)
• Build everything in house:
– Hardware, firmware, software
• Not to be confused with:
• Have written assembly in FFmpeg and at work
What is assembly language?
• A low-level (close to the hardware)
programming language
• Intel x86 assembly language is (somewhat)
backwards compatible with the 8008 CPU
from 1972!
• Languages like C are compiled into
assembly language
• Mnemonics are human-readable versions
of machine code
• Jargon-heavy; will try to keep it to a minimum
Why does this matter?
• Single Instruction Multiple Data (SIMD) or
vector assembly functions are the backbone
of our industry.
• 10-20x speed improvements are a key
reason encoding/decoding is realtime
• This presentation will be followed by
Ronald’s more detailed dive into assembly
• FFmpeg/x264 assembly functions are some
of the most used functions in the cloud
• Fewer and fewer people writing assembly
Assembly Language Concepts
• CPUs don’t operate directly on memory:
data is loaded from memory into registers,
operations are performed, and the results
are then stored back
• Scalar registers (general purpose
registers) operate on one element at a
time; vector (SIMD) registers operate on
multiple elements with a single
instruction
• Vector instructions are well suited to
2D images: they operate on multiple pixels
at a time
[Figure: scalar addition computes a + b; vector (SIMD) addition computes a + b, c + d and e + f with a single instruction]
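As a rough illustration of the difference, here is a portable C sketch (not real SIMD code). `add_scalar` handles one element per operation, while `add_vector8` models what a single vector add does to every lane of a register. The function names and the choice of eight 16-bit lanes are assumptions made for this example.

```c
#include <stdint.h>

#define LANES 8 /* assumed: eight 16-bit lanes, as in a 128-bit register */

/* Scalar addition: one element per operation. */
static uint16_t add_scalar(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Model of a vector add: every lane is added independently.
 * Real SIMD hardware handles all lanes with a single instruction;
 * the loop here only emulates that behaviour. */
static void add_vector8(uint16_t dst[LANES], const uint16_t a[LANES],
                        const uint16_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);
}
```

In the emulation the work is still done lane by lane; the point of the vector version on real hardware is that the eight additions cost roughly the same as one.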
Assembly Language Concepts (2)
• Shuffles (permutes) are the most important instructions in multimedia
• Most assembly in multimedia is integer based
• x86 has a myriad of instructions, many very specialised (e.g. encryption)
• Not all CPUs are capable of every instruction. New instruction sets are added (and
even removed!); SSE2, SSSE3 and AVX2 are the names of some
[Figure: pshufb applied to the input a b c d e f g h gives c a f h a 0 d b; each pshufb index chooses the element to output (or zero)]
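The behaviour of the shuffle can be sketched in plain C. This is an emulation of the 128-bit `pshufb` for illustration only: for each output byte, the low four bits of the index pick a source byte, and an index with its top bit set produces zero. The name `pshufb_model` is made up for this example.

```c
#include <stdint.h>

/* Emulates the 128-bit pshufb: for each output byte, the low 4 bits of
 * the index select a source byte; if the index's top bit is set, the
 * output byte is zero instead. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t idx[16])
{
    uint8_t tmp[16]; /* temporary so dst may alias src */
    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

With a source starting a b c d e f g h and indices 2, 0, 5, 7, 0, 0x80, 3, 1, the first eight output bytes are c a f h a 0 d b, as in the slide's diagram.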
Assembly Language Concepts (3)
Figure source: Richardson, The H.264 Advanced Video Compression Standard
• Famous Zigzag, used to convert a 2D
array to a 1D array, grouping larger
coefficients towards the beginning
• A 4x4 zigzag can be implemented with a
simple shuffle (16-bit coefficients)
Square brackets = “read
from memory” (*ptr in C)
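A 4x4 zigzag is a fixed permutation of 16 coefficients, which is why a shuffle can implement it. The sketch below is a scalar C version using a lookup table; it assumes the block is stored row-major and uses the standard 4x4 zigzag order. The names `zigzag4x4` and `zigzag_scan_4x4` are chosen for this example.

```c
#include <stdint.h>

/* Row-major index of each coefficient, listed in 4x4 zigzag scan order. */
static const uint8_t zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Permute a row-major 4x4 block of 16-bit coefficients into a 1D array
 * in zigzag order. A SIMD version would do the same permutation with
 * shuffle instructions rather than a table-driven loop. */
static void zigzag_scan_4x4(int16_t out[16], const int16_t block[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = block[zigzag4x4[i]];
}
```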
Assembly Language Concepts (4)
• Intrinsics, an abstraction of assembly, are also commonly used
– (Controversial) around 15% slower than assembly itself
– Some instructions are not representable in intrinsics
• How do you implement functions?
• Calling convention (part of the Application Binary Interface):
do_something(arg1, arg2, arg3);
RDI = arg1, RSI = arg2, RDX = arg3 (System V x86-64, e.g. Linux and macOS)
• An agreed location for function arguments
• But you could define your own for performance improvements
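To make the register mapping concrete, here is a small C sketch. The body of `do_something` is invented for illustration; only the argument-to-register mapping comes from the slide. It applies to the System V x86-64 convention used on Linux and macOS, and only when the function is actually called rather than inlined by the compiler.

```c
#include <stdint.h>

/* Under the System V x86-64 calling convention, the first three
 * integer or pointer arguments arrive in RDI, RSI and RDX, and the
 * integer return value is passed back in RAX. */
static int64_t do_something(int64_t arg1, int64_t arg2, int64_t arg3)
{
    /* On entry: arg1 is in RDI, arg2 in RSI, arg3 in RDX.
     * The arithmetic is a placeholder, not from the slides. */
    return arg1 + arg2 * arg3;
}
```

An assembly function that follows the same convention can be called from C like any other function, which is how hand-written routines are slotted into a C codebase.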
Let’s write an assembly function
• A decoder needs to predict the next block when intra (de)coding
• The prediction must match the encoder’s
Figure source: Richardson, The H.264 Advanced Video Compression Standard (definitely not to scale)
Let’s write an assembly function (2)
• Replicate the left-hand pixel across all
pixels in a row
8x8 horizontal prediction (left-hand pixel | predicted row):
A | A A A A A A A A
B | B B B B B B B B
C | C C C C C C C C
D | D D D D D D D D
E | E E E E E E E E
F | F F F F F F F F
G | G G G G G G G G
H | H H H H H H H H
Code annotations:
Loop counter decrement
Jump if greater than zero
2 arguments, 3 GPRs in use
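Before looking at the assembly, it may help to see what the function computes in plain C. This is a hypothetical reference version, not the FFmpeg source: it assumes 16-bit pixels (as used for 10-bit video), that `src` points at the top-left pixel of the block, and that `stride` is the distance between rows measured in pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference: fill each row of an 8x8 block with the
 * pixel immediately to its left. `src` points at the block's top-left
 * pixel; `stride` is the row pitch in pixels (an assumption here). */
static void pred8x8_horizontal_ref(uint16_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        uint16_t *row = src + y * stride;
        uint16_t left = row[-1];    /* the already decoded neighbour */
        for (int x = 0; x < 8; x++)
            row[x] = left;
    }
}
```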
Let’s write an assembly function (3)
Register contents after each step:
- - - - A - - -  Load 8 bytes (4 words) from memory into the register (movq = move quadword)
- - - - A A A A  Shuffle the low words and replicate (pshuflw = packed shuffle low words)
A A A A A A A A  punpcklqdq = unpack and interleave low quadword (with itself)
Write register data back to memory
Increment memory location by two lines
xmm registers (16-byte)
sse2 instruction set
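The three instructions can be modelled in portable C to check the register contents at each step. This is an emulation for illustration, not the real assembly: `xmm_model` and the helper names are invented, and the shuffle immediate `0xff` (select word 3 for all four low words) is an assumption that matches the diagram.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t w[8]; } xmm_model; /* w[0] is the lowest word */

/* movq: load 8 bytes (4 words) into the low half, zero the high half. */
static xmm_model movq_load(const uint16_t *mem)
{
    xmm_model r;
    memset(&r, 0, sizeof(r));
    memcpy(r.w, mem, 4 * sizeof(uint16_t));
    return r;
}

/* pshuflw: each 2-bit field of imm picks a source word for one of the
 * low four words; the high four words are copied unchanged. */
static xmm_model pshuflw(xmm_model s, unsigned imm)
{
    xmm_model r = s;
    for (int i = 0; i < 4; i++)
        r.w[i] = s.w[(imm >> (2 * i)) & 3];
    return r;
}

/* punpcklqdq: low quadword of a, followed by low quadword of b. */
static xmm_model punpcklqdq(xmm_model a, xmm_model b)
{
    xmm_model r;
    memcpy(&r.w[0], &a.w[0], 4 * sizeof(uint16_t));
    memcpy(&r.w[4], &b.w[0], 4 * sizeof(uint16_t));
    return r;
}

/* One row: broadcast the pixel to the left of `row` across 8 pixels. */
static void predict_row(uint16_t *row)
{
    xmm_model m = movq_load(row - 4); /* last word loaded is row[-1] */
    m = pshuflw(m, 0xff);             /* replicate word 3 into words 0..3 */
    m = punpcklqdq(m, m);             /* copy the low half into the high half */
    memcpy(row, m.w, sizeof(m.w));    /* store 16 bytes back to memory */
}
```

Loading the four words that end at `row[-1]` puts the wanted pixel in word 3 of the register, which is why the shuffle selects that word.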
Let’s write an assembly function (4)
The author has chosen to do the operations twice per loop iteration,
which is known as loop unrolling
RET = exits the function and goes back to the calling code
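Loop unrolling can be shown in C as well. The sketch below is a hypothetical scalar version that processes two rows per iteration, so the loop runs four times instead of eight and the counter and branch overhead is halved. The function names are invented, and `stride` is assumed to be in pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill one 8-pixel row with the pixel to its left. */
static void fill_row(uint16_t *row)
{
    uint16_t left = row[-1];
    for (int x = 0; x < 8; x++)
        row[x] = left;
}

/* Unrolled by two: 4 iterations, each handling two rows, instead of
 * 8 iterations handling one row. `stride` is in pixels (assumption). */
static void pred8x8_horizontal_unrolled(uint16_t *src, ptrdiff_t stride)
{
    for (int i = 4; i > 0; i--) {   /* loop counter counts down to zero */
        fill_row(src);              /* first row of the pair */
        fill_row(src + stride);     /* second row of the pair */
        src += 2 * stride;          /* advance by two lines */
    }
}
```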
Benchmarks
• Test suite (checkasm) to verify correctness and run benchmarks
• Benchmarks (decicycles):
pred8x8_horizontal_10_c: 35.5
pred8x8_horizontal_10_sse2: 17.5
2x faster than C!
Other Benchmarks
• Pixel packing function for custom hardware (10-bit bitpacked):
uyvy_to_sdi_c: 3672.0
uyvy_to_sdi_ssse3: 368.0
uyvy_to_sdi_avx: 181.0
uyvy_to_sdi_avx2: 129.0
uyvy_to_sdi_avx512icl: 59.0
62x faster than C!
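The quoted speedups are simply the ratio of the two timings. A minimal C sketch of that arithmetic, with `speedup` as a name chosen for this example:

```c
/* Speedup = baseline time / optimised time (both in the same units). */
static double speedup(double baseline, double optimised)
{
    return baseline / optimised;
}
```

With the figures above, `speedup(35.5, 17.5)` is about 2.03 and `speedup(3672.0, 59.0)` is about 62.2, which round to the 2x and 62x quoted on the slides.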
Conclusion
• Assembly functions are an important part of making encoding
and decoding realtime or cost-effective
• High-schoolers have written many of these functions
• Ability to get very large speed gains
• If this expertise goes away, it goes away forever
• Only talked about x86, but there are new platforms like ARM and
RISC-V with their own assembly language
