<Your Logo
Here>
Baby Demuxed’s First Assembly Language Function
Kieran Kunhya <kierank@obe.tv>
Who am I?
• Company specialising in software-based encoders and decoders for Sport, News and Channel contribution (B2B)
• Build everything in house:
– Hardware, firmware, software
• Not to be confused with:
• Have written assembly in FFmpeg and at work
What is assembly language?
• A low-level (close to the hardware) programming language
• Intel x86 assembly language is (somewhat) backwards compatible to the 8008 CPU from 1972!
• Languages like C are compiled into assembly language
• Mnemonics are human-readable versions of machine code
• Jargon-heavy; will try to keep it to a minimum
Why does this matter?
• Single Instruction Multiple Data (SIMD) or vector assembly functions are the backbone of our industry
• 10-20x speed improvements are a key reason encoding/decoding is realtime
• This presentation will be followed by Ronald’s more detailed dive into assembly
• FFmpeg/x264 assembly functions are some of the most used functions in the cloud
• Fewer and fewer people are writing assembly
Assembly Language Concepts
• CPUs don’t operate directly on memory: data needs to be loaded from memory into registers, operated on, and then stored back
• Scalar registers (general purpose registers) operate on one element at a time; vector (SIMD) registers operate on multiple elements with a single instruction
• Vector instructions are very well suited to 2D images: they operate on multiple pixels at a time
Scalar addition: a + b = a + b
Vector (SIMD) addition: (a, c, e) + (b, d, f) = (a + b, c + d, e + f)
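The difference between the scalar and vector additions pictured above can be modelled in portable C. This is only a sketch: the `vec8x16` type and the function names are hypothetical, and the "vector" here is an ordinary struct that imitates what a single packed 16-bit add (such as SSE2's `paddw`) would do to eight lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8

/* Scalar model: one element is added per loop step. */
void add_scalar(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)(a[i] + b[i]);   /* wraps modulo 65536 */
}

/* Hypothetical stand-in for a 16-byte vector register holding 8 words. */
typedef struct { uint16_t lane[LANES]; } vec8x16;

/* Vector model: conceptually ONE instruction adds all eight lanes.
 * The loop only exists because plain C has no vector registers. */
vec8x16 add_vec(vec8x16 a, vec8x16 b)
{
    vec8x16 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

On real hardware the body of `add_vec` would be a single instruction, which is where the speed-up for pixel data comes from.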
Assembly Language Concepts (2)
• Shuffles (permutes) are the most important instructions in multimedia
• Most assembly in multimedia is integer-based
• x86 has a myriad of instructions, many very specialised (e.g. encryption)
• Not all CPUs are capable of every instruction. New instruction sets are added (and even removed!). SSE2, SSSE3 and AVX2 are the names of some
pshufb example:
input: a b c d e f g h
output: c a f h a 0 d b
The pshufb index chooses the element to output (or zero)
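To make the index behaviour concrete, here is a minimal C model of a 16-byte pshufb. It is a sketch rather than production code: the function name `emu_pshufb` is hypothetical, and unlike the real instruction (which overwrites its destination register) it writes to a separate output array that must not alias the source.

```c
#include <stdint.h>

/* Model of pshufb on a 16-byte register.
 * For every output byte i:
 *   - if the top bit of idx[i] is set, the output byte is zero;
 *   - otherwise the low four bits of idx[i] pick a source byte. */
void emu_pshufb(const uint8_t src[16], const uint8_t idx[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}
```

With the source bytes `a b c d e f g h` and the indices `2, 0, 5, 7, 0, 0x80, 3, 1`, the first eight output bytes are `c a f h a 0 d b`, matching the slide.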
Assembly Language Concepts (3)
• Famous zigzag, used to convert a 2D array to a 1D array, grouping larger coefficients towards the beginning
• A 4x4 zigzag can be implemented with a simple shuffle (16-bit coefficients)
Square brackets = “read from memory” (*ptr in C)
Figure source: Richardson, The H.264 Advanced Video Compression Standard
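As a rough illustration of why a zigzag is "just a shuffle", the scan can be written in C as a table-driven permutation. The table below is the standard 4x4 zigzag order for a block stored row by row; the function and table names are hypothetical, and an assembly version would do the same reordering with shuffle instructions instead of a loop.

```c
#include <stdint.h>

/* Position i of the output takes the coefficient at raster index
 * zigzag4x4[i] of the row-major 4x4 block. */
static const uint8_t zigzag4x4[16] = {
     0,  1,  4,  8,
     5,  2,  3,  6,
     9, 12, 13, 10,
     7, 11, 14, 15
};

/* Reorder a 4x4 block of 16-bit coefficients into zigzag order. */
void zigzag_scan_4x4(const int16_t block[16], int16_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = block[zigzag4x4[i]];
}
```

Because the reordering is a fixed permutation, the index table plays the same role as the index operand of a shuffle.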
Assembly Language Concepts (4)
• Intrinsics, an abstraction of assembly, are also commonly used
– (controversial) Around 15% slower than assembly itself
– Some instructions are not representable in intrinsics
• How do you implement functions?
• Calling convention (a.k.a. Application Binary Interface):
do_something(arg1, arg2, arg3);
RDI = arg1, RSI = arg2, RDX = arg3 (System V x86-64 convention, as used on Linux and macOS)
• Agreed location of function arguments
• But you could define your own for performance improvements
Let’s write an assembly function
• A decoder needs to predict the next block when intra (de)coding
• The prediction must match that of the encoder
Figure source: Richardson, The H.264 Advanced Video Compression Standard
Definitely not to scale
Let’s write an assembly function (2)
• Replicate the left-hand pixel across all pixels in a row
8x8 horizontal prediction (left neighbour | predicted row):
A | A A A A A A A A
B | B B B B B B B B
C | C C C C C C C C
D | D D D D D D D D
E | E E E E E E E E
F | F F F F F F F F
G | G G G G G G G G
H | H H H H H H H H
Code annotations:
– 2 arguments, 3 GPRs in use
– Loop counter decrement
– Jump if greater than zero
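For comparison with the assembly, a plain C version of 8x8 horizontal prediction might look like the sketch below. It assumes 16-bit pixels (the benchmarked function has a `_10` suffix, suggesting 10-bit video), a stride measured in pixels rather than bytes, and a hypothetical function name; it is not the FFmpeg C implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* src points at the top-left pixel of the 8x8 block to fill.
 * stride is the distance between rows, in pixels (an assumption).
 * The pixel immediately left of each row must already be decoded. */
void pred8x8_horizontal_ref(uint16_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        uint16_t *row = src + y * stride;
        uint16_t left = row[-1];        /* left-hand neighbour */
        for (int x = 0; x < 8; x++)
            row[x] = left;              /* replicate across the row */
    }
}
```

The two arguments here correspond to the "2 arguments" noted for the assembly function: a pointer to the block and the row stride.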
Let’s write an assembly function (3)
xmm registers (16-byte), sse2 instruction set
1. Load 8 bytes (4 words) from memory into a register. movq = move quadword
- - - - A - - -
2. Shuffle low words and replicate. pshuflw = packed shuffle low words
- - - - A A A A
3. punpcklqdq = unpack and interleave low quadword (with itself)
A A A A A A A A
4. Write register data back to memory
5. Increment memory location by two lines
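The movq, pshuflw and punpcklqdq steps above can be imitated in portable C by treating an xmm register as eight 16-bit words. This is a sketch under stated assumptions: the `xmm_t` type and `emu_*` names are hypothetical, and both the load offset (the four pixels to the left of the row) and the shuffle immediate `0xff` (select word 3 for every low word) are assumptions, since the original listing was an image.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 16-byte xmm register as eight 16-bit words. */
typedef struct { uint16_t w[8]; } xmm_t;

/* movq: load 8 bytes (4 words) into the low half, zero the high half. */
xmm_t emu_movq(const uint16_t *mem)
{
    xmm_t r;
    memset(&r, 0, sizeof r);
    memcpy(r.w, mem, 4 * sizeof(uint16_t));
    return r;
}

/* pshuflw: each 2-bit field of imm picks the source word for one of
 * the four low words; the four high words are copied unchanged. */
xmm_t emu_pshuflw(xmm_t src, uint8_t imm)
{
    xmm_t r = src;
    for (int i = 0; i < 4; i++)
        r.w[i] = src.w[(imm >> (2 * i)) & 3];
    return r;
}

/* punpcklqdq dst, src: keep dst's low quadword and place src's low
 * quadword in the high half. */
xmm_t emu_punpcklqdq(xmm_t dst, xmm_t src)
{
    xmm_t r = dst;
    memcpy(&r.w[4], &src.w[0], 4 * sizeof(uint16_t));
    return r;
}

/* Fill one 8-pixel row with the pixel immediately to its left. */
void predict_row(uint16_t *row)
{
    xmm_t r = emu_movq(row - 4);    /* word 3 is the left neighbour    */
    r = emu_pshuflw(r, 0xff);       /* replicate word 3 in low 4 words */
    r = emu_punpcklqdq(r, r);       /* copy low quadword to high half  */
    memcpy(row, r.w, sizeof r.w);   /* write 16 bytes back to memory   */
}
```

Each helper stands for exactly one instruction, so `predict_row` reads like the per-row body of the assembly loop.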
Let’s write an assembly function (4)
The author has decided to do the operations twice per loop iteration, known as loop unrolling
RET = exits the function and goes back to the calling code
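Loop unrolling can be sketched in C as follows, assuming the same 8x8 horizontal prediction with 16-bit pixels and a stride in pixels. The names are hypothetical; the point is only the shape of the loop: two rows per iteration, four iterations, a counter that is decremented and tested against zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill one 8-pixel row with its left-hand neighbour. */
void fill_row(uint16_t *row)
{
    uint16_t left = row[-1];
    for (int x = 0; x < 8; x++)
        row[x] = left;
}

/* Unrolled by two: each iteration handles two rows, so the loop
 * overhead (decrement and jump) is paid 4 times instead of 8. */
void pred8x8_horizontal_unrolled(uint16_t *src, ptrdiff_t stride)
{
    int count = 4;                  /* 8 rows, two per iteration      */
    do {
        fill_row(src);              /* row n                          */
        fill_row(src + stride);     /* row n + 1                      */
        src += 2 * stride;          /* advance by two lines           */
    } while (--count > 0);          /* decrement, loop while positive */
}
```

Returning from the function at the end corresponds to the RET instruction in the assembly.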
Benchmarks
• Test suite (checkasm) to verify correctness and run benchmarks
• Benchmarks (decicycles):
pred8x8_horizontal_10_c: 35.5
pred8x8_horizontal_10_sse2: 17.5
2x faster than C!
Other Benchmarks
• Pixel packing function for custom hardware (10-bit bitpacked):
uyvy_to_sdi_c: 3672.0
uyvy_to_sdi_ssse3: 368.0
uyvy_to_sdi_avx: 181.0
uyvy_to_sdi_avx2: 129.0
uyvy_to_sdi_avx512icl: 59.0
62x faster than C!
Conclusion
• Assembly functions are an important part of making encoding and decoding realtime or cost-effective
• High-schoolers have written many of these functions
• Ability to get very large speed gains
• If this expertise goes away, it goes away forever
• Only talked about x86, but there are new platforms like ARM and RISC-V with their own assembly language

