SlideShare a Scribd company logo
AVX-512 assembly in FFmpeg
Kieran Kunhya <kieran@kunhya.com>
What is Assembly Language?
• A low-level (close to the hardware)
programming language
• Intel x86 assembly language is
(somewhat) backwards compatible to
the 8008 CPU from 1972!
• Languages like C are compiled into
Assembly Language
• Mnemonics are human readable
versions of machine code
What is AVX-512?
• New SIMD (Single Instruction Multiple Data) Instruction Set for Intel since
2017, and more recently AMD CPUs
• 512-bit register sizes (zmm)
• Many new instructions
• Opmasks
• Comparison instructions
• Many things not so interesting in multimedia (e.g cryptography, neural
networks)
• Lots of fancy words, but high schoolers have written assembly in FFmpeg!
Why is this relevant now?
• AVX-512 has been around since 2017. Why is it relevant
now?
• Old Skylake (server) systems had large performance
throttling when using larger registers. AVX-512 remained
unused in multimedia
• Could still use new instructions with smaller registers
• Can be beneficial in some cases (see later)
• Ice Lake (10th and 11th Gen Intel) first to have no throttling
How to get started?
• Intel have removed AVX-512 support
in consumer processors (from 12th
Gen) 
• Available in AMD Zen 4
• Still exists in Intel Server CPUs (Xeon).
Available from cloud providers
• Easiest way is to buy an Intel NUC
(11th Gen)
Existing work in multimedia using AVX-512
• dav1d project added AVX-512 support
• AV1 decoding, particularly beneficial owing to large block sizes
• 10-20% faster *overall* decoding
• Classic FFmpeg/x264 approach to assembly
• No intrinsics, nor inline assembly
• Detect CPU capabilities and set function pointers to appropriate
functions
• Messy Venn Diagram of capabilities, but two in practice we care
about
CPU_FLAGS in FFmpeg
int av_get_cpu_flags(void);
#define AV_CPU_FLAG_AVX512 0x100000 ///<
AVX-512 functions: requires OS support even if
YMM/ZMM registers aren't used
#define AV_CPU_FLAG_AVX512ICL 0x200000 ///<
F/CD/BW/DQ/VL/VNNI/IFMA/VBMI/VBMI2/VPOPCNTDQ/BIT
ALG/GFNI/VAES/VPCLMULQDQ
Lanes
• Older AVX ymm registers are split
into lanes
• Instructions (mainly) operate in lanes
• Can be tricky to move data between
lanes
• Limitation on AVX2 code
16-byte Lane 1 16-byte Lane 0
256-bit ymm Register (old)
K-mask registers
• A new set of registers k0-k7 that allow the destination
register to remain unchanged or set to zero
• For example, addition, but only some values:
• paddw zmm1{k1}, zmm2, zmm3
• kmovX instructions to manipulate k-mask registers
vpermb
• Byte shuffles (permute) are the one
of the most important instructions
in multimedia
• Similar to existing pshufb
instruction but cross-lane
• Need to use k-masks with vpermb
to zero out a byte
• e.g for zigzag scan
• Also, VPERMT2B, permute from
two registers
A B C D E F G H
E A 0 C C B H D
Variable shifts
• vpsrlvw/vpsrlvd/vpsrlvq – variable right shift (logical)
• vpsllvw/vpsllvd/vpsllvq – variable left shift (logical)
• (letter soup, especially arithmetic shifts!)
• Historically had to use multiple instructions and various trickery
to achieve right shifts. Had many limitations.
• Faster than multiply for left shifts
vpternlogd
• The kitchen sink of instructions!
• Allows a programmable truth table to be
implemented per bit input in each register
• Can replace up to eight instructions!
• e.g vpternlogd zmm0, zmm1, zmm2,
0xca
• zmm0 = zmm0 ? zmm1 : zmm2; (for each bit)
• http://0x80.pl/articles/avx512-ternary-
functions.html
Example: v210enc (1)
• .loop:
• movu ym1, [yq + 2*widthq]
• vinserti32x4 m1, [uq + 1*widthq], 2
• vinserti32x4 m1, [vq + 1*widthq], 3
• vpermb m1, m2, m1 ; uyvx yuyx vyux yvyx
• CLIPUB m1, m4, m5
•
pmaddubsw m0, m1, m3 ; shift high and low samples of each dword and mask out other bits
• pslld m1, 4 ; shift center sample of each dword
• vpternlogd m0, m1, m6, 0xd8 ; C?B:A ; merge and mask out bad bits from B
•
movu [dstq], m0
• […]
• jl .loop
• Takes 3x 8-bit samples, extends to 10-bit and packs into 32-bits
Example: v210enc (2)
• Benchmarks (decicyles)
• Skylake
• v210_planar_pack_8_c: 2373.5
• […]
• v210_planar_pack_8_avx2: 194.0
• v210_planar_pack_8_avx512: 174.0
• vpternlogd on a shorter ymm register
• Ice Lake
• v210_planar_pack_8_c: 2743.6
• v210_planar_pack_8_avx2: 246.6
• v210_planar_pack_8_avx512: 238.6
• v210_planar_pack_8_avx512icl: 122.1
• vpermb and zmm nearly twice as fast as avx512 ymm, and more than twenty times
faster than C!
What AVX-512 code next?
• Anything involving line/frame based processing
• e.g filters, scalers etc.
• Comparisons
• Vpternlogd in many places (e.g 3-way boolean)
• Also change variable shifts in many places
• Intel manual is very verbose (useful in some cases)
• https://www.officedaytime.com/simd512e/
Any questions?

More Related Content

What's hot

5G Cellular D2D RDMA Clusters
5G Cellular D2D RDMA Clusters5G Cellular D2D RDMA Clusters
5G Cellular D2D RDMA Clusters
Yitzhak Bar-Geva
 
Intermediate: 5G Applications Architecture - A look at Application Functions ...
Intermediate: 5G Applications Architecture - A look at Application Functions ...Intermediate: 5G Applications Architecture - A look at Application Functions ...
Intermediate: 5G Applications Architecture - A look at Application Functions ...
3G4G
 
Nb iot presentation
Nb iot presentationNb iot presentation
Nb iot presentation
manoj pradhan
 
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Vietnam Open Infrastructure User Group
 
Propelling 5G forward: a closer look at 3GPP Release-16
Propelling 5G forward: a closer look at 3GPP Release-16Propelling 5G forward: a closer look at 3GPP Release-16
Propelling 5G forward: a closer look at 3GPP Release-16
Qualcomm Research
 
NVMe over Fabrics Demystified
NVMe over Fabrics Demystified NVMe over Fabrics Demystified
NVMe over Fabrics Demystified
Brad Eckert
 
3GPP Packet Core Towards 5G Communication Systems
3GPP Packet Core Towards 5G Communication Systems3GPP Packet Core Towards 5G Communication Systems
3GPP Packet Core Towards 5G Communication Systems
Ofinno
 
Beginners: 5G Terminology (Updated - Feb 2019)
Beginners: 5G Terminology (Updated - Feb 2019)Beginners: 5G Terminology (Updated - Feb 2019)
Beginners: 5G Terminology (Updated - Feb 2019)
3G4G
 
Advanced: 5G NR RRC Inactive State
Advanced: 5G NR RRC Inactive StateAdvanced: 5G NR RRC Inactive State
Advanced: 5G NR RRC Inactive State
3G4G
 
Next Generation Network Automation
Next Generation Network AutomationNext Generation Network Automation
Next Generation Network Automation
Laurent Ciavaglia
 
5G Core Network - ZTE 5g Cloude ServCore
5G Core Network - ZTE 5g Cloude ServCore5G Core Network - ZTE 5g Cloude ServCore
5G Core Network - ZTE 5g Cloude ServCore
ITU
 
Overview 5G NR Radio Protocols by Intel
Overview 5G NR Radio Protocols by Intel Overview 5G NR Radio Protocols by Intel
Overview 5G NR Radio Protocols by Intel
Eiko Seidel
 
Evolution of M2M Communication
Evolution of M2M CommunicationEvolution of M2M Communication
Evolution of M2M Communication
Indaka Raigama
 
Introducing ultra-precise time for server-hosted applications
Introducing ultra-precise time for server-hosted applicationsIntroducing ultra-precise time for server-hosted applications
Introducing ultra-precise time for server-hosted applications
ADVA
 
Advanced: Control and User Plane Separation of EPC nodes (CUPS)
Advanced: Control and User Plane Separation of EPC nodes (CUPS)Advanced: Control and User Plane Separation of EPC nodes (CUPS)
Advanced: Control and User Plane Separation of EPC nodes (CUPS)
3G4G
 
5G Technology Tutorial
5G Technology Tutorial5G Technology Tutorial
5G Technology Tutorial
APNIC
 
ICN Akamai's Backbone
ICN Akamai's BackboneICN Akamai's Backbone
ICN Akamai's Backbone
APNIC
 
5 g core overview
5 g core overview5 g core overview
5 g core overview
Hemraj Kumar
 
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
3G4G
 

What's hot (20)

5G Cellular D2D RDMA Clusters
5G Cellular D2D RDMA Clusters5G Cellular D2D RDMA Clusters
5G Cellular D2D RDMA Clusters
 
Intermediate: 5G Applications Architecture - A look at Application Functions ...
Intermediate: 5G Applications Architecture - A look at Application Functions ...Intermediate: 5G Applications Architecture - A look at Application Functions ...
Intermediate: 5G Applications Architecture - A look at Application Functions ...
 
Nb iot presentation
Nb iot presentationNb iot presentation
Nb iot presentation
 
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
 
Propelling 5G forward: a closer look at 3GPP Release-16
Propelling 5G forward: a closer look at 3GPP Release-16Propelling 5G forward: a closer look at 3GPP Release-16
Propelling 5G forward: a closer look at 3GPP Release-16
 
NVMe over Fabrics Demystified
NVMe over Fabrics Demystified NVMe over Fabrics Demystified
NVMe over Fabrics Demystified
 
3GPP Packet Core Towards 5G Communication Systems
3GPP Packet Core Towards 5G Communication Systems3GPP Packet Core Towards 5G Communication Systems
3GPP Packet Core Towards 5G Communication Systems
 
Beginners: 5G Terminology (Updated - Feb 2019)
Beginners: 5G Terminology (Updated - Feb 2019)Beginners: 5G Terminology (Updated - Feb 2019)
Beginners: 5G Terminology (Updated - Feb 2019)
 
Advanced: 5G NR RRC Inactive State
Advanced: 5G NR RRC Inactive StateAdvanced: 5G NR RRC Inactive State
Advanced: 5G NR RRC Inactive State
 
Next Generation Network Automation
Next Generation Network AutomationNext Generation Network Automation
Next Generation Network Automation
 
5G Core Network - ZTE 5g Cloude ServCore
5G Core Network - ZTE 5g Cloude ServCore5G Core Network - ZTE 5g Cloude ServCore
5G Core Network - ZTE 5g Cloude ServCore
 
Overview 5G NR Radio Protocols by Intel
Overview 5G NR Radio Protocols by Intel Overview 5G NR Radio Protocols by Intel
Overview 5G NR Radio Protocols by Intel
 
Evolution of M2M Communication
Evolution of M2M CommunicationEvolution of M2M Communication
Evolution of M2M Communication
 
Introducing ultra-precise time for server-hosted applications
Introducing ultra-precise time for server-hosted applicationsIntroducing ultra-precise time for server-hosted applications
Introducing ultra-precise time for server-hosted applications
 
Advanced: Control and User Plane Separation of EPC nodes (CUPS)
Advanced: Control and User Plane Separation of EPC nodes (CUPS)Advanced: Control and User Plane Separation of EPC nodes (CUPS)
Advanced: Control and User Plane Separation of EPC nodes (CUPS)
 
5G Technology Tutorial
5G Technology Tutorial5G Technology Tutorial
5G Technology Tutorial
 
ICN Akamai's Backbone
ICN Akamai's BackboneICN Akamai's Backbone
ICN Akamai's Backbone
 
5 g core overview
5 g core overview5 g core overview
5 g core overview
 
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
Beginners: Different Types of RAN Architectures - Distributed, Centralized & ...
 
OpenStack Heat
OpenStack HeatOpenStack Heat
OpenStack Heat
 

Similar to AVX512 assembly language in FFmpeg

WAN - trends and use cases
WAN - trends and use casesWAN - trends and use cases
WAN - trends and use cases
MarketingArrowECS_CZ
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
RISC-V International
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
DefconRussia
 
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
Paris Open Source Summit
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPeter Griffin
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
Aleksei Voitylov
 
Advanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISAAdvanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISA
Ganesan Narayanasamy
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
eurobsdcon
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
eurobsdcon
 
Something about SSE and beyond
Something about SSE and beyondSomething about SSE and beyond
Something about SSE and beyond
Lihang Li
 
Haskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHCHaskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHC
dterei
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
rinnocente
 
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) HypervisorAsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
Dave Voutila
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
Chiou-Nan Chen
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
Anh Dung NGUYEN
 
Processors selection
Processors selectionProcessors selection
Processors selection
Pradeep Shankhwar
 
Pentium iii
Pentium iiiPentium iii
Pentium iii
Shreya Baheti
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
Dwight Sabio
 
ARM Introduction.pptx
ARM Introduction.pptxARM Introduction.pptx
ARM Introduction.pptx
Pratik Gohel
 

Similar to AVX512 assembly language in FFmpeg (20)

WAN - trends and use cases
WAN - trends and use casesWAN - trends and use cases
WAN - trends and use cases
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
 
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Advanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISAAdvanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISA
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
Something about SSE and beyond
Something about SSE and beyondSomething about SSE and beyond
Something about SSE and beyond
 
Haskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHCHaskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHC
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) HypervisorAsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
Processors selection
Processors selectionProcessors selection
Processors selection
 
Pentium iii
Pentium iiiPentium iii
Pentium iii
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
ARM Introduction.pptx
ARM Introduction.pptxARM Introduction.pptx
ARM Introduction.pptx
 

More from Kieran Kunhya

Baby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language FunctionBaby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language Function
Kieran Kunhya
 
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Kieran Kunhya
 
Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...
Kieran Kunhya
 
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
Kieran Kunhya
 
5G for onboard racing car video
5G for onboard racing car video5G for onboard racing car video
5G for onboard racing car video
Kieran Kunhya
 
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP ShowcaseGround-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Kieran Kunhya
 
How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.
Kieran Kunhya
 
The challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT HardwareThe challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT Hardware
Kieran Kunhya
 
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Kieran Kunhya
 
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Kieran Kunhya
 
London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...
Kieran Kunhya
 
Don't just go IP - Go IT
Don't just go IP - Go ITDon't just go IP - Go IT
Don't just go IP - Go IT
Kieran Kunhya
 
Using IT Equipment in Live Broadcast
Using IT Equipment in Live BroadcastUsing IT Equipment in Live Broadcast
Using IT Equipment in Live Broadcast
Kieran Kunhya
 
Implementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfallsImplementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfalls
Kieran Kunhya
 
FOSS in Broadcast
FOSS in BroadcastFOSS in Broadcast
FOSS in Broadcast
Kieran Kunhya
 

More from Kieran Kunhya (15)

Baby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language FunctionBaby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language Function
 
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
 
Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...
 
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
 
5G for onboard racing car video
5G for onboard racing car video5G for onboard racing car video
5G for onboard racing car video
 
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP ShowcaseGround-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
 
How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.
 
The challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT HardwareThe challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT Hardware
 
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
 
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
 
London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...
 
Don't just go IP - Go IT
Don't just go IP - Go ITDon't just go IP - Go IT
Don't just go IP - Go IT
 
Using IT Equipment in Live Broadcast
Using IT Equipment in Live BroadcastUsing IT Equipment in Live Broadcast
Using IT Equipment in Live Broadcast
 
Implementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfallsImplementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfalls
 
FOSS in Broadcast
FOSS in BroadcastFOSS in Broadcast
FOSS in Broadcast
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 

AVX512 assembly language in FFmpeg

  • 1. AVX-512 assembly in FFmpeg Kieran Kunhya <kieran@kunhya.com>
  • 2. What is Assembly Language? • A low-level (close to the hardware) programming language • Intel x86 assembly language is (somewhat) backwards compatible to the 8008 CPU from 1972! • Languages like C are compiled into Assembly Language • Mnemonics are human readable versions of machine code
  • 3. What is AVX-512? • New SIMD (Single Instruction Multiple Data) Instruction Set for Intel since 2017, and more recently AMD CPUs • 512-bit register sizes (zmm) • Many new instructions • Opmasks • Comparison instructions • Many things not so interesting in multimedia (e.g cryptography, neural networks) • Lots of fancy words, but high schoolers have written assembly in FFmpeg!
  • 4. Why is this relevant now? • AVX-512 has been around since 2017. Why is it relevant now? • Old Skylake (server) systems had large performance throttling when using larger registers. AVX-512 remained unused in multimedia • Could still use new instructions with smaller registers • Can be beneficial in some cases (see later) • Ice Lake (10th and 11th Gen Intel) first to have no throttling
  • 5. How to get started? • Intel have removed AVX-512 support in consumer processors (from 12th Gen)  • Available in AMD Zen 4 • Still exists in Intel Server CPUs (Xeon). Available from cloud providers • Easiest way is to buy an Intel NUC (11th Gen)
  • 6. Existing work in multimedia using AVX-512 • dav1d project added AVX-512 support • AV1 decoding, particularly beneficial owing to large block sizes • 10-20% faster *overall* decoding • Classic FFmpeg/x264 approach to assembly • No intrinsics, nor inline assembly • Detect CPU capabilities and set function pointers to appropriate functions • Messy Venn Diagram of capabilities, but two in practice we care about
  • 7. CPU_FLAGS in FFmpeg int av_get_cpu_flags(void); #define AV_CPU_FLAG_AVX512 0x100000 ///< AVX-512 functions: requires OS support even if YMM/ZMM registers aren't used #define AV_CPU_FLAG_AVX512ICL 0x200000 ///< F/CD/BW/DQ/VL/VNNI/IFMA/VBMI/VBMI2/VPOPCNTDQ/BIT ALG/GFNI/VAES/VPCLMULQDQ
  • 8. Lanes • Older AVX ymm registers are split into lanes • Instructions (mainly) operate in lanes • Can be tricky to move data between lanes • Limitation on AVX2 code 16-byte Lane 1 16-byte Lane 0 256-bit ymm Register (old)
  • 9. K-mask registers • A new set of registers k0-k7 that allow the destination register to remain unchanged or set to zero • For example, addition, but only some values: • paddw zmm1{k1}, zmm2, zmm3 • kmovX instructions to manipulate k-mask registers
  • 10. vpermb • Byte shuffles (permute) are the one of the most important instructions in multimedia • Similar to existing pshufb instruction but cross-lane • Need to use k-masks with vpermb to zero out a byte • e.g for zigzag scan • Also, VPERMT2B, permute from two registers A B C D E F G H E A 0 C C B H D
  • 11. Variable shifts • vpsrlvw/vpsrlvd/vpsrlvq – variable right shift (logical) • vpsllvw/vpsllvd/vpsllvq – variable left shift (logical) • (letter soup, especially arithmetic shifts!) • Historically had to use multiple instructions and various trickery to achieve right shifts. Had many limitations. • Faster than multiply for left shifts
  • 12. vpternlogd • The kitchen sink of instructions! • Allows a programmable truth table to be implemented per bit input in each register • Can replace up to eight instructions! • e.g vpternlogd zmm0, zmm1, zmm2, 0xca • zmm0 = zmm0 ? zmm1 : zmm2; (for each bit) • http://0x80.pl/articles/avx512-ternary- functions.html
  • 13. Example: v210enc (1) • .loop: • movu ym1, [yq + 2*widthq] • vinserti32x4 m1, [uq + 1*widthq], 2 • vinserti32x4 m1, [vq + 1*widthq], 3 • vpermb m1, m2, m1 ; uyvx yuyx vyux yvyx • CLIPUB m1, m4, m5 • pmaddubsw m0, m1, m3 ; shift high and low samples of each dword and mask out other bits • pslld m1, 4 ; shift center sample of each dword • vpternlogd m0, m1, m6, 0xd8 ; C?B:A ; merge and mask out bad bits from B • movu [dstq], m0 • […] • jl .loop • Takes 3x 8-bit samples, extends to 10-bit and packs into 32-bits
  • 14. Example: v210enc (2) • Benchmarks (decicyles) • Skylake • v210_planar_pack_8_c: 2373.5 • […] • v210_planar_pack_8_avx2: 194.0 • v210_planar_pack_8_avx512: 174.0 • vpternlogd on a shorter ymm register • Ice Lake • v210_planar_pack_8_c: 2743.6 • v210_planar_pack_8_avx2: 246.6 • v210_planar_pack_8_avx512: 238.6 • v210_planar_pack_8_avx512icl: 122.1 • vpermb and zmm nearly twice as fast as avx512 ymm, and more than twenty times faster than C!
  • 15. What AVX-512 code next? • Anything involving line/frame based processing • e.g filters, scalers etc. • Comparisons • Vpternlogd in many places (e.g 3-way boolean) • Also change variable shifts in many places • Intel manual is very verbose (useful in some cases) • https://www.officedaytime.com/simd512e/