London Video Tech - Adventures in cutting every last millisecond from glass-to-glass latency
1. Adventures in cutting every last millisecond from glass-to-glass latency
Kieran Kunhya – kierank@obe.tv
@openbroadcastsy, @kierank_
2. Who am I, who are we?
• I work on FFmpeg, x264 and others…
• A lot of professional video in OSS probably has my fingerprints on it
• At $job, Open Broadcast Systems builds software for broadcasters, mainly point-to-point video encoding/decoding for news/sport etc.
• Not to be confused with:
3. What I will talk about
• Minimising every last millisecond of latency from broadcast production processes (before distribution)
• Encoding and decoding are often the dominant source of latency – will focus on this
• Doing this from a software engineering standpoint
• Not much (if any) public material about this at all
• Hardware-centric industry – “secret sauce” thinking
4. What I will not talk about
• Doing live production with high-bandwidth (10-100GbE) networking
• The network stack in between (FEC vs SRT vs RIST)
  • Not the right audience
  • Demuxed 2017 video
5. Live broadcast production processes (1)
• Processes in black boxes, e.g. routing, graphics, switching, mixing, recording, monitoring, playout, subtitling, standards conversion etc…
• Infrastructure as complex as delivery, if not more so
6. Live broadcast production processes (2)
• Heavily hardware (FPGA/DSP) centric
• Fixed-function, black-box products
• Low-latency processes in studio
• “Video lines” of latency – on the order of 10-100 µs
• Uncompressed video – high data rates, many Gbps
• Legacy usage of satellite, fibre, SDI, ASI
• Includes premium live web video!
7. Video contribution
• Getting content from a remote place to one or more central places, often a studio or aggregation centre
• Minimise latency
  • Often fast-paced interviews/debates
• Often uneconomical to pay for uncompressed
• Remote production: director not onsite, back at base
8. The live production environment
• Largely SDI (coax) based
  • Unidirectional, Gbps video
  • Latency on the order of video lines (~40 µs)
• Many I/O boards to do this
  • Low latency abstracted away into ~frames (40 ms) – a 1000× increase!
  • SDKs hide the capabilities of the electronics
  • Internal buffering?
  • Hardware doing the data processing (offload)
9. SDI from a software engineer’s point of view
• I want the software to do as much as is reasonably possible
• A driver, not the SDK+driver hybrid that *all* manufacturers ship
• “Offload” is irrelevant in 2019
• Start processing the data as soon as a field arrives, not a whole frame
  • Later on, processing chunk by chunk
• I/O in the purest sense
  • Write data and have it put on the wire *now*
10. What you often get in reality
• Video and audio on separate file descriptors
  • Can never open them simultaneously, so can never have exact lipsync
• Long delays in and out of the card (~2-3 frames)
…
• Not all audio tracks available
• Audio out of sync
• Video downconverted to 8-bit
• Not all blanking data available; less common parameters not changeable
12. SDI from a software engineer’s point of view
• Massive time and expense for the most important 4 lines of code
• DMA (direct memory access) buffers of 8192 bytes (approx. 1 HD line)
• Get an interrupt every 32 buffers
• Can capture, process and push out chunks of video within ~100s of lines!
• Tight timescales – need to be aware of thread priority, CPU power saving etc.
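The buffer and interrupt figures above lend themselves to quick back-of-envelope arithmetic. A sketch in Python; the 8192-byte buffer and 32-buffer interrupt cadence are the slide's numbers, while the line and frame figures are standard 1080i25 values:

```python
# Cadence of the DMA scheme above: 8192-byte buffers (~1 HD line),
# an interrupt every 32 buffers. Line/frame figures are standard 1080i25.

BUF_BYTES = 8192
LINE_BYTES = 1920 * 2 * 2   # 1920 pixels, UYVY 4:2:2, 10-bit samples in
                            # 16-bit containers -> 7680 bytes, fits one buffer

TOTAL_LINES = 1125          # lines per 1080i25 frame, including blanking
FRAME_MS = 40.0             # 1080i25 frame period
line_ms = FRAME_MS / TOTAL_LINES     # ~0.036 ms per line
irq_ms = 32 * line_ms                # an interrupt roughly every 1.1 ms
irqs_per_frame = TOTAL_LINES / 32    # ~35 chances per frame to process a chunk

print(f"line fits buffer: {LINE_BYTES <= BUF_BYTES}")
print(f"interrupt every {irq_ms:.2f} ms, ~{irqs_per_frame:.0f} per frame")
```

This is why the thread-priority and power-saving caveat matters: the software has just over a millisecond to wake up and drain the buffers before the next batch lands.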
13. SDI from a software engineer’s point of view
• CRC not software-centric (10-bit data, 25-bit polynomial)
  • We offload this; otherwise a big waste of CPU
• Very tedious to build a frame correctly, lots of legacy
• Difficult to verify – tools are all hardware-based
• 1080p50/60 – 3G-SDI Level B, very software-unfriendly
• (and lots of other implementation details)
14. Pixel formats
Only the YUV 4:2:2 domain (as an example)!
• Planar 10-bit – main working format
• Planar 8-bit – preview quality
• UYVY 10-bit (16-bit aligned) – SDI datastream
• Apple v210 – some hardware
• Contiguous 10-bit – SDI wire format
15. Pixel formats
Handwritten (no intrinsics!) SIMD for every mapping (and others).
• 5-15x speed improvements compared to C
• Do it once, make it fast once and for all (until a new CPU…)
• A generic conversion library is a difficult problem
  • Intermediate pixel format(s) are always a compromise
  • Add special cases until you’ve done them all!
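To illustrate what one of these mappings involves, here is a scalar sketch (Python, not the production SIMD) of the v210 layout: v210 stores the interleaved 4:2:2 sample sequence U0 Y0 V0 Y1 U1 Y2 … as three 10-bit values per little-endian 32-bit word (6 pixels → 12 samples → 16 bytes). The SIMD versions apply the same shift/shuffle pattern to many samples at once.

```python
import struct

def pack_v210(samples):
    """Pack a flat list of 10-bit 4:2:2 samples (U0 Y0 V0 Y1 U1 Y2 ...,
    length a multiple of 3) into v210 bytes: three values per LE 32-bit word."""
    out = bytearray()
    for i in range(0, len(samples), 3):
        a, b, c = samples[i:i + 3]
        word = (a & 0x3FF) | ((b & 0x3FF) << 10) | ((c & 0x3FF) << 20)
        out += struct.pack('<I', word)
    return bytes(out)

def unpack_v210(data):
    """Inverse mapping: recover the flat 10-bit sample list from v210 bytes."""
    out = []
    for (word,) in struct.iter_unpack('<I', data):
        out += [word & 0x3FF, (word >> 10) & 0x3FF, (word >> 20) & 0x3FF]
    return out

# 6 pixels = 12 samples = 4 words = 16 bytes, and the mapping round-trips:
samples = [64, 512, 940, 128, 300, 700, 1, 2, 3, 1020, 0, 511]
assert unpack_v210(pack_v210(samples)) == samples
```

A C version of the same loop is what the hand-written assembly is measured against; the 2 unused bits per word are what make v210 awkward for generic conversion code.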
16. Basic encode / decode pipeline
• Encoder
  • Capture: 1-2 frames
  • Encode (x264 lowest-latency, no audio compression): 1 frame
  • Mux and other processing: ~5ms
• Decoder
  • Wait for frame to arrive: 1 frame
  • Decode the frame: 1 frame
  • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
  • Push to wire: 1-2 frames
• Basic implementation: 7 frames, 280ms at 1080i25
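The budget above, written out as arithmetic (stage figures are the slide's; the 7-frame total reads as 2 frames of capture and 1 frame of push-to-wire out of the quoted ranges):

```python
# Latency budget of the basic pipeline, in frame periods. 1080i25 -> 40 ms/frame.
FRAME_MS = 40.0

frames = {
    "capture":        2,  # slide says 1-2 frames; the 7-frame total takes 2
    "encode":         1,  # x264 lowest-latency, no audio compression
    "wait for frame": 1,
    "decode":         1,
    "frame sync":     1,  # drop/duplicate video, resample audio
    "push to wire":   1,  # slide says 1-2 frames; the 7-frame total takes 1
}
total_frames = sum(frames.values())
total_ms = total_frames * FRAME_MS   # plus ~5 ms of mux/other processing

print(total_frames, total_ms)  # 7 frames, 280.0 ms
```

The slides that follow shave this budget stage by stage, which is why the totals step down in roughly 40 ms (one-frame) or 20 ms (one-field) increments.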
17. Better encode / decode pipeline
• Encoder
  • Capture: 1 frame
  • Encode (x264 lowest-latency, no audio compression): 1 frame
  • Mux and other processing: ~5ms
• Decoder
  • Wait for frame to arrive: 1 frame
  • Decode the frame: 1 frame
  • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
  • Push to wire: 1 frame (10ms)
• Better implementation: 5.x frames, 210ms at 1080i25
18. Better encode / decode pipeline
• Encoder
  • Capture: 1 frame
  • Encode (x264 lowest-latency, no audio compression): 1 frame
  • Mux and other processing: ~5ms
• Decoder
  • Wait for frame to arrive: 1 frame
  • Decode the frame: 1 frame
  • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
  • Push to wire: 1 frame
• Better implementation: 6 frames, 240ms at 1080i25
19. Decode frame as it arrives on the wire
• Fix FFmpeg chunk decode
• Slices arrive at destination
• Complete frame is built
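A toy model (illustrative figures, not FFmpeg code) of why slice-by-slice decode helps: each slice's decode overlaps the arrival of the next, so only the last slice's decode lands after the frame has fully arrived.

```python
# Compare a decoder that waits for the whole frame with one that decodes
# each slice as it lands on the wire. All numbers are assumptions.

FRAME_MS = 40.0    # assumed frame arrival time on the wire (1080i25 pacing)
SLICES = 4         # assumed slices per frame
DECODE_MS = 10.0   # assumed decode cost for a whole frame

def frame_at_once():
    """Whole frame arrives, then the whole frame is decoded."""
    return FRAME_MS + DECODE_MS

def slice_as_it_arrives():
    """Slice i's decode starts once it has arrived and the previous slice
    is decoded; only the last decode happens after the frame is complete."""
    arrival = FRAME_MS / SLICES      # a slice lands every 10 ms
    per_slice = DECODE_MS / SLICES   # 2.5 ms to decode each slice
    done = 0.0
    for i in range(1, SLICES + 1):
        done = max(i * arrival, done) + per_slice
    return done

print(frame_at_once(), slice_as_it_arrives())  # 50.0 vs 42.5
```

With more slices the overlapped version approaches "frame arrival time plus one slice's decode", which is the effect the pipeline slides below bank as saved latency.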
20. Better encode / decode pipeline
• Encoder
  • Capture: 1 frame
  • Encode (x264 lowest-latency, no audio compression): 1 frame
  • Mux and other processing: ~5ms
• Decoder
  • Wait for frame to arrive: 1 frame
  • Decode the frame as it arrives: 1 frame
  • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
  • Push to wire: 10ms
• Better implementation: 4.x frames, 170ms at 1080i25
21. Better encode / decode pipeline
• Encoder
  • Capture: 1 field
  • Encode (x264 lowest-latency, no audio compression): 1 field
  • Mux and other processing: ~5ms
• Decoder
  • Decode the frame as it arrives: 1 frame
  • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
  • Push to wire: 10ms
• Better implementation: 3.x frames, ~130ms at 1080i25
23. Clocks
• Drift clock to match the remote clock
  • Clocks do not match (temperature etc.); drift can be fast
• Control the onboard oscillator on the SDI card to match the remote clock
  • Saves having to drop/duplicate video and resample audio to match
  • Same number of frames pushed per hour, per day etc.
• At low latencies, clock drift bites you quicker
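How quickly drift bites can be sketched with an assumed oscillator tolerance (50 ppm is a typical crystal spec, not a figure from the talk):

```python
# Two free-running clocks some ppm apart accumulate drift at a steady rate;
# without steering the SDI card's oscillator, that drift must be absorbed
# by dropping/duplicating frames and resampling audio.

PPM = 50    # assumed relative frequency error between sender and receiver
FPS = 25    # 1080i25

drift_s_per_hour = 3600 * PPM / 1e6        # 0.18 s of drift per hour
frames_per_hour = drift_s_per_hour * FPS   # ~4.5 frames to drop/dup per hour

# A 50 ms end-to-end buffer is consumed in roughly a quarter of an hour:
seconds_to_drain = 0.050 / (PPM / 1e6)     # 1000 s, ~17 minutes

print(frames_per_hour, seconds_to_drain)
```

At a 280 ms budget the same drift takes hours to matter; at 50 ms it is minutes, which is why oscillator control replaces the frame-synchronisation stage in the next pipeline.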
24. Better encode / decode pipeline
• Encoder
  • Capture: 1 field
  • Encode (x264 lowest-latency, no audio compression): 1 field
  • Mux and other processing: ~5ms
• Decoder
  • Decode the frame as it arrives: 1 frame
  • Push to wire: 10ms
• Better implementation: 2.x frames, 90ms at 1080i25
25. Better encode / decode pipeline
• Encoder
  • Capture: 1 field
  • Encode (x264 lowest-latency, no audio compression): 1 field
  • Mux and other processing: ~5ms
• Decoder
  • Decode the frame as it arrives: 1 frame
  • Push to wire: 10ms
  • Decode the frame to the wire as it arrives
• Better implementation: 1.x frames, ~50ms at 1080i25
26. Chunk-based encode and decode
• Throughout all of these improvements, bitrate stays roughly the same – no loss in picture quality, owing to H.264 bit-exact decode
• Diminishing returns now, but some very high-end applications demand even lower latency
• Not a good idea for H.264; ratecontrol would prefer a full frame
  • Codecs like JPEG 2000, VC-2 and JPEG XS operate on slices
  • Limited use of slice-based encoding in software
• Capture, encode, decode and render before the frame has even finished arriving on the wire at the source (~20ms latency)
  • Concert video walls, VR etc.
27. Chunk-based encode and decode
(diagram: Source → Destination)
• 10-20ms end-to-end
• Huge bitrate penalty (~100s of Mbps)
• High-quality network also required
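Rough arithmetic behind the 10-20 ms figure: with slice/chunk codecs the unit of latency is lines rather than frames. The chunk size and per-stage buffering below are illustrative assumptions, not figures from the talk.

```python
# End-to-end latency of a chunk-based pipeline, counted in line times.

TOTAL_LINES = 1125   # total lines per 1080p50 frame, including blanking
FRAME_MS = 20.0      # 1080p50 frame period
line_ms = FRAME_MS / TOTAL_LINES

CHUNK_LINES = 64                    # assumed slice height
chunk_ms = CHUNK_LINES * line_ms    # ~1.1 ms to accumulate one chunk

STAGES = 5                          # capture, encode, network, decode, render
CHUNKS_PER_STAGE = 2                # assumed buffering per stage
end_to_end_ms = STAGES * CHUNKS_PER_STAGE * chunk_ms

print(f"{end_to_end_ms:.1f} ms end-to-end")  # ~11 ms, in the 10-20 ms range
```

The same sum in whole frames would be 100 ms; operating on chunks is where the order-of-magnitude improvement comes from, at the cost of the slide's bitrate penalty.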
28. Thanks
• Thanks to the team working on this:
  • James Darnley
  • Rafael Carre
  • Sam Willcocks