Derek Buitenhuis discusses the challenges of maintaining backwards compatibility for progressive MP4 video playback when storage has transitioned to fragmented MP4. He describes building a clever service called "Artax" to proxy fragmented MP4 files and make them appear as progressive downloads to support legacy playback methods, while avoiding expensive transcoding and storage costs. The service carefully parses MP4 metadata to understand packet interleaving and efficiently handle range requests by calculating starting positions within source files and output chunks. Maintaining this level of precision for range requests was a significant engineering challenge.
A Progressive Approach to the Past: Ensuring Backwards Compatibility Through Cleverness and Pain
1. A Progressive Approach to the Past:
Ensuring Cheap Backwards Compatibility Through Cleverness and Pain
derekb@vimeo.com / derek@videolan.org
@daemon404
Derek Buitenhuis
13 April 2021
The British Internet
2. Who’s this guy?
• Principal Video Engineer @ Vimeo
• Open source developer (FFmpeg, FFMS2, rav1e, obuparse, etc.)
• VideoLAN non-profit board member
• Professional Twitter Sh*tposter
4. Who am I, really?
• Currently, I’m this guy:
5. Sins of Multimedia Past Last Forever
• It is 2021. We encode to and serve fragmented MP4 for VoD.
• Audio and video are separate files.
• Segments are just range requests.
• Easier logic, easier caching.
• Some of us encode to progressive MP4, and segment at the edge.
• Can be expensive, can require running and maintaining services.
• Some people use MPEG-TS as their mezzanine. These people are monsters.
• Problems:
• Some Very Bad Programs can only consume progressive MP4.
• Your company made a bad decision over 10 years ago to give direct progressive MP4 URLs
to the highest paying customers.
• 10+ years of hardcoded URLs and API use. You also support VOD downloads.
6. Support Options
• Don’t store videos as FMP4; store as progressive.
• Almost all traffic will have to be segmented at edge. This is expensive and dumb.
• Entirely remove progressive MP4 support.
• Least engineering work, most product work.
• Anger your highest paying users. Anger product. Anger marketing. Anger viewers on terrible devices.
• Store progressive MP4s as well, or just one rendition, such as 720p.
• Not much work.
• A lot of expensive storage for a rarely used rendition of every single video.
• People will still be angry because you took away their 240p or 4K, etc.
• Write a Very Clever Service to proxy FMP4s and make them appear progressive.
• Most engineering work.
• Service will be low volume, and thus fairly cheap.
7. So You’ve Chosen Pain
• Obviously we chose the difficult engineering one.
• Things it needed:
• Transparently expose a set of FMP4 (one video, one audio) as a progressive MP4.
• Must support exact range requests, for playback in browser and Akamai cacheability.
• Every request must be performant.
• Can’t read all the source moof boxes every time (more on this later).
• There are so many MP4 muxers and demuxers, but they’re all generic and not suitable.
• Source MP4s have all the info we need, such as mdat box offsets, timestamps, and sample sizes,
so the real solution is closer to de-/re-serialization.
• All input is known good. Bad input should be hard-rejected.
• So I wrote one.
10. MP4 Anatomy (Deeper)
[ftyp: File Type Box]
[moov: Movie Box]
[mvhd: Movie Header Box]
[trak: Track Box]
[tkhd: Track Header Box]
[edts: Edit Box]
[elst: Edit List Box]
[mdia: Media Box]
[mdhd: Media Header Box]
[hdlr: Handler Reference Box]
[minf: Media Information Box]
[vmhd: Video Media Header Box]
[dinf: Data Information Box]
[dref: Data Reference Box]
[url : Data Entry Url Box]
[stbl: Sample Table Box]
[stsd: Sample Description Box]
[avc1: Visual Description]
[avcC: AVC Configuration Box]
[colr: Colour Information Box]
[stts: Decoding Time to Sample Box]
[stsc: Sample To Chunk Box]
[stsz: Sample Size Box]
[stco: Chunk Offset Box]
[sgpd: Sample Group Description Box]
[sbgp: Sample to Group Box]
[mvex: Movie Extends Box]
[mehd: Movie Extends Header Box]
[trex: Track Extends Box]
[sidx: Segment Index Box]
[moof: Movie Fragment Box]
[mfhd: Movie Fragment Header Box]
[traf: Track Fragment Box]
[tfhd: Track Fragment Header Box]
[tfdt: Track Fragment Base Media Decode Time Box]
[trun: Track Fragment Run Box]
[sgpd: Sample Group Description Box]
[sbgp: Sample to Group Box]
[mdat: Media Data Box]
→ (the fragmented source layout above is remuxed into the progressive layout below)
[ftyp: File Type Box]
[moov: Movie Box]
[mvhd: Movie Header Box]
[trak: Track Box]
[tkhd: Track Header Box]
[edts: Edit Box]
[elst: Edit List Box]
[mdia: Media Box]
[mdhd: Media Header Box]
[hdlr: Handler Reference Box]
[minf: Media Information Box]
[vmhd: Video Media Header Box]
[dinf: Data Information Box]
[dref: Data Reference Box]
[url : Data Entry Url Box]
[stbl: Sample Table Box]
[stsd: Sample Description Box]
[avc1: Visual Description]
[avcC: AVC Configuration Box]
[colr: Colour Information Box]
[stts: Decoding Time to Sample Box]
[ctts: Composition Time to Sample Box]
[stss: Sync Sample Box]
[stsc: Sample To Chunk Box]
[stsz: Sample Size Box]
[co64: Chunk Offset Box]
[sgpd: Sample Group Description Box]
[sbgp: Sample to Group Box]
[mdat: Media Data Box]
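To make the box trees above concrete: every MP4 box is a 32-bit big-endian size plus a four-character type, and container boxes simply nest child boxes in their payload. A minimal walker (a Python sketch, not the talk's actual code) looks like this:

```python
import struct

def walk_boxes(buf, offset=0, end=None, depth=0):
    """Recursively print the box tree of an MP4 byte buffer."""
    # Boxes whose payload is itself a sequence of child boxes.
    containers = {b"moov", b"trak", b"edts", b"mdia", b"minf", b"dinf",
                  b"stbl", b"mvex", b"moof", b"traf"}
    if end is None:
        end = len(buf)
    while offset + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, offset)
        header = 8
        if size == 1:   # 64-bit "largesize" follows the fourcc
            size = struct.unpack_from(">Q", buf, offset + 8)[0]
            header = 16
        elif size == 0:  # box extends to the end of the file
            size = end - offset
        print("  " * depth + kind.decode("ascii"))
        if kind in containers:
            walk_boxes(buf, offset + header, offset + size, depth + 1)
        offset += size
```

The `size == 1` and `size == 0` cases follow ISO/IEC 14496-12; co64 in the progressive tree exists precisely because 32-bit stco chunk offsets can overflow for large remuxed files.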
11. moov Box Strategy
• Parse the input moov and sidx boxes.
• Use the moof offsets from the sidx boxes to parse all the moofs in parallel on a thread pool.
• Construct all the non-mdat output boxes from this upfront, before remuxing.
• This allows us to know the moov size, full file size, PTS/DTS, sync points,
and all mdat offsets upfront. This is extremely important for Content-Length and range
request support.
• Since we have all the exact parsed info from the source boxes, every size and offset
is calculable with a bit of book-keeping.
• Cache this information so that any future requests are fast.
• Now about range request support…
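The sidx walk in the strategy above can be sketched as follows, assuming the ISO/IEC 14496-12 sidx layout (version/flags, reference_ID, timescale, earliest_presentation_time, first_offset, then reference entries); the function name and return shape are illustrative, not the service's real API. The offsets it yields are what you would fan out to a thread pool, one moof parse per worker:

```python
import struct

def sidx_fragment_offsets(sidx_payload, sidx_end_offset):
    """Return (absolute_offset, size) for each moof+mdat fragment
    referenced by a sidx box.

    sidx_payload: the box payload (after the 8-byte size/fourcc header).
    sidx_end_offset: absolute file offset of the first byte after the
    sidx box; sidx reference offsets are relative to this anchor.
    """
    version = sidx_payload[0]
    pos = 4 + 8  # skip version/flags, reference_ID, timescale
    if version == 0:
        _ept, first_offset = struct.unpack_from(">II", sidx_payload, pos)
        pos += 8
    else:
        _ept, first_offset = struct.unpack_from(">QQ", sidx_payload, pos)
        pos += 16
    pos += 2  # reserved
    (ref_count,) = struct.unpack_from(">H", sidx_payload, pos)
    pos += 2
    offsets, cur = [], sidx_end_offset + first_offset
    for _ in range(ref_count):
        word, _duration, _sap = struct.unpack_from(">III", sidx_payload, pos)
        pos += 12
        ref_size = word & 0x7FFFFFFF  # top bit is reference_type
        offsets.append((cur, ref_size))
        cur += ref_size
    return offsets
```

With the list in hand, `concurrent.futures.ThreadPoolExecutor` (or its equivalent in whatever language the service is written in) can fetch and parse each moof independently.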
12. mdat Box Strategy
• Packet sizes and positions in the source files are all known.
• We need to properly interleave audio and video chunks.
• Chose 500ms interleaving.
• This interleaving is state – it must be consistent regardless of which range was requested.
• For example, you need to know, for any given range, how many packets into the chunk
you are when writing, and how they’re interleaved, 100% exactly.
• More on this in a second.
• We want to use persistent HTTP connections for reading all the mdats from source files.
• This means taking a minor hit in bandwidth by skipping over moofs, in order to keep it persistent.
• A prefetch is useful here.
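The interleaving bookkeeping described above might look like this sketch: given per-packet durations and sizes from the parsed moofs, it deterministically emits alternating ~500 ms video/audio chunks, so the exact same schedule can be recomputed for any range request. All names and the return shape here are hypothetical; the real service keeps richer per-packet state:

```python
def build_chunk_schedule(video, audio, interleave=0.5):
    """Deterministically interleave two packet lists into alternating
    video/audio chunks of roughly `interleave` seconds each.

    video/audio: lists of (duration_seconds, size_bytes) per packet.
    Returns ("v"|"a", first_packet_index, packet_count, chunk_bytes)
    tuples describing the output mdat layout.
    """
    schedule, vi, ai = [], 0, 0
    vt = at = 0.0  # media time emitted so far, per track

    def take(track, i, t, tag):
        start, total, size = i, 0.0, 0
        while i < len(track) and total < interleave:
            dur, sz = track[i]
            total += dur
            size += sz
            i += 1
        if i > start:
            schedule.append((tag, start, i - start, size))
        return i, t + total

    while vi < len(video) or ai < len(audio):
        # Emit whichever track is behind; video first on ties.
        if ai >= len(audio) or (vi < len(video) and vt <= at):
            vi, vt = take(video, vi, vt, "v")
        else:
            ai, at = take(audio, ai, at, "a")
    return schedule
```

Because the function is pure over the parsed metadata, "the exact position and state of the packet interleaving" for any byte offset falls out of replaying (or caching) this schedule.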
13. Range Request Strategy
• ftyp and moov boxes are calculated and cached already (byte buffer) – ranges for this are easy.
• Need to be careful when handling ranges which straddle the cached moov and mdat box boundaries.
• mdat is much trickier:
• We need to calculate which source mdats (there are many per stream, remember) to start reading from.
• We need to know which packets within these mdats to start outputting, and when to stop.
• We need to know how many bytes of the first and last written packets to ignore to satisfy the range.
• We need to know the exact position and state of the packet interleaving where this range starts.
• With a little pain, we can calculate this on each request, since we will know exactly what the
chunk pattern is, e.g. 12 video packets / 24 audio packets / repeat.
• If this sounds like a ton of tricky book-keeping, you are correct.
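The start-of-range calculation above can be sketched as a search over the deterministic chunk layout: subtract the cached header size, binary-search the cumulative chunk sizes, then walk packets within the chunk to find how many bytes of the first written packet to discard. This is a hypothetical simplification — the real service must also map these coordinates back to offsets in the source files:

```python
import bisect

def locate_range_start(range_start, header_size, chunks):
    """Map the first byte of an HTTP range onto the output mdat layout.

    header_size: total bytes of ftyp + moov (+ mdat header) preceding
    packet data in the generated progressive file.
    chunks: list of lists of packet sizes, in output interleaving order
    (one inner list per interleaved chunk).
    Returns (chunk_index, packet_index, bytes_to_skip_in_first_packet),
    or None if the range starts inside the cached header bytes.
    Out-of-range offsets are assumed to be rejected upstream.
    """
    pos = range_start - header_size
    if pos < 0:
        return None  # served straight from the cached ftyp/moov buffer
    # Cumulative end offset of each chunk, for a binary search.
    ends, total = [], 0
    for c in chunks:
        total += sum(c)
        ends.append(total)
    ci = bisect.bisect_right(ends, pos)
    into_chunk = pos - (ends[ci - 1] if ci else 0)
    for pi, size in enumerate(chunks[ci]):
        if into_chunk < size:
            return (ci, pi, into_chunk)
        into_chunk -= size
    raise AssertionError("unreachable for in-range offsets")
```

The symmetric end-of-range calculation (how many trailing bytes of the last packet to drop) follows the same pattern, which is exactly the "ton of tricky book-keeping" the slide warns about.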