Talk given at internal Vimeo lunch talks with an intro to JPEG / image compression. There is a codebase that goes along with this, but it is not public yet, unfortunately.
Let's Write a JPEG Decoder (Vimeo Lunch Talks)
1. Let’s Write a JPEG Decoder
derekb@vimeo.com
@daemon404
Derek Buitenhuis
12 December 2018
New York, USA / The Internet
2. JPEG? Who cares?
• Good as a first step into codecs
• Extremely simple
• Doesn’t even have spatial prediction
• Convince people DCTs aren’t scary
• In extremely wide use and will continue to be for the foreseeable future
• Writing a JPEG encoder is a good hands-on way to get into hacking on multimedia code
• Real, viewable results
Vimeo Lunch Talks
4. Step 0: RGB to Y’CbCr
• Most JPEGs store the image as Y’CbCr
• Some weird ones store as CMYK or XYZ
• JFIF doesn’t actually define a way to tag this info other than “number of planes”
• Most images on the web use 4:2:0 subsampling
• Cb and Cr are half the resolution of Y’ in each dimension
• Save space for things that we notice more
• Always BT.601
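As a rough sketch of Step 0, the conversion and the 4:2:0 subsampling could look like the following. The constants are the full-range BT.601 equations given in the JFIF specification. The function names, the rounding, and the 2x2 averaging filter are assumptions made for illustration (real encoders may use other downsampling filters), and the subsampler assumes even plane dimensions.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 Y'CbCr (JFIF equations)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    # Round and clamp back into the 8-bit range.
    return tuple(min(255, max(0, round(c))) for c in (y, cb, cr))


def subsample_420(plane):
    """Halve a chroma plane in both dimensions by averaging each 2x2 group.

    Assumes the plane has even width and height (an illustrative simplification).
    """
    h, w = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]
```

White and black both land on neutral chroma (128, 128), which is why only Y’ changes for grayscale content.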
5. Step 1: Shift
• Subtract 128 from all values
• DCT = Discrete Cosine Transform
• Think of Cosine’s range: [-1,1]
• Implementation note: Be careful with implicit type conversions here (uint8 / int8)
• Example: 60 → -68
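A minimal sketch of the shift, with the type pitfall made visible. The helper names are hypothetical. `ctypes.c_uint8` is used only to imitate what happens in a language like C when the shifted value is stored back into an unsigned byte.

```python
import ctypes


def level_shift(block):
    """Subtract 128 from every sample, working in signed Python ints."""
    return [[int(v) - 128 for v in row] for row in block]


def shift_stored_as_uint8(v):
    """The pitfall: the shifted value wraps around if it is kept in a uint8."""
    return ctypes.c_uint8(v - 128).value


print(level_shift([[60, 255, 0]]))   # signed result
print(shift_stored_as_uint8(60))     # wrapped result, not -68
```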
6. Step 2: Apply 8x8 Forward DCT
• Split planes into 8x8 blocks
• Do this:
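The formula shown on the slide is not reproduced in this transcript. Assuming it is the standard 8x8 DCT-II used by JPEG, written with the symbols that the following slides define (G for coefficients, g for shifted samples, α for the scale factor), it would read:

```latex
G_{u,v} = \frac{1}{4}\,\alpha(u)\,\alpha(v)
          \sum_{x=0}^{7}\sum_{y=0}^{7} g_{x,y}
          \cos\!\left[\frac{(2x+1)\,u\,\pi}{16}\right]
          \cos\!\left[\frac{(2y+1)\,v\,\pi}{16}\right],
\qquad
\alpha(u) =
\begin{cases}
  \dfrac{1}{\sqrt{2}} & \text{if } u = 0 \\[4pt]
  1 & \text{otherwise}
\end{cases}
```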
7. 5 Second Overview of DSP
• Background:
• Convert the sample values into the frequency domain using a reversible transform
• Higher frequencies = Finer (less noticeable) details
• Lower frequencies = Coarser details (e.g. solid rectangles)
• DCT chosen over DFT because DCT happens to have a nice property where its energy is concentrated into a smaller set of coefficients, which is better for data compression.
• Intelligently drop higher frequencies we shouldn’t notice
• Intelligently reduce precision
9. Step 2: Apply 8x8 Forward DCT — Continued
• G_{u,v} is the resulting DCT coefficient at point u,v (see below)
• u and v are 0 to 7 (8 spatial frequencies in each direction, since we are using 8x8 blocks)
• g_{x,y} is the shifted sample value at point x,y in our 8x8 block
• α(u) is a normalizing scale factor: 1/√2 when u = 0, and 1 otherwise
• If you remember your linear algebra class, this makes the transform’s basis functions orthonormal
• Useful since we want to combine basis functions, and they have to be independent!
10. Step 2: Apply 8x8 Forward DCT — Continued
• Can be thought of as overlaying basis functions on top of each other at varying intensities
• This is where coefficients come into play
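To make the "intensities" idea concrete, here is a deliberately naive forward DCT, assuming the standard JPEG DCT-II definition. Each coefficient measures how strongly one cosine basis function is present in the block. The function names and the `g[x][y]` / `G[u][v]` indexing are choices made for this sketch, not taken from the talk's codebase.

```python
import math


def alpha(u):
    """Scale factor that normalizes the basis functions."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def fdct_8x8(g):
    """Naive O(n^4) forward DCT of an 8x8 block of level-shifted samples."""
    G = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (g[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            G[u][v] = 0.25 * alpha(u) * alpha(v) * total
    return G


# A perfectly flat block contains only the lowest-frequency basis function.
flat = [[10] * 8 for _ in range(8)]
coeffs = fdct_8x8(flat)
```

For the flat block, only `coeffs[0][0]` is non-zero (up to floating point error); every other basis function contributes nothing.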
11. Step 3: Zig-zag
• Notice: Low frequencies cluster near the top left and higher frequencies radiate out
• The top left (lowest frequency) value is called the DC Value
• The rest are called AC values
• These are named as such for historical reasons
• DCT was used to analyze electrical signals before this
• Re-ordering the coefficients using a zig-zag pattern yields a set ordered by frequency
• Useful for entropy coding (more on that later)
• This is where FFmpeg’s logo comes from
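The zig-zag order can be generated rather than typed in as a table. The sketch below walks the anti-diagonals of the block and alternates direction on each one. The names `zigzag_order` and `to_zigzag` are hypothetical, and real decoders usually hard-code the resulting 64-entry table.

```python
def zigzag_order(n=8):
    """Return raster indices (row * n + col) of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):            # s = row + col, one anti-diagonal each
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)         # even diagonals run bottom-left to top-right
        for r in rows:
            order.append(r * n + (s - r))
    return order


ZIGZAG = zigzag_order()


def to_zigzag(block):
    """Flatten an 8x8 block (list of rows) into a 64-entry zig-zag list."""
    return [block[i // 8][i % 8] for i in ZIGZAG]
```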
12. Step 4: Quantization
• Quantization generally refers to taking a continuous (or larger) set and sampling it, or mapping it, to a smaller (discrete) set.
• Aside: The universe is quantum in nature, so can we really call anything continuous?
• This is the lossy part of JPEG compression.
• We want to map our larger set of DCT coefficients (in our case, floats, but in real cases, a larger set of integers) to a smaller set of integers we’ll actually code into the bitstream
• We do this by dividing by an 8x8 quantization matrix, and rounding to integers
• This matrix is provided by the encoder, and coded into the bitstream
13. Step 4: Quantization — Continued
• Example quantization matrix, input block, and quantized output (figures not reproduced in this transcript)
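A small sketch of quantization and its inverse. The table is the example luminance table from Annex K of the JPEG specification (ITU-T T.81), used here only as sample data since an encoder may ship any table it likes. The function names are hypothetical, and Python's `round` (which rounds ties to even) stands in for whatever rounding a real encoder uses.

```python
# Example luminance quantization table from Annex K of ITU-T T.81.
LUMA_QUANT = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize(coeffs, table=LUMA_QUANT):
    """Encoder side: divide each coefficient by its table entry and round."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(8)]
            for u in range(8)]


def dequantize(levels, table=LUMA_QUANT):
    """Decoder side: multiply back. The rounding error is the lossy part."""
    return [[levels[u][v] * table[u][v] for v in range(8)]
            for u in range(8)]
```

Note how the large divisors toward the bottom right push small high-frequency coefficients to zero, which is what the next step exploits.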
14. Step 5: Run Length Encode Zeroes
• Lots of zeroes now! Let’s code them efficiently.
• Example set (in raster order): 57,45,0,0,0,0,23,0,-30,-16,0,0,1,0, …
• Code the values as pairs of the form (X,Y):
• X is the number of preceding zeroes
• Y is the next value
• Special case #1: (0,0) means fill the rest of the set with zeroes after this point
• Special case #2: (15,0) in the middle of a set means insert 16 zeroes
• From our example set: (0, 57); (0, 45); (4, 23); (1, -30); (0, -16); (2, 1); (0, 0)
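The pairing rules above can be sketched as an encoder and decoder over a flat list of coefficients. The function names are hypothetical, and the sketch ignores how the pairs are later packed into bits. In the example data, a single zero precedes -30, so its pair is (1, -30).

```python
def rle_encode(values):
    """Turn a list of coefficients into (zero_run, value) pairs."""
    # Index of the last non-zero value; everything after it becomes (0, 0).
    last = max((i for i, v in enumerate(values) if v != 0), default=-1)
    pairs = []
    run = 0
    for v in values[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:
                pairs.append((15, 0))   # special case #2: 16 zeroes
                run = 0
        else:
            pairs.append((run, v))
            run = 0
    if last < len(values) - 1:
        pairs.append((0, 0))            # special case #1: rest is zeroes
    return pairs


def rle_decode(pairs, size=64):
    """Expand (zero_run, value) pairs back into a list of `size` coefficients."""
    out = []
    for run, v in pairs:
        if (run, v) == (0, 0):
            break
        if (run, v) == (15, 0):
            out.extend([0] * 16)
        else:
            out.extend([0] * run)
            out.append(v)
    out.extend([0] * (size - len(out)))
    return out


example = [57, 45, 0, 0, 0, 0, 23, 0, -30, -16, 0, 0, 1] + [0] * 51
print(rle_encode(example))
```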
15. Step 6: DC Prediction
• Prediction means “predicting” a current value based off of other values
• The “other” values can be separated by space (different parts of the same image), or for video, time (different parts of previous or future images)
• Most prediction is done before DCT, on raw sample values
• JPEG does prediction post-DCT, but only on DC values
• Someone working on JPEG noticed DC values for neighboring blocks were kind of similar
• So instead of coding the DC value directly, code its diff to the previous block’s (in raster order) DC value
• First block predicts from an initial value of 0
• Each following block is coded as the difference from the previous block
• So if you have e.g. 3 blocks with DCs of 10, 12, 10, you end up coding 10, 2, -2
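The DC prediction described above amounts to a running difference on the encoder side and a running sum on the decoder side. A minimal sketch, with hypothetical function names and a single component assumed (in a real file each colour component keeps its own predictor):

```python
def dc_encode(dc_values):
    """Encoder side: code each DC value as its difference from the previous one."""
    prev = 0                      # the first block predicts from 0
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def dc_decode(diffs):
    """Decoder side: accumulate the differences to recover the DC values."""
    prev = 0
    dc_values = []
    for diff in diffs:
        prev += diff
        dc_values.append(prev)
    return dc_values


print(dc_encode([10, 12, 10]))
```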
16. Step 7: Huffman Coding
• Simple idea: Values that appear frequently in our data get assigned shorter codes
• Codes are variable length (sometimes called VLCs, or Variable Length Codes)
• JPEG writes only the lengths of these codes; the codes themselves can be regenerated using a known algorithm once the lengths are read.
• AC and DC coefficients have separate length tables coded (remember we predicted the DC value!)
• How we assign values to codes can be optimized “cleverly” in the encoder:
• Example: mozjpeg uses something akin to Viterbi
• These lengths are written as static tables in the JPEG
• The number of Huffman codes of each length (1 to 16 bits long) along with a sorted table of the byte values of each code.
• This will make more sense when you see the decoder code
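As a sketch of the "known algorithm", codes can be rebuilt from the 16 per-length counts and the sorted symbol list by counting upward and appending a zero bit each time the length grows. The function names are hypothetical. The sample table is the typical DC luminance table from Annex K of ITU-T T.81, and `bits` is assumed to be any iterator yielding 0 or 1.

```python
def build_huffman_codes(counts, symbols):
    """Map (length, code) -> symbol from JPEG-style table data.

    counts[i] is the number of codes that are i + 1 bits long (16 entries);
    symbols lists the byte values in order of increasing code length.
    """
    codes = {}
    code = 0
    k = 0
    for length in range(1, 17):
        for _ in range(counts[length - 1]):
            codes[(length, code)] = symbols[k]
            code += 1
            k += 1
        code <<= 1                # moving to the next length appends a 0 bit
    return codes


def decode_symbol(bits, codes):
    """Read bits one at a time until they match a known code."""
    code = 0
    for length in range(1, 17):
        code = (code << 1) | next(bits)
        if (length, code) in codes:
            return codes[(length, code)]
    raise ValueError("invalid Huffman code")


# Typical DC luminance table from Annex K of ITU-T T.81.
DC_LUMA_COUNTS = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
DC_LUMA_SYMBOLS = list(range(12))
DC_LUMA_CODES = build_huffman_codes(DC_LUMA_COUNTS, DC_LUMA_SYMBOLS)
```

Keying the dictionary on (length, code) matters because `0b10` and `0b010` are different codes with the same integer value.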
18. .jpeg isn’t JPEG
• What we think of as a “JPEG file” isn’t actually JPEG
• Called JFIF, and several versions exist; we’re covering 1.01
• This format is both extremely simple and way too flexible
• Allows for all sorts of crazy crap, while simultaneously being underspecified (APPN markers)
• The decoder we’re writing today makes a lot of assumptions about files being “good”
• It’s also very slow, since we’re going more for naivety than optimization
19. JFIF
• Basically a series of segments, each a marker followed by a 16-bit length
• 0xFF, 0xNN – NN is the marker
• 16-bit length
• (length - 2) bytes of data
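A minimal sketch of walking that marker structure. Like the decoder in the talk, it assumes a "good" file. The function name is hypothetical. The sketch treats only SOI (0xD8) and EOI (0xD9) as length-less markers and stops at the start-of-scan marker (0xDA), because entropy-coded data rather than another segment follows its header.

```python
import struct


def iter_segments(data):
    """Yield (marker, payload) for each segment in a JFIF byte string."""
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected 0xFF at offset %d" % pos)
        marker = data[pos + 1]
        pos += 2
        if marker in (0xD8, 0xD9):        # SOI / EOI carry no length or data
            yield marker, b""
            if marker == 0xD9:
                return
            continue
        # Big-endian 16-bit length that counts its own two bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        yield marker, data[pos + 2:pos + length]
        pos += length
        if marker == 0xDA:                # scan data follows; stop walking
            return
```

A caller would dispatch on the marker value, for example handing 0xDB payloads to a quantization table parser and 0xC4 payloads to a Huffman table parser.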
20. Before anything: You need a bitstream reader
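A bitstream reader can be very small. The sketch below reads most-significant bit first, which is the order JPEG uses. The class and method names are hypothetical, and it deliberately leaves out the handling a real JPEG reader needs for stuffed 0x00 bytes after 0xFF in entropy-coded data.

```python
class BitReader:
    """Read bits MSB-first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0                      # position in bits, not bytes

    def read_bit(self):
        byte = self.data[self.pos >> 3]   # raises IndexError past the end
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, n):
        """Read n bits and return them as an unsigned integer."""
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value


reader = BitReader(bytes([0b10110000, 0xFF]))
first = reader.read_bits(3)
second = reader.read_bits(5)
third = reader.read_bits(8)
```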
23. IDCT
• We can calculate the inverse of the DCT, called the IDCT
• No more or less scary than the forward DCT
• Our implementation will use simple matrix multiplication and floats
• Real world implementations use fast integer transforms based on butterflies (see references at end)
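One way to do the IDCT with plain matrix multiplication and floats is sketched below, assuming the standard JPEG DCT-II definition. With C the 8x8 DCT matrix, the forward transform is G = C g Cᵀ and, because C is orthonormal, the inverse is g = Cᵀ G C. The helper names are hypothetical and this is not the talk's actual code. A decoder would still add 128 back and clamp to 0..255 afterwards.

```python
import math


def dct_matrix():
    """8x8 DCT-II matrix: C[u][x] = alpha(u) / 2 * cos((2x + 1) * u * pi / 16)."""
    return [[(math.sqrt(0.5) if u == 0 else 1.0) * 0.5
             * math.cos((2 * x + 1) * u * math.pi / 16)
             for x in range(8)]
            for u in range(8)]


def matmul(a, b):
    """Multiply two 8x8 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def idct_8x8(G):
    """Inverse DCT: coefficients back to level-shifted samples."""
    C = dct_matrix()
    return matmul(matmul(transpose(C), G), C)


def fdct_8x8(g):
    """Forward DCT in the same matrix form, included for round-trip checks."""
    C = dct_matrix()
    return matmul(matmul(C, g), transpose(C))
```

A block whose only non-zero coefficient is a DC value of 80 decodes to a flat block of 10s, and `idct_8x8(fdct_8x8(block))` returns the original block up to floating point error.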
24. Links & References to Read
• Start from nothing: https://dspguide.com/pdfbook.html
• Very good intro to JFIF and JPEG: http://www.opennet.ru/docs/formats/jpeg.txt
• More advanced background (where AA&N fast DCT came from, and why, and why things are the way they are (AC/DC)): https://www.amazon.com/JPEG-Compression-Standard-Multimedia-Standards/dp/0442012721/
• THE intro to video codecs: https://www.amazon.com/H-264-Advanced-Video-Compression-Standard/dp/0470516925/ (can be found digitally)