Quantization and Data Compression

  1. 1. 4. Quantization and Data Compression ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak
  2. 2. What is data compression? •  Reducing the file size without compromising the quality of the data stored in the file too much (lossy compression) or at all (lossless compression). •  With compression, you can fit higher-quality data (e.g., higher-resolution pictures or video) into a file of the same size as required for lower-quality uncompressed data. Ilya Pollak
  3. 3. Why data compression? •  Our appetite for data (high-resolution pictures, HD video, audio, documents, etc.) seems to always significantly outpace hardware capabilities for storage and transmission. Ilya Pollak
  4. 4. Data compression: Step 0 •  If the data is continuous-time (e.g., audio) or continuous-space (e.g., picture), it first needs to be discretized. Ilya Pollak
  5. 5. Data compression: Step 0 •  If the data is continuous-time (e.g., audio) or continuous-space (e.g., picture), it first needs to be discretized. •  Sampling is typically done nowadays during signal acquisition (e.g., digital camera for pictures or audio recording equipment for music and speech). Ilya Pollak
  6. 6. Data compression: Step 0 •  If the data is continuous-time (e.g., audio) or continuous-space (e.g., picture), it first needs to be discretized. •  Sampling is typically done nowadays during signal acquisition (e.g., digital camera for pictures or audio recording equipment for music and speech). •  We will not study sampling. It is studied in ECE 301, ECE 438, and ECE 440. •  We will consider compressing discrete-time or discrete-space data. Ilya Pollak
  7. 7. Example: compression of grayscale images •  An eight-bit grayscale image is a rectangular array of integers between 0 (black) and 255 (white). •  Each site in the array is called a pixel. Ilya Pollak
  8. 8. Example: compression of grayscale images •  An eight-bit grayscale image is a rectangular array of integers between 0 (black) and 255 (white). •  Each site in the array is called a pixel. •  It takes one byte (eight bits) to store one pixel value, since it can be any number between 0 and 255. Ilya Pollak
  9. 9. Example: compression of grayscale images •  An eight-bit grayscale image is a rectangular array of integers between 0 (black) and 255 (white). •  Each site in the array is called a pixel. •  It takes one byte (eight bits) to store one pixel value, since it can be any number between 0 and 255. •  It would take 25 bytes to store a 5x5 image. Ilya Pollak
  10. 10. Example: compression of grayscale images •  An eight-bit grayscale image is a rectangular array of integers between 0 (black) and 255 (white). •  Each site in the array is called a pixel. •  It takes one byte (eight bits) to store one pixel value, since it can be any number between 0 and 255. •  It would take 25 bytes to store a 5x5 image. •  Can we do better? Ilya Pollak
  11. 11. Example: compression of grayscale images 255 255 255 255 255 255 255 255 255 255 200 200 200 200 200 200 200 200 200 200 200 200 200 200 100 Can we do better than 25 bytes? Ilya Pollak
  12. 12. Two key ideas •  Idea #1: –  Transform the data to create lots of zeros. Ilya Pollak
  13. 13. Two key ideas •  Idea #1: –  Transform the data to create lots of zeros. For example, we could rasterize the image, compute the differences, and store the top left value along with the 24 differences [in reality, other transforms are used, but they work in a similar fashion] Ilya Pollak
  14. 14. Two key ideas •  Idea #1: –  Transform the data to create lots of zeros. For example, we could rasterize the image, compute the differences, and store the top left value along with the 24 differences [in reality, other transforms are used, but they work in a similar fashion]: –  255,0,0,0,0,0,0,0,0,0,−55,0,0,0,0,0,0,0,0,0,0,0,0,0,−100 Ilya Pollak
  15. 15. Two key ideas •  Idea #1: –  Transform the data to create lots of zeros. For example, we could rasterize the image, compute the differences, and store the top left value along with the 24 differences [in reality, other transforms are used, but they work in a similar fashion]: –  255,0,0,0,0,0,0,0,0,0,−55,0,0,0,0,0,0,0,0,0,0,0,0,0,−100 –  This seems to make things worse: now the numbers can range from −255 to 255, and therefore we need two bytes per pixel! Ilya Pollak
  16. 16. Two key ideas •  Idea #1: –  Transform the data to create lots of zeros. For example, we could rasterize the image, compute the differences, and store the top left value along with the 24 differences [in reality, other transforms are used, but they work in a similar fashion]: –  255,0,0,0,0,0,0,0,0,0,−55,0,0,0,0,0,0,0,0,0,0,0,0,0,−100 –  This seems to make things worse: now the numbers can range from −255 to 255, and therefore we need two bytes per pixel! •  Idea #2: –  when encoding the data, spend fewer bits on frequently occurring numbers and more bits on rare numbers. Ilya Pollak
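To make Idea #1 concrete, here is a small Python sketch (not part of the slides; the pixel list is copied from slide 11) that rasterizes the 5x5 image and computes the differences:

pixels = [255]*10 + [200]*14 + [100]          # the 5x5 example image, rasterized row by row
diffs = [pixels[0]] + [pixels[i] - pixels[i - 1] for i in range(1, len(pixels))]
print(diffs)  # 255, then mostly zeros, with -55 and -100 at the two intensity changes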
  17. 17. Entropy coding Suppose we are encoding realizations of a discrete random variable X such that value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 Ilya Pollak
  18. 18. Entropy coding Suppose we are encoding realizations of a discrete random variable X such that value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 Consider the following fixed-length encoder: value of X 0 255 −55 −100 codeword 00 01 10 11 Ilya Pollak
  19. 19. Entropy coding Suppose we are encoding realizations of a discrete random variable X such that value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 Consider the following fixed-length encoder: value of X 0 255 −55 −100 codeword 00 01 10 11 For a file with 25 numbers, E[file size] = 25*2*(22/25+1/25+1/25+1/25) = 50 bits Ilya Pollak
  20. 20. Entropy coding Suppose we are encoding realizations of a discrete random variable X such that value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 Consider the following fixed-length encoder: value of X 0 255 −55 −100 codeword 00 01 10 11 For a file with 25 numbers, E[file size] = 25*2*(22/25+1/25+1/25+1/25) = 50 bits Now consider the following encoder: value of X 0 255 −55 −100 codeword 1 01 000 001 Ilya Pollak
  21. 21. Entropy coding Suppose we are encoding realizations of a discrete random variable X such that value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 Consider the following fixed-length encoder: value of X 0 255 −55 −100 codeword 00 01 10 11 For a file with 25 numbers, E[file size] = 25*2*(22/25+1/25+1/25+1/25) = 50 bits Now consider the following encoder: value of X 0 255 −55 −100 codeword 1 01 000 001 For a file with 25 numbers, E[file size] = 25(22/25 + 2/25 + 3/25 + 3/25) = 30 bits! Ilya Pollak
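As a quick sanity check of the two expected file sizes above, here is a Python sketch (variable names are mine):

probs = {0: 22/25, 255: 1/25, -55: 1/25, -100: 1/25}
fixed_lengths = {0: 2, 255: 2, -55: 2, -100: 2}      # codewords 00, 01, 10, 11
variable_lengths = {0: 1, 255: 2, -55: 3, -100: 3}   # codewords 1, 01, 000, 001
for name, lengths in [("fixed-length", fixed_lengths), ("variable-length", variable_lengths)]:
    bits = 25 * sum(probs[v] * lengths[v] for v in probs)   # expected size of a 25-number file
    print(name, bits)   # 50.0 and 30.0 bits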
  22. 22. Entropy coding •  A similar encoding scheme can be devised for a random variable of pixel differences which takes values between −255 and 255, to result in a smaller average file size than two bytes per pixel. Ilya Pollak
  23. 23. Entropy coding •  A similar encoding scheme can be devised for a random variable of pixel differences which takes values between −255 and 255, to result in a smaller average file size than two bytes per pixel. •  Another commonly used idea: run-length coding. I.e., instead of encoding each 0 individually, encode the length of each string of zeros. Ilya Pollak
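One simple way to realize the run-length idea, as a sketch (my own format, not any particular standard): store each nonzero value together with the number of zeros that precede it.

def run_length_zeros(values):
    # Return (list of (zeros_before, nonzero value) pairs, number of trailing zeros).
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs, run

diffs = [255] + [0]*9 + [-55] + [0]*13 + [-100]
print(run_length_zeros(diffs))   # ([(0, 255), (9, -55), (13, -100)], 0)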
  24. 24. Back to the four-symbol example value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 1 01 000 001 Can we do even better than 30 bits? Ilya Pollak
  25. 25. Back to the four-symbol example value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 1 01 000 001 Can we do even better than 30 bits? What about this alternative encoder? value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 0 01 1 10 Ilya Pollak
  26. 26. Back to the four-symbol example value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 1 01 000 001 Can we do even better than 30 bits? What about this alternative encoder? value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 0 01 1 10 E[file size] = 25(22/25 + 2/25 + 1/25+2/25) = 27 bits Ilya Pollak
  27. 27. Back to the four-symbol example value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 1 01 000 001 Can we do even better than 30 bits? What about this alternative encoder? value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 0 01 1 10 E[file size] = 25(22/25 + 2/25 + 1/25+2/25) = 27 bits Is there anything wrong with this encoder? Ilya Pollak
  28. 28. The second encoding is not uniquely decodable! value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 0 01 1 10 Encoded string ‘01’ could either be 255 or 0 followed by −55 Ilya Pollak
  29. 29. The second encoding is not uniquely decodable! value of X 0 255 −55 −100 probability 22/25 1/25 1/25 1/25 codeword 0 01 1 10 Encoded string ‘01’ could either be 255 or 0 followed by −55 Therefore, this code is unusable! It turns out that the first code is uniquely decodable. Ilya Pollak
  30. 30. What kinds of distributions are amenable to entropy coding? [Two histograms over symbols a, b, c, d: a highly peaked distribution, for which we can do a lot better than two bits per symbol, and a (nearly) uniform distribution, for which we cannot do better than two bits per symbol.] Ilya Pollak
  31. 31. What kinds of distributions are amenable to entropy coding? [Two histograms over symbols a, b, c, d: a highly peaked distribution, for which we can do a lot better than two bits per symbol, and a (nearly) uniform distribution, for which we cannot do better than two bits per symbol.] Conclusion: the transform procedure should be such that the numbers fed into the entropy coder have a highly concentrated histogram (a few very likely values, most values unlikely). Ilya Pollak
  32. 32. What kinds of distributions are amenable to entropy coding? [Two histograms over symbols a, b, c, d: a highly peaked distribution, for which we can do a lot better than two bits per symbol, and a (nearly) uniform distribution, for which we cannot do better than two bits per symbol.] Conclusion: the transform procedure should be such that the numbers fed into the entropy coder have a highly concentrated histogram (a few very likely values, most values unlikely). Also, if we are encoding each number individually, they should be independent or approximately independent. Ilya Pollak
  33. 33. What if we are willing to lose some information? 253 253 255 254 255 254 254 254 255 254 252 255 255 254 252 253 253 254 254 254 252 255 253 252 253 Ilya Pollak
  34. 34. What if we are willing to lose some information? 253 253 255 254 255 253.5 253.5 253.5 253.5 253.5 254 254 254 255 254 253.5 253.5 253.5 253.5 253.5 252 255 255 254 252 253.5 253.5 253.5 253.5 253.5 253 253 254 254 254 253.5 253.5 253.5 253.5 253.5 252 255 253 252 253 253.5 253.5 253.5 253.5 253.5 Quantization Ilya Pollak
  35. 35. Some eight-bit images The five stripes contain random values from (left to right): {252,253,254,255}, {188,189,190,191}, {125,126,127,128}, {61,62,63,64}, {0,1,2,3}. The five stripes contain random integers from (left to right): {240,…,255}, {176,…,191}, {113,…,128}, {49,…,64 }, {0,…,15}. Ilya Pollak
  36. 36. Converting continuous-valued to discrete-valued signals •  Many real-world signals are continuous-valued. –  audio signal a(t): both the time argument t and the intensity value a(t) are continuous; –  image u(x,y): both the spatial location (x,y) and the image intensity value u(x,y) are continuous; –  video v(x,y,t): x,y,t, and v(x,y,t) are all continuous. Ilya Pollak
  37. 37. Converting continuous-valued to discrete-valued signals •  Many real-world signals are continuous-valued. –  audio signal a(t): both the time argument t and the intensity value a(t) are continuous; –  image u(x,y): both the spatial location (x,y) and the image intensity value u(x,y) are continuous; –  video v(x,y,t): x,y,t, and v(x,y,t) are all continuous. •  Discretizing the argument values t, x, and y (or sampling) is studied in ECE 301, 438, and 440. Ilya Pollak
  38. 38. Converting continuous-valued to discrete-valued signals •  Many real-world signals are continuous-valued. –  audio signal a(t): both the time argument t and the intensity value a(t) are continuous; –  image u(x,y): both the spatial location (x,y) and the image intensity value u(x,y) are continuous; –  video v(x,y,t): x,y,t, and v(x,y,t) are all continuous. •  Discretizing the argument values t, x, and y (or sampling) is studied in ECE 301, 438, and 440. •  However, in addition to discretizing the argument values, the signal values must be discretized as well in order to be digitally stored. Ilya Pollak
  39. 39. Quantization •  Digitizing a continuous-valued signal into a discrete and finite set of values. •  Converting a discrete-valued signal into another discrete-valued signal, with fewer possible discrete values. Ilya Pollak
  40. 40. How to compare two quantizers? •  Suppose data X(1),…,X(N) is quantized using two quantizers, to result in Y1(1),…,Y1(N) and Y2(1),…,Y2(N). •  Suppose both Y1(1),…,Y1(N) and Y2(1),…,Y2(N) can be encoded with the same number of bits. •  Which quantization is better? •  The one that results in less distortion. But how to measure distortion? –  In general, measuring and modeling perceptual image similarity and similarity of audio are open research problems. –  Some useful things are known about human audio and visual systems that inform the design of quantizers. Ilya Pollak
  41. 41. Sensitivity of the Human Visual System to Contrast Changes, as a Function of Frequency Ilya Pollak
  42. 42. Sensitivity of the Human Visual System to Contrast Changes, as a Function of Frequency [From Mannos-Sakrison IEEE-IT 1974] Ilya Pollak
  43. 43. Sensitivity of the Human Visual System to Contrast Changes, as a Function of Frequency [From Mannos-Sakrison IEEE-IT 1974] High and low frequencies may be quantized more coarsely Ilya Pollak
  44. 44. But there are many other intricacies in the way the human visual system computes similarity… Ilya Pollak
  45. 45. Are these two images similar? Ilya Pollak
  46. 46. What about these two? Ilya Pollak
  47. 47. What about these two? •  Performance assessment of compression algorithms and quantizers is complicated, because measuring image fidelity is complicated. •  Often, very simple distortion measures are used such as mean-square error. Ilya Pollak
  48. 48. Scalar vs Vector Quantization [Two diagrams over a pair of values (r, s).] •  Scalar quantization: quantize each value separately; simple thresholding. •  Vector quantization: quantize several values jointly; more complex. Ilya Pollak
  49. 49. What kinds of joint distributions are amenable to scalar quantization? [Diagram: (r,s) jointly uniform over a square (green) region.] If (r,s) are jointly uniform over the green square (or, more generally, independent), knowing r does not tell us anything about s. Best thing to do: make quantization decisions independently. Ilya Pollak
  50. 50. What kinds of joint distributions are amenable to scalar quantization? [Diagrams: (r,s) jointly uniform over a square (green) region vs. over a yellow region.] If (r,s) are jointly uniform over the green square (or, more generally, independent), knowing r does not tell us anything about s. Best thing to do: make quantization decisions independently. If (r,s) are jointly uniform over the yellow region, knowing r tells us a lot about s. Best thing to do: make quantization decisions jointly. Ilya Pollak
  51. 51. What kinds of joint distributions are amenable to scalar quantization? [Diagrams: (r,s) jointly uniform over a square (green) region vs. over a yellow region.] If (r,s) are jointly uniform over the green square (or, more generally, independent), knowing r does not tell us anything about s. Best thing to do: make quantization decisions independently. If (r,s) are jointly uniform over the yellow region, knowing r tells us a lot about s. Best thing to do: make quantization decisions jointly. Conclusion: if the data is transformed before quantization, the transform procedure should be such that the coefficients fed into the quantizer are independent (or at least uncorrelated, or almost uncorrelated), in order to enable the simpler scalar quantization. Ilya Pollak
  52. 52. More on Scalar Quantization •  Does it make sense to do scalar quantization with different quantization bins for different variables? [Diagram: quantization grid over (r,s).] Ilya Pollak
  53. 53. More on Scalar Quantization •  Does it make sense to do scalar quantization with different quantization bins for different variables? –  No reason to do this if we are quantizing grayscale pixel values. [Diagram: quantization grid over (r,s).] Ilya Pollak
  54. 54. More on Scalar Quantization •  Does it make sense to do scalar quantization with different quantization bins for different variables? –  No reason to do this if we are quantizing grayscale pixel values. –  However, if we can decompose the image into components that are less perceptually important and more perceptually important, we should use larger quantization bins for the less important components. [Diagram: quantization grid over (r,s).] Ilya Pollak
  55. 55. Structure of a Typical Lossy Compression Algorithm for Audio, Images, or Video: data → transform → quantization → entropy coding → compressed bitstream Ilya Pollak
  56. 56. Structure of a Typical Lossy Compression Algorithm for Audio, Images, or Video: data → transform → quantization → entropy coding → compressed bitstream Let’s more closely consider quantization and entropy coding. (Various transforms are considered in ECE 301 and ECE 438.) Ilya Pollak
  57. 57. Quantization: problem statement Sequence of discrete or continuous random variables X(1),…,X(N) (e.g., transformed image pixel values). Source (e.g., image, video, speech signal) Ilya Pollak
  58. 58. Quantization: problem statement Sequence of discrete or continuous random variables X(1),…,X(N) (e.g., transformed image pixel values). Source (e.g., image, video, speech signal) Sequence of discrete random variables Y(1),…,Y(N), each distributed over a finite set of values (quantization levels) Quantizer Ilya Pollak
  59. 59. Quantization: problem statement Sequence of discrete or continuous random variables X(1),…,X(N) (e.g., transformed image pixel values). Source (e.g., image, video, speech signal) Sequence of discrete random variables Y(1),…,Y(N), each distributed over a finite set of values (quantization levels) Quantizer Errors: D(1),…,D(N) where D(n) = X(n) − Y(n) Ilya Pollak
  60. 60. MSE is a widely used measure of distortion of quantizers •  Suppose data X(1),…,X(N) are quantized, to result in Y(1),…,Y(N). E[ ∑_{n=1}^{N} (X(n) − Y(n))^2 ] = E[ ∑_{n=1}^{N} (D(n))^2 ]. If D(1),…, D(N) are identically distributed, this is the same as N·E[(D(n))^2], for any n. Ilya Pollak
  61. 61. Scalar uniform quantization •  Use quantization intervals (bins) of equal size [x1,x2), [x2,x3),…[xL,xL+1]. •  Quantization levels q1, q2,…, qL. •  Each quantization level is in the middle of the corresponding quantization bin: qk=(xk+xk+1)/2. Ilya Pollak
  62. 62. Scalar uniform quantization •  Use quantization intervals (bins) of equal size [x1,x2), [x2,x3),…[xL,xL+1]. •  Quantization levels q1, q2,…, qL. •  Each quantization level is in the middle of the corresponding quantization bin: qk=(xk+xk+1)/2. •  If quantizer input X is in [xk,xk+1), the corresponding quantized value is Y = qk. Ilya Pollak
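A minimal Python sketch of such a scalar uniform quantizer (my own helper; it assumes the input lies in [x1, xL+1]):

def uniform_quantize(x, x_min, x_max, L):
    # Split [x_min, x_max] into L equal bins and return the midpoint of the bin containing x.
    width = (x_max - x_min) / L
    k = min(int((x - x_min) // width), L - 1)   # clamp so that x = x_max falls in the last bin
    return x_min + (k + 0.5) * width            # q_k, the midpoint of bin k

print(uniform_quantize(100.0, 0.0, 256.0, L=4))   # bins of width 64, so 100 maps to 96.0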
  63. 63. Uniform vs non-uniform quantization •  Uniform quantization is not a good strategy for distributions which significantly differ from uniform. Ilya Pollak
  64. 64. Uniform vs non-uniform quantization •  Uniform quantization is not a good strategy for distributions which significantly differ from uniform. •  If the distribution is non-uniform, it is better to spend more quantization levels on more probable parts of the distribution and fewer quantization levels on less probable parts. Ilya Pollak
  65. 65. Scalar Lloyd-Max quantizer •  X = source random variable with a known distribution. We assume it to be a continuous r.v. with PDF fX(x)>0. Ilya Pollak
  66. 66. Scalar Lloyd-Max quantizer •  X = source random variable with a known distribution. We assume it to be a continuous r.v. with PDF fX(x)>0. –  The results can be extended to discrete or mixed random variables, and to continuous random variables whose density can be zero for some x. Ilya Pollak
  67. 67. Scalar Lloyd-Max quantizer •  X = source random variable with a known distribution. We assume it to be a continuous r.v. with PDF fX(x)>0. –  The results can be extended to discrete or mixed random variables, and to continuous random variables whose density can be zero for some x. •  Quantization intervals (x1,x2), [x2,x3), …, [xL,xL+1) and levels q1, …, qL such that –  x1 = −∞ –  xL+1 = ∞ –  −∞ < q1 < x2 ≤ q2 < x3 ≤ q3 < … ≤ qL−1 < xL ≤ qL < +∞, i.e., qk ∈ k-th quantization interval Ilya Pollak
  68. 68. Scalar Lloyd-Max quantizer •  X = source random variable with a known distribution. We assume it to be a continuous r.v. with PDF fX(x)>0. –  The results can be extended to discrete or mixed random variables, and to continuous random variables whose density can be zero for some x. •  Quantization intervals (x1,x2), [x2,x3), …, [xL,xL+1) and levels q1, …, qL such that –  x1 = −∞ –  xL+1 = ∞ –  −∞ < q1 < x2 ≤ q2 < x3 ≤ q3 < … ≤ qL−1 < xL ≤ qL < +∞, i.e., qk ∈ k-th quantization interval •  Y = the result of quantizing X, a discrete random variable with L possible outcomes, q1, q2,…, qL, defined by Y = Y(X) = q1 if X < x2; q2 if x2 ≤ X < x3; … ; qL−1 if xL−1 ≤ X < xL; qL if X ≥ xL. Ilya Pollak
  69. 69. Scalar Lloyd-Max quantizer: goal •  Given the pdf fX(x) of the source r.v. X and the desired number L of quantization levels, find the quantization interval endpoints x2,…,xL and quantization levels q1,…, qL to minimize the mean-square error, E[(Y−X)2]. Ilya Pollak
  70. 70. Scalar Lloyd-Max quantizer: goal •  Given the pdf fX(x) of the source r.v. X and the desired number L of quantization levels, find the quantization interval endpoints x2,…,xL and quantization levels q1,…, qL to minimize the mean-square error, E[(Y−X)2]. •  To do this, express the mean-square error in terms of the quantization interval endpoints and quantization levels, and find the minimum (or minima) through differentiation. Ilya Pollak
  71. 71. Scalar Lloyd-Max quantizer: derivation E[(Y − X)^2] = ∫_{−∞}^{∞} (y(x) − x)^2 fX(x) dx = ∑_{k=1}^{L} ∫_{xk}^{xk+1} (y(x) − x)^2 fX(x) dx = ∑_{k=1}^{L} ∫_{xk}^{xk+1} (qk − x)^2 fX(x) dx. Minimize w.r.t. qk: ∂/∂qk E[(Y − X)^2] = ∫_{xk}^{xk+1} 2(qk − x) fX(x) dx = 0, so qk ∫_{xk}^{xk+1} fX(x) dx = ∫_{xk}^{xk+1} x fX(x) dx, therefore qk = ∫_{xk}^{xk+1} x fX(x) dx / ∫_{xk}^{xk+1} fX(x) dx = E[X | X ∈ k-th quantization interval]. This is a minimum, since ∂^2/∂qk^2 E[(Y − X)^2] = ∫_{xk}^{xk+1} 2 fX(x) dx > 0. Ilya Pollak
  79. 79. Scalar Lloyd-Max quantizer: derivation Minimize w.r.t. xk, for k = 2,…, L: ∂/∂xk E[(Y − X)^2] = ∂/∂xk { ∫_{xk−1}^{xk} (qk−1 − x)^2 fX(x) dx + ∫_{xk}^{xk+1} (qk − x)^2 fX(x) dx } = (qk−1 − xk)^2 fX(xk) − (qk − xk)^2 fX(xk) = (qk−1 − qk)(qk−1 + qk − 2xk) fX(xk) = 0. By assumption, fX(x) ≠ 0 and qk−1 ≠ qk. Therefore, xk = (qk−1 + qk)/2, for k = 2,…, L. This is a minimum, since ∂^2/∂xk^2 E[(Y − X)^2] = 2(qk − qk−1) fX(xk) > 0. Ilya Pollak
  85. 85. Nonlinear system to be solved: qk = ∫_{xk}^{xk+1} x fX(x) dx / ∫_{xk}^{xk+1} fX(x) dx = E[X | X ∈ k-th quantization interval], for k = 1,…, L; xk = (qk−1 + qk)/2, for k = 2,…, L. •  Closed-form solution can be found only for very simple PDFs. –  E.g., if X is uniform, then Lloyd-Max quantizer = uniform quantizer. •  In general, an approximate solution can be found numerically, via an iterative algorithm (e.g., the lloyds command in Matlab). •  For real data, typically the PDF is not given and therefore needs to be estimated using, for example, histograms constructed from the observed data. Ilya Pollak
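For reference, here is a rough sketch of the classic alternating (Lloyd) iteration on empirical data in Python; it mirrors the idea behind Matlab's lloyds command but is my own simplified implementation, not that routine:

import numpy as np

def lloyd_max(samples, L, iters=100):
    # Alternate the two conditions: levels = conditional means of their bins,
    # interior endpoints = midpoints between adjacent levels.
    samples = np.asarray(samples, dtype=float)
    q = np.quantile(samples, (np.arange(L) + 0.5) / L)   # initial levels: data quantiles
    for _ in range(iters):
        x = (q[:-1] + q[1:]) / 2            # endpoints x2,...,xL
        bins = np.digitize(samples, x)      # bin index of each sample, 0..L-1
        for k in range(L):
            members = samples[bins == k]
            if members.size:                # keep the old level if a bin is empty
                q[k] = members.mean()       # conditional mean of the k-th bin
    return x, q

endpoints, levels = lloyd_max(np.random.default_rng(0).normal(size=10000), L=4)
print(endpoints, levels)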
  89. 89. Vector Lloyd-Max quantizer? X = ( X(1),…, X(N )) = source random vector with a given joint distribution. L = a desired number of quantization points. Ilya Pollak
  90. 90. Vector Lloyd-Max quantizer? X = ( X(1),…, X(N )) = source random vector with a given joint distribution. L = a desired number of quantization points. We would like to find: (1) L events A1 ,…, AL that partition the joint sample space of X(1),…, X(N ), and (2) L quantization points q1 ∈A1 ,…, q L ∈AL Ilya Pollak
  91. 91. Vector Lloyd-Max quantizer? X = (X(1),…, X(N)) = source random vector with a given joint distribution. L = a desired number of quantization points. We would like to find: (1) L events A1,…, AL that partition the joint sample space of X(1),…, X(N), and (2) L quantization points q1 ∈ A1,…, qL ∈ AL, such that the quantized random vector, defined by Y = qk if X ∈ Ak, for k = 1,…, L, minimizes the mean-square error, E[‖Y − X‖^2] = E[ ∑_{n=1}^{N} (Y(n) − X(n))^2 ]. Ilya Pollak
  92. 92. Vector Lloyd-Max quantizer? X = (X(1),…, X(N)) = source random vector with a given joint distribution. L = a desired number of quantization points. We would like to find: (1) L events A1,…, AL that partition the joint sample space of X(1),…, X(N), and (2) L quantization points q1 ∈ A1,…, qL ∈ AL, such that the quantized random vector, defined by Y = qk if X ∈ Ak, for k = 1,…, L, minimizes the mean-square error, E[‖Y − X‖^2] = E[ ∑_{n=1}^{N} (Y(n) − X(n))^2 ]. Difficulty: cannot differentiate with respect to a set Ak, and so unless the set of all allowed partitions is somehow restricted, this cannot be solved. Ilya Pollak
  93. 93. Hopefully, the prior discussion gives you some idea about the various issues involved in quantization. And now, on to entropy coding… data → transform → quantization → entropy coding → compressed bitstream Ilya Pollak
  94. 94. Problem statement Source (e.g., image, video, speech signal, or quantizer output) Sequence of discrete random variables X(1),…,X(N) (e.g., transformed image pixel values), assumed to be independent and identically distributed over a finite alphabet {a1,…,aM}. Ilya Pollak
  95. 95. Problem statement Source (e.g., image, video, speech signal, or quantizer output) Sequence of discrete random variables X(1),…,X(N) (e.g., transformed image pixel values), assumed to be independent and identically distributed over a finite Encoder: mapping alphabet {a1,…,aM}. between source symbols and binary strings (codewords) Binary string Requirements: •  minimize the expected length of the binary string; •  the binary string needs to be uniquely decodable, i.e., we need to be able to infer X(1),…,X(N) from it! Ilya Pollak
  96. 96. Problem statement Source (e.g., image, video, speech signal, or quantizer output) Sequence of discrete random variables X(1),…,X(N) (e.g., transformed image pixel values), assumed to be independent and identically distributed over a finite Encoder: mapping alphabet {a1,…,aM}. between source symbols and binary strings (codewords) Binary string •  Since X(1),…,X(N) are assumed independent in this model, we will encode each of them separately. •  Each can assume any value among {a1,…,aM}. •  Therefore, our code will consist of M codewords, one for each symbol a1,…,aM. symbol codeword a1 w1 … … aM wM Ilya Pollak
  97. 97. Unique Decodability symbol codeword a 0 b 1 c 00 d 01 •  How to decode the following string: 0001? •  It could be aaab or aad or acb or cab or cd. •  Not uniquely decodable! Ilya Pollak
  98. 98. A condition that ensures unique decodability •  Prefix condition: no codeword in the code is a prefix for any other codeword. Ilya Pollak
  99. 99. A condition that ensures unique decodability •  Prefix condition: no codeword in the code is a prefix for any other codeword. •  If the prefix condition is satisfied, then the code is uniquely decodable. –  Proof. Take a bit string W that corresponds to two different strings of symbols, A and B. If the first symbols in A and B are the same, discard them and the corresponding portion of W. Repeat until either there are no bits left in W (in this case A=B) or the first symbols in A and B are different. Then one of the codewords corresponding to these two symbols is a prefix for the other. Ilya Pollak
  100. 100. A condition that ensures unique decodability •  Prefix condition: no codeword in the code is a prefix for any other codeword. •  Visualizing binary strings. Form a binary tree where each branch is labeled 0 or 1. Each codeword w can be associated with the unique node of the tree such that the string of 0’s and 1’s on the path from the root to the node forms w. Ilya Pollak
  101. 101. A condition that ensures unique decodability •  Prefix condition: no codeword in the code is a prefix for any other codeword. •  Visualizing binary strings. Form a binary tree where each branch is labeled 0 or 1. Each codeword w can be associated with the unique node of the tree such that the string of 0’s and 1’s on the path from the root to the node forms w. •  The prefix condition holds if and only if all the codewords are leaves of the binary tree. Ilya Pollak
  102. 102. A condition that ensures unique decodability •  Prefix condition: no codeword in the code is a prefix for any other codeword. •  Visualizing binary strings. Form a binary tree where each branch is labeled 0 or 1. Each codeword w can be associated with the unique node of the tree such that the string of 0’s and 1’s on the path from the root to the node forms w. •  The prefix condition holds if and only if all the codewords are leaves of the binary tree, i.e., if no codeword is a descendant of another codeword. Ilya Pollak
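Checking the prefix condition for a small set of codewords is straightforward; a sketch (my own function name):

def satisfies_prefix_condition(codewords):
    # True iff no codeword is a prefix of another, i.e., all codewords are leaves of the code tree.
    return not any(v != w and v.startswith(w) for w in codewords for v in codewords)

print(satisfies_prefix_condition(["0", "1", "00", "01"]))     # False: "0" is a prefix of "00" and "01"
print(satisfies_prefix_condition(["1", "01", "000", "001"]))  # True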
  103. 103. Example: no prefix condition, no unique decodability, one word is not a leaf symbol codeword a 0 b 1 c 00 d 01 •  Codeword 0 is a prefix for both codeword 00 and codeword 01 Ilya Pollak
  104. 104. Example: no prefix condition, no unique decodability, one word is not a leaf symbol codeword: a → 0, b → 1, c → 00, d → 01. •  Codeword 0 is a prefix for both codeword 00 and codeword 01. [Binary tree: wa=0 and wb=1 sit at depth 1; wc=00 and wd=01 are descendants of wa, so wa is not a leaf.] Ilya Pollak
  107. 107. Example: prefix condition, all words are leaves symbol codeword: a → 1, b → 01, c → 000, d → 001. [Binary tree: wa=1, wb=01, wc=000, wd=001 are all leaves.] •  No path from the root to a codeword contains another codeword. This is equivalent to saying that the prefix condition holds. Ilya Pollak
  111. 111. Example: prefix condition, all words are leaves => unique decodability symbol codeword: a → 1, b → 01, c → 000, d → 001. Decoding: traverse the string left to right, tracing the corresponding path from the root of the binary tree. Each time a leaf is reached, output the corresponding symbol and go back to the root. Ilya Pollak
  112. 112. Example: prefix condition, all words are leaves => unique decodability How to decode the string 000001101? Trace the bits from the root of the code tree, returning to the root each time a leaf is reached: 000 → c, then 001 → d (output so far: cd), then 1 → a (cda), then 01 → b. Final output: cdab. Ilya Pollak
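The same decoding walk can be written in a few lines of Python; this sketch uses a plain dictionary instead of an explicit tree (the code table is the one from these slides):

def decode_prefix_code(bits, codebook):
    # codebook maps codeword string -> symbol; unique decoding relies on the prefix condition.
    out, current = [], ""
    for b in bits:
        current += b
        if current in codebook:        # reached a leaf: emit the symbol and return to the root
            out.append(codebook[current])
            current = ""
    assert current == "", "bit string ended in the middle of a codeword"
    return "".join(out)

codebook = {"1": "a", "01": "b", "000": "c", "001": "d"}
print(decode_prefix_code("000001101", codebook))   # cdab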
  127. 127. Prefix condition and unique decodability •  There are uniquely decodable codes which do not satisfy the prefix condition (e.g., {0, 01}). Ilya Pollak
  128. 128. Prefix condition and unique decodability •  There are uniquely decodable codes which do not satisfy the prefix condition (e.g., {0, 01}). For any such code, a prefix condition code can be constructed with an identical set of codeword lengths. (E.g., {0, 10} for {0, 01}.) Ilya Pollak
  129. 129. Prefix condition and unique decodability •  There are uniquely decodable codes which do not satisfy the prefix condition (e.g., {0, 01}). For any such code, a prefix condition code can be constructed with an identical set of codeword lengths. (E.g., {0, 10} for {0, 01}.) •  For this reason, we can consider just prefix condition codes. Ilya Pollak
  130. 130. Entropy coding •  Given a discrete random variable X with M possible outcomes (“symbols” or “letters”) a1,…,aM and with PMF pX, what is the lowest achievable expected codeword length among all the uniquely decodable codes? –  Answer depends on pX; Shannon’s source coding theorem provides bounds. •  How to construct a prefix condition code which achieves this expected codeword length? –  Answer: Huffman code. Ilya Pollak
  131. 131. Huffman code •  Consider a discrete r.v. X with M possible outcomes a1,…,aM and with PMF pX. Assume that pX(a1) ≤ … ≤ pX(aM). (If this condition is not satisfied, reorder the outcomes so that it is satisfied.) Ilya Pollak
  132. 132. Huffman code •  Consider a discrete r.v. X with M possible outcomes a1,…,aM and with PMF pX. Assume that pX(a1) ≤ … ≤ pX(aM). (If this condition is not satisfied, reorder the outcomes so that it is satisfied.) •  Consider “aggregate outcome” a12 = {a1,a2} and a discrete r.v. X’ such that X’ = a12 if X = a1 or X = a2, and X’ = X otherwise; its PMF is pX’(a12) = pX(a1) + pX(a2) and pX’(a) = pX(a) for a = a3,…, aM. •  Suppose we have a tree, T’, for an optimal prefix condition code for X’. A tree T for an optimal prefix condition code for X can be obtained from T’ by splitting the leaf a12 into two leaves corresponding to a1 and a2. •  We won’t prove this. Ilya Pollak
  136. 136. Example letter pX(letter): a1 0.10, a2 0.10, a3 0.25, a4 0.25, a5 0.30 Ilya Pollak
  137. 137. Example Step 1: combine the two least likely letters, a1 and a2, into a12. New alphabet and PMF pX’: a12 0.20, a3 0.25, a4 0.25, a5 0.30. [Tree for X so far: a12 with children a1 and a2; the tree for X’ is still to be constructed.] Ilya Pollak
  140. 140. Example Step 2: combine the two least likely letters from the new alphabet, a12 and a3, into a123. New alphabet and PMF pX’’: a123 0.45, a4 0.25, a5 0.30. [Tree for X so far: a123 with children a12 and a3; a12 has children a1 and a2.] Ilya Pollak
  144. 144. Example Step 3: again combine the two least likely letters, a4 and a5, into a45. New alphabet and PMF pX’’’: a123 0.45, a45 0.55. [Tree for X so far adds a45 with children a4 and a5.] Ilya Pollak
  148. 148. Example Step 4: combine the last two remaining letters, a123 and a45, into the root a12345. Done! The codeword for each leaf is the sequence of 0’s and 1’s along the path from the root to that leaf. Ilya Pollak
  150. 150. Example Resulting code (reading the branch labels off the tree): a1 → 111, a2 → 110, a3 → 10, a4 → 01, a5 → 00. Ilya Pollak
  155. 155. Example Expected codeword length: 3(0.1) + 3(0.1) + 2(0.25) + 2(0.25) + 2(0.3) = 2.2 bits. Ilya Pollak
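The merging procedure can be automated in a few lines; here is a Python sketch of the standard heap-based Huffman construction (my own variable names; the exact 0/1 labeling of branches depends on tie-breaking, but the codeword lengths, and hence the 2.2-bit expected length, come out the same):

import heapq

def huffman_code(pmf):
    # pmf: dict symbol -> probability. Repeatedly merge the two least likely subtrees.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # least likely subtree
        p2, _, c2 = heapq.heappop(heap)      # second least likely subtree
        merged = {s: "1" + w for s, w in c1.items()}
        merged.update({s: "0" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

pmf = {"a1": 0.10, "a2": 0.10, "a3": 0.25, "a4": 0.25, "a5": 0.30}
code = huffman_code(pmf)
print(code)                                            # e.g. a1: 111, a2: 110, a3: 10, a4: 01, a5: 00
print(sum(pmf[s] * len(w) for s, w in code.items()))   # expected codeword length: 2.2 bits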
  156. 156. Self-information •  Consider again a discrete random variable X with M possible outcomes a1,…,aM and with PMF pX. Ilya Pollak
  157. 157. Self-information •  Consider again a discrete random variable X with M possible outcomes a1,…,aM and with PMF pX. •  Self-information of outcome am is I(am) = −log2 pX(am) bits. Ilya Pollak
  158. 158. Self-information •  Consider again a discrete random variable X with M possible outcomes a1,…,aM and with PMF pX. •  Self-information of outcome am is I(am) = −log2 pX(am) bits. •  E.g., if pX(am) = 1, then I(am) = 0. The occurrence of am is not at all informative, since it had to occur. The smaller the probability of an outcome, the larger its self-information. Ilya Pollak
  159. 159. Self-information •  Consider again a discrete random variable X with M possible outcomes a1,…,aM and with PMF pX. •  Self-information of outcome am is I(am) = −log2 pX(am) bits. •  E.g., if pX(am) = 1, then I(am) = 0. The occurrence of am is not at all informative, since it had to occur. The smaller the probability of an outcome, the larger its self-information. •  Self-information of X is I(X) = −log2 pX(X) and is a random variable. Ilya Pollak
  160. 160. Self-information •  Consider again a discrete random variable X with M possible outcomes a1,…,aM and with PMF pX. •  Self-information of outcome am is I(am) = −log2 pX(am) bits. •  E.g., if pX(am) = 1, then I(am) = 0. The occurrence of am is not at all informative, since it had to occur. The smaller the probability of an outcome, the larger its self-information. •  Self-information of X is I(X) = −log2 pX(X) and is a random variable. •  Entropy of X is the expected value of its self-information: H(X) = E[I(X)] = −∑_{m=1}^{M} pX(am) log2 pX(am) Ilya Pollak
  161. 161. Source coding theorem (Shannon) For any uniquely decodable code, the expected codeword length is ≥ H (X). Moreover, there exists a prefix condition code for which the expected codeword length is < H (X) + 1. Ilya Pollak
  162. 162. Example •  Suppose that X has M = 2^K possible outcomes a1,…,aM. Ilya Pollak
  163. 163. Example •  Suppose that X has M = 2^K possible outcomes a1,…,aM. •  Suppose that X is uniform, i.e., pX(a1) = … = pX(aM) = 2^−K. Ilya Pollak
  164. 164. Example •  Suppose that X has M = 2^K possible outcomes a1,…,aM. •  Suppose that X is uniform, i.e., pX(a1) = … = pX(aM) = 2^−K. Then H(X) = E[I(X)] = −∑_{k=1}^{2^K} 2^−K log2(2^−K) = −2^K · 2^−K · (−K) = K. •  On the other hand, observe that there exist 2^K different K-bit sequences. Thus, a fixed-length code for X that uses all these 2^K K-bit sequences as codewords for all the 2^K outcomes of X will have expected codeword length of K. •  I.e., for this particular random variable, this fixed-length code achieves the entropy of X, which is the lower bound given by the source coding theorem. •  Therefore, the K-bit fixed-length code is optimal for this X. Ilya Pollak
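A quick numeric check of the theorem on the two examples in these slides (a sketch; the ≈2.19-bit entropy of the five-letter PMF is my own calculation):

import math

def entropy(pmf):
    # H(X) = -sum over outcomes of p(x) * log2 p(x), ignoring zero-probability outcomes
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

K = 3
print(entropy({m: 2**-K for m in range(2**K)}))   # uniform over 2**K outcomes: H(X) = K = 3.0

huffman_pmf = {"a1": 0.10, "a2": 0.10, "a3": 0.25, "a4": 0.25, "a5": 0.30}
print(entropy(huffman_pmf))   # about 2.19 bits, so H(X) <= 2.2 < H(X) + 1 for the Huffman code above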
  168. 168. Lemma 1: An auxiliary result helpful for proving the source coding theorem •  log2 α ≤ (α − 1) log2 e, for α > 0. •  Proof: differentiate g(α) = (α − 1) log2 e − log2 α and show that g(1) = 0 is its minimum. Ilya Pollak
  169. 169. Another auxiliary result: Kraft inequality If integers d1,…, dM satisfy the inequality ∑_{m=1}^{M} 2^(−dm) ≤ 1,  (1) then there exists a prefix condition code whose codeword lengths are these integers. Conversely, the codeword lengths of any prefix condition code satisfy this inequality. Ilya Pollak
  170. 170. Some useful facts about full binary trees A full binary tree of depth D has 2^D leaves. Ilya Pollak
  171. 171. Some useful facts about full binary trees [Figure: a full binary tree of depth D = 4.] A full binary tree of depth D has 2^D leaves. (Here, the depth is D = 4 and the number of leaves is 2^4 = 16.) Ilya Pollak
  172. 172. Some useful facts about full binary trees [Figure: a full binary tree of depth D = 4 with a marked (red) node at depth 2.] A full binary tree of depth D has 2^D leaves. (Here, the depth is D = 4 and the number of leaves is 2^4 = 16.) In a full binary tree of depth D, each node at depth d has 2^(D−d) leaf descendants. (Here, D = 4, the red node is at depth d = 2, and so it has 2^(4−2) = 4 leaf descendants.) Ilya Pollak
  173. 173. Kraft inequality: proof of ⇒ Suppose d1 ≤ … ≤ d M satisfy (1). Consider the full binary tree of depth d M , and consider all its nodes at depth d1 . Assign one of these nodes to symbol a1 . Ilya Pollak
  174. 174. Kraft inequality: proof of ⇒ Suppose d1 ≤ … ≤ d M satisfy (1). Consider the full binary tree of depth d M , and consider all its nodes at depth d1 . Assign one of these nodes to symbol a1 . Consider all the nodes at depth d2 which are not a1 and not descendants of a1 . Assign one of them to symbol a2 . Ilya Pollak
  175. 175. Kraft inequality: proof of ⇒ Suppose d1 ≤ … ≤ d M satisfy (1). Consider the full binary tree of depth d M , and consider all its nodes at depth d1 . Assign one of these nodes to symbol a1 . Consider all the nodes at depth d2 which are not a1 and not descendants of a1 . Assign one of them to symbol a2 . Iterate like this M times. Ilya Pollak
  176. 176. Kraft inequality: proof of ⇒ Suppose d1 ≤ … ≤ dM satisfy (1). Consider the full binary tree of depth dM, and consider all its nodes at depth d1. Assign one of these nodes to symbol a1. Consider all the nodes at depth d2 which are not a1 and not descendants of a1. Assign one of them to symbol a2. Iterate like this M times. If we have run out of tree nodes to assign after r < M iterations, it means that every leaf in the full binary tree of depth dM is a descendant of one of the first r symbols, a1,…, ar. But note that every node at depth dm has 2^(dM−dm) leaf descendants. Note also that the full tree has 2^dM leaves. Therefore, if every leaf in the tree is a descendant of a1,…, ar, then ∑_{m=1}^{r} 2^(dM−dm) = 2^dM ⇔ ∑_{m=1}^{r} 2^(−dm) = 1. Therefore, ∑_{m=1}^{M} 2^(−dm) = ∑_{m=1}^{r} 2^(−dm) + ∑_{m=r+1}^{M} 2^(−dm) > 1. This violates (1). Thus, our procedure can in fact go on for M iterations. After the M-th iteration, we will have constructed a prefix condition code with codeword lengths d1,…, dM. Ilya Pollak
181-185. Kraft inequality: proof of ⇐

Suppose $d_1 \le \dots \le d_M$, and suppose we have a prefix condition code with these codeword lengths. Consider the binary tree corresponding to this code. Complete this tree to obtain a full tree of depth $d_M$. Again use the following facts: the full tree has $2^{d_M}$ leaves, and the number of leaf descendants of the codeword of length $d_m$ is $2^{d_M - d_m}$. The combined number of all leaf descendants of all codewords must be less than or equal to the total number of leaves in the full tree:

$$\sum_{m=1}^{M} 2^{d_M - d_m} \le 2^{d_M} \quad\Longleftrightarrow\quad \sum_{m=1}^{M} 2^{-d_m} \le 1.$$

Ilya Pollak
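The leaf-counting bound is easy to check numerically for any concrete code. A small sketch (again mine; the codewords below are just an example):

```python
from itertools import combinations

def check_prefix_and_kraft(codewords):
    """Check the prefix condition and the leaf-counting/Kraft bound for binary codewords."""
    prefix_free = all(not a.startswith(b) and not b.startswith(a)
                      for a, b in combinations(codewords, 2))
    d_max = max(len(c) for c in codewords)
    leaf_descendants = sum(2 ** (d_max - len(c)) for c in codewords)   # must be <= 2**d_max
    kraft_sum = sum(2.0 ** -len(c) for c in codewords)                 # must be <= 1
    return prefix_free, leaf_descendants, 2 ** d_max, kraft_sum

print(check_prefix_and_kraft(['0', '10', '110', '111']))   # (True, 8, 8, 1.0)
```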
186-193. Source coding theorem: proof of H(X) ≤ E[C]

Let $d_m$ be the codeword length for $a_m$, and let the random variable $C$ be the codeword length for $X$. Then

$$H(X) - E[C] = -\sum_{m=1}^{M} p_X(a_m)\log_2 p_X(a_m) - \sum_{m=1}^{M} p_X(a_m)\,d_m$$
$$= \sum_{m=1}^{M} p_X(a_m)\left[\log_2\!\left(\frac{1}{p_X(a_m)}\right) - \log_2 2^{d_m}\right]$$
$$= \sum_{m=1}^{M} p_X(a_m)\log_2\!\left(\frac{1}{p_X(a_m)\,2^{d_m}}\right)$$
$$\le \sum_{m=1}^{M} p_X(a_m)\left(\frac{1}{p_X(a_m)\,2^{d_m}} - 1\right)\log_2 e \qquad \text{(by Lemma 1)}$$
$$= \left(\sum_{m=1}^{M} \frac{1}{2^{d_m}} - \sum_{m=1}^{M} p_X(a_m)\right)\log_2 e$$
$$= \left(\sum_{m=1}^{M} 2^{-d_m} - 1\right)\log_2 e \;\le\; 0.$$

The last inequality follows from the Kraft inequality, so the bound $H(X) \le E[C]$ holds for any prefix condition code. It is in fact also true for any uniquely decodable code.

Ilya Pollak
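For a concrete feel of the bound, the sketch below (mine; the pmf is a made-up example) computes H(X) and E[C] for a four-symbol source encoded with the prefix code 0, 10, 110, 111:

```python
import math

def entropy(p):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.15, 0.10]        # hypothetical pmf for a_1, ..., a_4
lengths = [1, 2, 3, 3]             # codeword lengths of the prefix code 0, 10, 110, 111
E_C = sum(pi * d for pi, d in zip(p, lengths))
print(entropy(p), E_C)             # H(X) ~ 1.743 <= E[C] = 1.75
```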
194-202. Source coding theorem: how to satisfy E[C] < H(X) + 1?

Choose $d_m = \lceil -\log_2 p_X(a_m) \rceil$ (where $\lceil x \rceil$ stands for the smallest integer which is ≥ $x$). Then

$$d_m \ge -\log_2 p_X(a_m) \;\Rightarrow\; -d_m \le \log_2 p_X(a_m) \;\Rightarrow\; 2^{-d_m} \le p_X(a_m) \;\Rightarrow\; \sum_{m=1}^{M} 2^{-d_m} \le \sum_{m=1}^{M} p_X(a_m) = 1.$$

Therefore, the Kraft inequality is satisfied, and we can construct a prefix condition code with codeword lengths $d_1, \dots, d_M$. Also, by construction,

$$d_m - 1 < -\log_2 p_X(a_m) \;\Rightarrow\; d_m < -\log_2 p_X(a_m) + 1$$
$$\Rightarrow\; p_X(a_m)\,d_m < -p_X(a_m)\log_2 p_X(a_m) + p_X(a_m)$$
$$\Rightarrow\; \sum_{m=1}^{M} p_X(a_m)\,d_m < \sum_{m=1}^{M} \bigl(-p_X(a_m)\log_2 p_X(a_m) + p_X(a_m)\bigr)$$
$$\Rightarrow\; E[C] < \sum_{m=1}^{M} \bigl(-p_X(a_m)\log_2 p_X(a_m)\bigr) + \sum_{m=1}^{M} p_X(a_m) = H(X) + 1.$$

Ilya Pollak
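This choice of lengths is easy to try out (the sketch and pmf are mine, not from the slides): compute $d_m = \lceil -\log_2 p_X(a_m)\rceil$, confirm that the Kraft inequality holds, and check that H(X) ≤ E[C] < H(X) + 1.

```python
import math

p = [0.5, 0.25, 0.15, 0.10]                     # hypothetical pmf
d = [math.ceil(-math.log2(pi)) for pi in p]     # d_m = ceil(-log2 p_X(a_m))
H = -sum(pi * math.log2(pi) for pi in p)
kraft_sum = sum(2.0 ** -dm for dm in d)
E_C = sum(pi * dm for pi, dm in zip(p, d))
print(d, kraft_sum, H, E_C)   # [1, 2, 3, 4], 0.9375 <= 1, and H ~ 1.743 <= E[C] = 1.85 < H + 1
```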
203-207. Note: Huffman code may often be very far from the entropy

•  Let X have two outcomes, a1 and a2, with probabilities $1 - 2^{-d}$ and $2^{-d}$, respectively.
•  Huffman code: 0 for a1; 1 for a2.
•  Expected codeword length: 1.
•  Entropy: $-(1 - 2^{-d})\log_2(1 - 2^{-d}) + d\,2^{-d} \approx 0$ for large d. For example, if d = 20, this is 0.0000204493.
•  Problem: no codeword can have a fractional number of bits!
•  If we have a source which produces independent random variables X1, X2, …, all identically distributed to X, a single Huffman code can be constructed for several of them at a time, effectively resulting in a fractional number of bits per random variable.

Ilya Pollak
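The numbers on this slide are easy to reproduce (my own sketch, with d = 20):

```python
import math

d = 20
p1, p2 = 1 - 2.0 ** -d, 2.0 ** -d
H = -(p1 * math.log2(p1) + p2 * math.log2(p2))   # entropy of X, about 2.04e-5 bits
E_C = 1.0                                        # Huffman code uses one bit per symbol
print(H, E_C)                                    # ~2.0449e-05 vs 1.0
```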
208-212. Example

•  (X1, X2) will have four outcomes, (a1,a1), (a1,a2), (a2,a1), (a2,a2), with probabilities $1 - 2^{-d+1} + 2^{-2d}$, $2^{-d} - 2^{-2d}$, $2^{-d} - 2^{-2d}$, and $2^{-2d}$, respectively.
•  Huffman code: 0 for (a1,a1); 10 for (a1,a2); 110 for (a2,a1); 111 for (a2,a2).
•  Expected codeword length per random variable:
  –  $\bigl[(1 - 2^{-d+1} + 2^{-2d}) + 2(2^{-d} - 2^{-2d}) + 3(2^{-d} - 2^{-2d}) + 3 \cdot 2^{-2d}\bigr]/2$
  –  This is 0.500001 for d = 20.
•  Can get arbitrarily close to entropy by encoding longer sequences of Xk’s.

Ilya Pollak
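Continuing the same sketch for pairs (the joint pmf and the codeword lengths 1, 2, 3, 3 are taken from this slide):

```python
d = 20
p1, p2 = 1 - 2.0 ** -d, 2.0 ** -d
pair_probs = [p1 * p1, p1 * p2, p2 * p1, p2 * p2]   # (a1,a1), (a1,a2), (a2,a1), (a2,a2)
pair_lengths = [1, 2, 3, 3]                         # Huffman code 0, 10, 110, 111
E_C_per_symbol = sum(p * l for p, l in zip(pair_probs, pair_lengths)) / 2
print(E_C_per_symbol)                               # ~0.500001 bits per random variable
```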
213. Source coding theorem for sequences of independent, identically distributed random variables

Suppose we are jointly encoding independent, identically distributed discrete random variables $X_1, \dots, X_N$, each taking values in $\{a_1, \dots, a_M\}$. For any uniquely decodable code, the expected codeword length per symbol is $\ge H(X_n)$. Moreover, there exists a prefix condition code for which the expected codeword length per symbol is $< H(X_n) + \frac{1}{N}$.

Ilya Pollak
214-219. Proof of the source coding theorem for iid sequences

Consider the random vector $\mathbf{X} = (X_1, \dots, X_N)$. The self-information of its outcome $\mathbf{x} = (x_1, \dots, x_N)$ is

$$I(\mathbf{x}) = -\log_2 p_{X_1,\dots,X_N}(x_1, \dots, x_N) = -\sum_{n=1}^{N} \log_2 p_{X_n}(x_n) = \sum_{n=1}^{N} I(x_n),$$

where the second equality uses independence. Therefore, since the $X_n$ are identically distributed, the entropy of $\mathbf{X}$ is

$$H(\mathbf{X}) = E\bigl[I(\mathbf{X})\bigr] = E\!\left[\sum_{n=1}^{N} I(X_n)\right] = \sum_{n=1}^{N} H(X_n) = N\,H(X_n).$$

Therefore, applying the single-symbol source coding theorem to $\mathbf{X}$, we have

$$H(\mathbf{X}) \le E[C_N] < H(\mathbf{X}) + 1, \qquad\text{i.e.,}\qquad N\,H(X_n) \le E[C_N] < N\,H(X_n) + 1,$$

and hence

$$H(X_n) \le E[C] < H(X_n) + \frac{1}{N},$$

where $E[C_N]$ is the expected codeword length for the optimal uniquely decodable code for $\mathbf{X}$, and $E[C] = E[C_N]/N$ is the corresponding expected codeword length per symbol.

Ilya Pollak
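One way to see the 1/N bound in action (my own sketch, with a made-up pmf): encode blocks of N iid symbols with the lengths $\lceil -\log_2 p(\text{block})\rceil$ from the earlier construction and compute the resulting expected length per symbol; it decreases toward $H(X_n)$ as N grows, and always stays below $H(X_n) + 1/N$.

```python
import math
from itertools import product

def per_symbol_length(p, N):
    """Expected codeword length per symbol when blocks of N iid symbols drawn from
    pmf p are encoded with codeword lengths ceil(-log2 p(block))."""
    total = 0.0
    for block in product(range(len(p)), repeat=N):
        pb = 1.0
        for i in block:
            pb *= p[i]               # probability of this block under the iid model
        total += pb * math.ceil(-math.log2(pb))
    return total / N

p = [0.9, 0.1]                                   # hypothetical source pmf
H = -sum(pi * math.log2(pi) for pi in p)         # about 0.469 bits
print(H, [round(per_symbol_length(p, N), 3) for N in (1, 2, 4, 8)])
# e.g. 0.469  [1.3, 0.8, 0.551, ...]; each value is below H + 1/N
```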
220. Arithmetic coding

•  Another form of entropy coding.
•  More amenable to coding long sequences of symbols than Huffman coding.
•  Can be used in conjunction with on-line learning of conditional probabilities to encode dependent sequences of symbols:
  –  Q-coder in JPEG (JPEG also has a Huffman coding option)
  –  QM-coder in JBIG
  –  MQ-coder in JPEG-2000
  –  CABAC coder in H.264/MPEG-4 AVC

Ilya Pollak
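The coders listed above are far more sophisticated (integer arithmetic, renormalization, adaptive probability models), but the core interval-narrowing idea can be sketched in a few lines. This is my own toy example, not any of the standardized coders; it uses floating point, so it is only reliable for short sequences and a fixed iid model.

```python
def arithmetic_encode(symbols, p):
    """Toy arithmetic encoder: returns a number in [0, 1) identifying the sequence.
    p maps each symbol to its probability (iid model assumed)."""
    low, high = 0.0, 1.0
    for s in symbols:
        width = high - low
        c = 0.0                      # cumulative probability of symbols before s
        for sym, ps in p.items():
            if sym == s:
                low, high = low + c * width, low + (c + ps) * width
                break
            c += ps
    return (low + high) / 2          # any number inside the final interval works

def arithmetic_decode(x, p, n):
    """Toy decoder: recovers n symbols from the number x produced by the encoder."""
    out, low, high = [], 0.0, 1.0
    for _ in range(n):
        width = high - low
        c = 0.0
        for sym, ps in p.items():
            if low + c * width <= x < low + (c + ps) * width:
                out.append(sym)
                low, high = low + c * width, low + (c + ps) * width
                break
            c += ps
    return out

p = {'a': 0.8, 'b': 0.2}             # hypothetical iid source
msg = list('aababaaa')
x = arithmetic_encode(msg, p)
print(x, arithmetic_decode(x, p, len(msg)) == msg)   # True
```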
