Proof of Kraft-McMillan theorem*
Nguyen Hung Vu (email: vuhung@fedu.uec.ac.jp)
February 24, 2004

* Introduction to Data Compression, Khalid Sayood.

1 Kraft-McMillan theorem

Let C be a code with N codewords of lengths l_1, l_2, ..., l_N. If C is uniquely decodable, then

    K(C) = \sum_{i=1}^{N} 2^{-l_i} \le 1.                                   (1)

2 Proof

The proof works by looking at the nth power of K(C). If K(C) is greater than one, then K(C)^n grows exponentially with n. If it does not grow exponentially with n, this proves that \sum_{i=1}^{N} 2^{-l_i} \le 1.

Let n be an arbitrary integer. Then

    \left[ \sum_{i=1}^{N} 2^{-l_i} \right]^n
      = \left( \sum_{i_1=1}^{N} 2^{-l_{i_1}} \right) \left( \sum_{i_2=1}^{N} 2^{-l_{i_2}} \right) \cdots \left( \sum_{i_n=1}^{N} 2^{-l_{i_n}} \right)
      = \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \cdots \sum_{i_n=1}^{N} 2^{-(l_{i_1} + l_{i_2} + \cdots + l_{i_n})}.        (2)

The exponent l_{i_1} + l_{i_2} + \cdots + l_{i_n} is simply the length of n codewords from the code C. The smallest value this exponent can take is greater than or equal to n, which would be the case if all codewords were 1 bit long. If

    l = \max\{l_1, l_2, ..., l_N\},

then the largest value the exponent can take is less than or equal to nl. Therefore, we can write this summation as

    K(C)^n = \sum_{k=n}^{nl} A_k 2^{-k},

where A_k is the number of combinations of n codewords that have a combined length of k. Let us look at the size of this coefficient. The number of possible distinct binary sequences of length k is 2^k. If this code is uniquely decodable, then each sequence can represent one and only one sequence of codewords. Therefore, the number of possible combinations of codewords whose combined length is k cannot be greater than 2^k. In other words,

    A_k \le 2^k.

This means that

    K(C)^n = \sum_{k=n}^{nl} A_k 2^{-k} \le \sum_{k=n}^{nl} 2^k 2^{-k} = nl - n + 1 = n(l - 1) + 1.      (3)

But if K(C) is greater than one, K(C)^n grows exponentially with n, while n(l - 1) + 1 can only grow linearly. So if K(C) is greater than one, we can always find an n large enough that inequality (3) is violated. Therefore, for a uniquely decodable code C, K(C) is less than or equal to one.

This part of the Kraft-McMillan inequality provides a necessary condition for uniquely decodable codes: if a code is uniquely decodable, its codeword lengths have to satisfy the inequality.

3 Construction of prefix code

Given a set of integers l_1, l_2, ..., l_N that satisfy the inequality

    \sum_{i=1}^{N} 2^{-l_i} \le 1,                                          (4)

we can always find a prefix code with codeword lengths l_1, l_2, ..., l_N.

4 Proof of the prefix code construction theorem

We prove this assertion by developing a procedure for constructing a prefix code with codeword lengths l_1, l_2, ..., l_N that satisfy the given inequality. Without loss of generality, we can assume that

    l_1 \le l_2 \le \cdots \le l_N.                                         (5)

Define a sequence of numbers w_1, w_2, ..., w_N as follows:

    w_1 = 0,
    w_j = \sum_{i=1}^{j-1} 2^{l_j - l_i},   j > 1.

The binary representation of w_j for j > 1 takes up \lceil \log_2 w_j \rceil bits. We will use this binary representation to construct a prefix code. We first note that the number of bits in the binary representation of w_j is less than or equal to l_j. This is obviously true for w_1. For j > 1,

    \lceil \log_2 w_j \rceil = \left\lceil \log_2 \sum_{i=1}^{j-1} 2^{l_j - l_i} \right\rceil
      = \left\lceil \log_2 \left( 2^{l_j} \sum_{i=1}^{j-1} 2^{-l_i} \right) \right\rceil
      = l_j + \left\lceil \log_2 \sum_{i=1}^{j-1} 2^{-l_i} \right\rceil
      \le l_j.

The last inequality follows from the hypothesis of the theorem that \sum_{i=1}^{N} 2^{-l_i} \le 1, which implies that \sum_{i=1}^{j-1} 2^{-l_i} \le 1. As the logarithm of a number less than one is negative, l_j + \lceil \log_2 \sum_{i=1}^{j-1} 2^{-l_i} \rceil has to be less than or equal to l_j.

Using the binary representation of w_j, we can devise a binary code in the following manner: if \lceil \log_2 w_j \rceil = l_j, then the jth codeword c_j is the binary representation of w_j; if \lceil \log_2 w_j \rceil < l_j, then c_j is the binary representation of w_j, with l_j - \lceil \log_2 w_j \rceil zeros appended to the right. This is certainly a code, but is it a prefix code? If we can show that the code C = {c_1, c_2, ..., c_N} is a prefix code, then we will have proved the theorem by construction.

Suppose our claim is not true. Then for some j < k, c_j is a prefix of c_k. This means that the l_j most significant bits of w_k form the binary representation of w_j. Therefore, if we right-shift the binary representation of w_k by l_k - l_j bits, we should get the binary representation of w_j. We can write this as

    w_j = \frac{w_k}{2^{l_k - l_j}}.

However,

    w_k = \sum_{i=1}^{k-1} 2^{l_k - l_i}.

Therefore,

    \frac{w_k}{2^{l_k - l_j}} = \sum_{i=1}^{k-1} 2^{l_j - l_i}
      = w_j + \sum_{i=j}^{k-1} 2^{l_j - l_i}
      = w_j + 2^0 + \sum_{i=j+1}^{k-1} 2^{l_j - l_i}
      \ge w_j + 1.                                                          (6)

That is, the smallest value w_k / 2^{l_k - l_j} can take is w_j + 1. This contradicts the requirement for c_j being a prefix of c_k. Therefore, c_j cannot be a prefix of c_k. As j and k were arbitrary, no codeword is a prefix of another codeword, and the code C is a prefix code.
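The construction in Section 4 is mechanical enough to check numerically. The sketch below is my own illustration, not part of the original text, and the function names are arbitrary: it verifies the Kraft inequality, computes each w_j, and takes c_j to be w_j written in exactly l_j bits, which is one concrete reading of the padding step described above.

```python
def kraft_sum(lengths):
    """K(C) = sum_i 2^{-l_i}."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Construct a prefix code with the given codeword lengths,
    assumed sorted so that l_1 <= l_2 <= ... <= l_N (Section 4)."""
    assert list(lengths) == sorted(lengths), "lengths must be non-decreasing"
    assert kraft_sum(lengths) <= 1.0, "lengths violate the Kraft inequality"
    codewords = []
    for j, lj in enumerate(lengths):
        # w_j = sum_{i<j} 2^{l_j - l_i}  (w_1 = 0, since the sum is empty)
        wj = sum(2 ** (lj - li) for li in lengths[:j])
        # c_j: the value w_j written in exactly l_j bits
        codewords.append(format(wj, "b").zfill(lj))
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
print(prefix_code_from_lengths([2, 2, 2, 2]))   # ['00', '01', '10', '11']
```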
5 Projects and Problems

1. Suppose X is a random variable that takes on values from an M-letter alphabet. Show that 0 \le H(X) \le \log_2 M.
   Hints: H(X) = -\sum_{i=1}^{n} P(\alpha_i) \log P(\alpha_i), where the alphabet consists of the distinct values \alpha_1, ..., \alpha_n occurring m_1, ..., m_n times, \alpha_i \ne \alpha_j for i \ne j, and \sum_{i=1}^{n} m_i = M. Then
       H(X) = -\sum_{i=1}^{n} \frac{m_i}{M} \log \frac{m_i}{M}
            = -\frac{1}{M} \sum_{i=1}^{n} m_i (\log m_i - \log M)
            = -\frac{1}{M} \sum_{i=1}^{n} m_i \log m_i + \frac{\log M}{M} \sum_{i=1}^{n} m_i
            = -\frac{1}{M} \sum_{i=1}^{n} m_i \log m_i + \log M.

2. Show that for the case where the elements of an observed sequence are iid (independent and identically distributed), the entropy is equal to the first-order entropy.
   Hints: The entropy is defined by H(S) = \lim_{n \to \infty} \frac{1}{n} G_n, where
       G_n = -\sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_n=1}^{m} P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n) \log P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n),
   and the first-order entropy is defined by H_1(S) = -\sum P(X_1) \log P(X_1). In the iid case, G_n = -n \sum_{i_1=1}^{m} P(X_1 = i_1) \log P(X_1 = i_1). Prove it!

3. Given an alphabet A = {a_1, a_2, a_3, a_4}, find the first-order entropy in the following cases:
   (a) P(a_1) = P(a_2) = P(a_3) = P(a_4) = 1/4.
   (b) P(a_1) = 1/2, P(a_2) = 1/4, P(a_3) = P(a_4) = 1/8.
   (c) P(a_1) = 0.505, P(a_2) = 1/4, P(a_3) = 1/8, P(a_4) = 0.12.

4. Suppose we have a source with a probability model P = {p_0, p_1, ..., p_m} and entropy H_P. Suppose we have another source with probability model Q = {q_0, q_1, ..., q_m} and entropy H_Q, where
       q_i = p_i,   i = 0, 1, ..., j-2, j+1, ..., m,
   and
       q_j = q_{j-1} = \frac{p_j + p_{j-1}}{2}.
   How is H_Q related to H_P (greater, equal, or less)? Prove your answer.
   Hints:
       H_Q - H_P = \sum p_i \log p_i - \sum q_i \log q_i
                 = p_{j-1} \log p_{j-1} + p_j \log p_j - 2 \cdot \frac{p_{j-1} + p_j}{2} \log \frac{p_{j-1} + p_j}{2}
                 = \phi(p_{j-1}) + \phi(p_j) - 2 \phi\left( \frac{p_{j-1} + p_j}{2} \right),
   where \phi(x) = x \log x and \phi''(x) = 1/x > 0 for all x > 0.

5. There are several image and speech files among the accompanying data sets.
   (a) Write a program to compute the first-order entropy of some of the image and speech files (a minimal sketch follows this problem list).
   (b) Pick one of the image files and compute its second-order entropy. Comment on the difference between the first-order and second-order entropies.
   (c) Compute the entropy of the difference between neighboring pixels for the image you used in part (b). Comment on what you discovered.

6. Conduct an experiment to see how well a model can describe a source.
   (a) Write a program that randomly selects letters from the 26-letter alphabet {a, b, c, ..., z} and forms four-letter words. Form 100 such words and see how many of these words make sense.
   (b) Among the accompanying data sets is a file called 4letter.words, which contains a list of four-letter words. Using this file, obtain a probability model for the alphabet. Now repeat part (a), generating the words using this probability model. To pick letters according to a probability model, construct the cumulative distribution function (cdf) F_X(x) = P(X \le x). Using a uniform pseudorandom number generator to generate a value r, where 0 \le r < 1, pick the letter x_k if F_X(x_k - 1) \le r < F_X(x_k). Compare your results with those of part (a).
   (c) Repeat (b) using a single-letter context.
   (d) Repeat (b) using a two-letter context.

7. Determine whether the following codes are uniquely decodable:
   (a) {0, 01, 11, 111}: not uniquely decodable (01 11 = 0 111).
   (b) {0, 01, 110, 111}: not uniquely decodable (01 110 = 0 111 0).
   (c) {0, 10, 110, 111}: yes. Prove it!
   (d) {1, 10, 110, 111}: not uniquely decodable (1 10 = 110).

8. Using a text file, compute the probability p_i of each letter.
   (a) Assume that we need a codeword of length \lceil \log_2 \frac{1}{p_i} \rceil to encode letter i. Determine the number of bits needed to encode the file.
   (b) Compute the conditional probability P(i|j) of a letter i given that the previous letter is j. Assume that we need \lceil \log_2 \frac{1}{P(i|j)} \rceil bits to represent a letter i that follows a letter j. Determine the number of bits needed to encode the file.
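Problem 5(a) asks for a program to compute first-order entropy. A minimal sketch, assuming the file is read as raw bytes and P(a_i) is estimated by relative frequency; the filename below is only a placeholder:

```python
import math
from collections import Counter

def first_order_entropy(data: bytes) -> float:
    """H_1 = -sum_i p_i log2 p_i, with p_i estimated from byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example usage with a placeholder filename (any image or speech file would do):
# with open("image.raw", "rb") as f:
#     print(first_order_entropy(f.read()), "bits/symbol")
print(round(first_order_entropy(b"aab"), 3), "bits/symbol")   # 0.918
```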
6 Huffman coding

6.1 Definition

A minimal variable-length character encoding based on the frequency of each character. First, each character becomes a trivial tree, with the character as the only node; the character's frequency is the tree's frequency. The two trees with the least frequencies are joined under a new root, which is assigned the sum of their frequencies. This is repeated until all characters are in one tree. One code bit represents each level, so more frequent characters are near the root and are encoded with few bits, while rare characters are far from the root and are encoded with many bits.

6.2 Example of Huffman encoding design

Consider the alphabet A = {a_1, a_2, a_3, a_4, a_5} with P(a_1) = P(a_3) = 0.2, P(a_2) = 0.4, and P(a_4) = P(a_5) = 0.1. The entropy of this source is 2.122 bits/symbol. To design the Huffman code, we first sort the letters in descending probability order, as shown in the table below. Here c(a_i) denotes the codeword for a_i.

    The initial five-letter alphabet
    Letter   Probability   Codeword
    a_2      0.4           c(a_2)
    a_1      0.2           c(a_1)
    a_3      0.2           c(a_3)
    a_4      0.1           c(a_4)
    a_5      0.1           c(a_5)

The two least probable letters are a_4 and a_5, so we assign

    c(a_4) = \alpha_1 * 0
    c(a_5) = \alpha_1 * 1,

where \alpha_1 is a binary string and * denotes concatenation.

Now we define a new alphabet A' with four letters a_1, a_2, a_3, a_4', where a_4' is composed of a_4 and a_5 and has probability P(a_4') = P(a_4) + P(a_5) = 0.2. We sort this new alphabet in descending order to obtain the following table.

    The reduced four-letter alphabet
    Letter   Probability   Codeword
    a_2      0.4           c(a_2)
    a_1      0.2           c(a_1)
    a_3      0.2           c(a_3)
    a_4'     0.2           \alpha_1

In this alphabet, a_3 and a_4' are the two letters at the bottom of the sorted list. We assign their codewords as

    c(a_3)  = \alpha_2 * 0
    c(a_4') = \alpha_2 * 1,

but c(a_4') = \alpha_1. Therefore

    \alpha_1 = \alpha_2 * 1,

which means that

    c(a_4) = \alpha_2 * 10
    c(a_5) = \alpha_2 * 11.

At this stage we again define a new alphabet A'' that consists of three letters a_1, a_2, a_3', where a_3' is composed of a_3 and a_4' and has probability P(a_3') = P(a_3) + P(a_4') = 0.4. We sort this new alphabet in descending order to obtain the following table.

    The reduced three-letter alphabet
    Letter   Probability   Codeword
    a_2      0.4           c(a_2)
    a_3'     0.4           \alpha_2
    a_1      0.2           c(a_1)

In this case, the two least probable symbols are a_1 and a_3'. Therefore,

    c(a_3') = \alpha_3 * 0
    c(a_1)  = \alpha_3 * 1.

But c(a_3') = \alpha_2. Therefore,

    \alpha_2 = \alpha_3 * 0,

which means that

    c(a_3) = \alpha_3 * 00
    c(a_4) = \alpha_3 * 010
    c(a_5) = \alpha_3 * 011.

Again we define a new alphabet, this time with only two letters a_3'' and a_2. Here a_3'' is composed of the letters a_3' and a_1 and has probability P(a_3'') = P(a_3') + P(a_1) = 0.6. We now have the following table.

    The reduced two-letter alphabet
    Letter   Probability   Codeword
    a_3''    0.6           \alpha_3
    a_2      0.4           c(a_2)

As we have only two letters, the codeword assignment is straightforward:

    c(a_3'') = 0
    c(a_2)   = 1,

which means that \alpha_3 = 0, which in turn means that

    c(a_1) = 01
    c(a_3) = 000
    c(a_4) = 0010
    c(a_5) = 0011,

and the final Huffman code is given in the following table.

    The Huffman code for the five-letter alphabet
    Letter   Probability   Codeword
    a_2      0.4           1
    a_1      0.2           01
    a_3      0.2           000
    a_4      0.1           0010
    a_5      0.1           0011
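The merging procedure of Section 6.2 can be mechanized with a small heap-based routine. The sketch below is my own illustration rather than code from the text; because several probabilities tie at 0.2, it may settle on a different but equally optimal assignment than the one derived above, with the same average length of 2.2 bits/symbol.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} model by
    repeatedly merging the two least probable trees and prepending a bit
    to every codeword inside each merged subtree."""
    # Heap items are (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)          # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, least = heapq.heappop(heap)      # least probable tree
        p2, _, second = heapq.heappop(heap)     # second least probable tree
        merged = {s: "0" + c for s, c in second.items()}
        merged.update({s: "1" + c for s, c in least.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
print(code)
avg = sum(p * len(code[s]) for s, p in probs.items())
print("average length:", round(avg, 3), "bits/symbol")   # 2.2
```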
6.3 Huffman code is optimal

The necessary conditions for an optimal variable-length binary code are as follows:

1. Condition 1: Given any two letters a_j and a_k, if P(a_j) \ge P(a_k), then l_j \le l_k, where l_j is the number of bits in the codeword for a_j.
2. Condition 2: The two least probable letters have codewords with the same maximum length l_m.
3. Condition 3: In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node.
4. Condition 4: Suppose we change an intermediate node into a leaf node by combining all the leaves descending from it into a composite word of a reduced alphabet. Then, if the original tree was optimal for the original alphabet, the reduced tree is optimal for the reduced alphabet.

The Huffman code satisfies all of the conditions above and is therefore optimal.

6.4 Average codeword length

For a source S with alphabet A = {a_1, a_2, ..., a_K} and probability model {P(a_1), P(a_2), ..., P(a_K)}, the average codeword length is given by

    \bar{l} = \sum_{i=1}^{K} P(a_i) l_i.

The difference between the entropy of the source H(S) and the average length is

    H(S) - \bar{l} = -\sum_{i=1}^{K} P(a_i) \log_2 P(a_i) - \sum_{i=1}^{K} P(a_i) l_i
                   = \sum P(a_i) \left[ \log_2 \frac{1}{P(a_i)} - l_i \right]
                   = \sum P(a_i) \left[ \log_2 \frac{1}{P(a_i)} - \log_2 2^{l_i} \right]
                   = \sum P(a_i) \log_2 \frac{2^{-l_i}}{P(a_i)}
                   \le \log_2 \left[ \sum 2^{-l_i} \right].

6.5 Length of Huffman code

It has been proven that

    H(S) \le \bar{l} < H(S) + 1,                                            (7)

where H(S) is the entropy of the source S.

6.6 Extended Huffman code

Consider an alphabet A = {a_1, a_2, ..., a_m}. Each element of the sequence is generated independently of the other elements in the sequence. The entropy of this source is given by

    H(S) = -\sum_{i=1}^{m} P(a_i) \log_2 P(a_i).

We know that we can generate a Huffman code for this source with rate R such that

    H(S) \le R < H(S) + 1.                                                  (8)

We have used the looser bound here; the same argument can be made with the tighter bound. Notice that we have used "rate R" to denote the number of bits per symbol. This is a standard convention in the data compression literature. However, in the communication literature, the word "rate" often refers to the number of bits per second.

Suppose we now encode the sequence by generating one codeword for every n symbols. As there are m^n combinations of n symbols, we will need m^n codewords in our Huffman code. We could generate this code by viewing the m^n symbols as letters of an extended alphabet

    A^{(n)} = {a_1 a_1 \cdots a_1 (n times), a_1 a_1 \cdots a_2, ..., a_1 a_1 \cdots a_m, a_1 a_1 \cdots a_2 a_1, ..., a_m a_m \cdots a_m}

from a source S^{(n)}. Denote the rate for the new source by R^{(n)}. We know that

    H(S^{(n)}) \le R^{(n)} < H(S^{(n)}) + 1.                                (9)

Since each codeword of the extended code covers n original symbols,

    R = \frac{1}{n} R^{(n)}

and

    \frac{H(S^{(n)})}{n} \le R < \frac{H(S^{(n)})}{n} + \frac{1}{n}.

It can be proven that

    H(S^{(n)}) = n H(S),

and therefore

    H(S) \le R < H(S) + \frac{1}{n}.

6.6.1 Example of extended Huffman code

Let A = {a_1, a_2, a_3} with probability model P(a_1) = 0.8, P(a_2) = 0.02, and P(a_3) = 0.18. The entropy of this source is 0.816 bits per symbol. A Huffman code for this source is

    a_1   0
    a_2   11
    a_3   10

and the extended (n = 2) code is

    Pair      Probability   Codeword
    a_1 a_1   0.64          0
    a_1 a_2   0.016         10101
    a_1 a_3   0.144         11
    a_2 a_1   0.016         101000
    a_2 a_2   0.0004        10100101
    a_2 a_3   0.0036        1010011
    a_3 a_1   0.144         100
    a_3 a_2   0.0036        10100100
    a_3 a_3   0.0324        1011
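The numbers in this example can be checked directly from the tables above. The sketch below is my own; it computes the entropy, the rate of the single-letter code, and the per-symbol rate of the two-letter extended code, illustrating how blocking symbols moves the rate toward H(S) in line with H(S) <= R < H(S) + 1/n.

```python
import math

# Probabilities and codewords copied from the example above.
p = {"a1": 0.8, "a2": 0.02, "a3": 0.18}
single = {"a1": "0", "a2": "11", "a3": "10"}
extended = {
    "a1a1": "0",      "a1a2": "10101",    "a1a3": "11",
    "a2a1": "101000", "a2a2": "10100101", "a2a3": "1010011",
    "a3a1": "100",    "a3a2": "10100100", "a3a3": "1011",
}

entropy = -sum(q * math.log2(q) for q in p.values())      # about 0.816 bits/symbol
rate1 = sum(p[s] * len(c) for s, c in single.items())     # bits/symbol for n = 1
rate2 = sum(p[x[:2]] * p[x[2:]] * len(c)                  # average bits per pair ...
            for x, c in extended.items()) / 2             # ... halved: bits/symbol for n = 2

print(f"H(S) = {entropy:.3f}, R(n=1) = {rate1:.3f}, R(n=2) = {rate2:.4f}")
```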
6.7 Nonbinary Huffman codes

6.8 Adaptive Huffman coding

7 Arithmetic coding

Use the mapping

    X(a_i) = i,   a_i \in A,                                                (10)

where A = {a_1, a_2, ..., a_m} is the alphabet for a discrete source and X is a random variable. This mapping means that, given a probability model P for the source, we also have a probability density function for the random variable,

    P(X = i) = P(a_i),

and the cumulative distribution function can be defined as

    F_X(i) = \sum_{k=1}^{i} P(X = k).

Define \bar{T}_X(a_i) as

    \bar{T}_X(a_i) = \sum_{k=1}^{i-1} P(X = k) + \frac{1}{2} P(X = i)       (11)
                   = F_X(i - 1) + \frac{1}{2} P(X = i).                     (12)

For each a_i, \bar{T}_X(a_i) has a unique value. This value can be used as a unique tag for a_i. In general,

    \bar{T}_X^{(m)}(x_i) = \sum_{y < x_i} P(y) + \frac{1}{2} P(x_i).        (13)
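As a quick illustration of equations (11) and (12), the sketch below (my own, using a hypothetical three-letter model) computes F_X and the tag \bar{T}_X(a_i) as the midpoint of each letter's interval [F_X(i-1), F_X(i)); distinct letters get distinct tags.

```python
from itertools import accumulate

def tags(probs):
    """Return {a_i: (F_X(i-1), F_X(i), tag)} where the tag is
    T_X(a_i) = F_X(i-1) + P(a_i)/2, the midpoint of the letter's interval."""
    cdf = [0.0] + list(accumulate(probs.values()))   # F_X(0)=0, F_X(1), ..., F_X(m)
    return {a: (lo, hi, lo + (hi - lo) / 2)
            for a, lo, hi in zip(probs, cdf, cdf[1:])}

# Hypothetical three-letter probability model:
for a, (lo, hi, t) in tags({"a1": 0.7, "a2": 0.1, "a3": 0.2}).items():
    print(f"{a}: interval [{lo:.2f}, {hi:.2f}), tag = {t:.2f}")
```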
