Huffman Encoding
  1. Huffman Encoding (Βαγγέλης Δούρος, EY0619)
  2. Text Compression: on a computer, compression means changing the representation of a file so that it takes less space to store and/or less time to transmit, while the original file can be reconstructed exactly from the compressed representation. This differs from data compression in general: text compression has to be lossless, whereas for sound and images small changes and noise are tolerated.
  3. First Approach: consider the word ABRACADABRA. What is the most economical way to write this string in a binary representation? Generally speaking, if a text consists of N different characters, we need ⌈log₂ N⌉ bits to represent each one using a fixed-length encoding. Thus it would require 3 bits for each of the 5 distinct letters, or 33 bits for the 11 letters. Can we do better?
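The slide's arithmetic can be checked with a few lines of Python (a sketch; the function name is my own):

```python
import math

def fixed_length_bits(text: str) -> int:
    """Total bits to store `text` when every distinct character gets
    the same ceil(log2 N)-bit codeword."""
    n = len(set(text))                       # N distinct characters
    return math.ceil(math.log2(n)) * len(text)

print(fixed_length_bits("ABRACADABRA"))      # 5 letters -> 3 bits each -> 33
```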
  4. Yes!!!! We can do better, provided: some characters are more frequent than others; characters may use codewords of different bit lengths, so that, for example, in English the frequent letter a may use only one or two bits while the rare letter y may use several; and we have a unique way of decoding the bit stream.
  5. Using Variable-Length Encoding (1): magic word ABRACADABRA. Let A = 0, B = 100, C = 1010, D = 1011, R = 11. Then ABRACADABRA = 01001101010010110100110, so the 11 letters demand only 23 bits instead of 33, an improvement of about 30%.
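Encoding with this table is plain codeword concatenation; a minimal sketch using the slide's table:

```python
# Codeword table from the slide.
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text: str, code: dict[str, str]) -> str:
    """Concatenate each character's codeword."""
    return "".join(code[ch] for ch in text)

bits = encode("ABRACADABRA", CODE)
print(bits, len(bits))   # 01001101010010110100110, 23 bits
```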
  6. Using Variable-Length Encoding (2): however, there is a serious danger: how do we ensure unique reconstruction? Let A = 01 and B = 0101. How should we decode 010101? As AB? BA? AAA? There is no problem if we use prefix codes: no codeword is a prefix of another codeword.
  7. Prefix Codes (1): any prefix code can be represented by a full binary tree. Each leaf stores a symbol; each internal node has two children, where the left branch means 0 and the right means 1. A codeword is the path from the root to a leaf, interpreting the left and right branches accordingly.
  8. Prefix Codes (2): ABRACADABRA with A = 0, B = 100, C = 1010, D = 1011, R = 11. Decoding is unique and simple! Read the bit stream from left to right, starting from the root; whenever a leaf is reached, write down its symbol and return to the root.
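The root-to-leaf walk can be simulated without building the tree explicitly, by growing the current path until it matches a codeword (a sketch, using the slide's table):

```python
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def decode(bits: str, code: dict[str, str]) -> str:
    """Extend the current root-to-leaf path one bit at a time; when it
    equals a codeword we have reached a leaf, so emit the symbol and
    return to the root (empty path)."""
    reverse = {cw: sym for sym, cw in code.items()}
    out, path = [], ""
    for b in bits:
        path += b
        if path in reverse:          # reached a leaf
            out.append(reverse[path])
            path = ""                # back to the root
    return "".join(out)

print(decode("01001101010010110100110", CODE))  # ABRACADABRA
```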
  9. Prefix Codes (3): let fᵢ be the frequency of the i-th symbol and dᵢ the number of bits of its codeword (= the depth of that symbol in the tree), 1 ≤ i ≤ n. How do we find the optimal coding tree, the one that minimizes the cost C = Σᵢ₌₁ⁿ fᵢ dᵢ? Frequent characters should have short codewords; rare characters should have long codewords.
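The cost C is easy to evaluate directly; a sketch using the letter counts of ABRACADABRA (note that C then equals the encoded length from the earlier slide):

```python
def tree_cost(freq: dict[str, int], code: dict[str, str]) -> int:
    """Cost C = sum over symbols of f_i * d_i, where d_i is the
    codeword length (= depth of the symbol in the tree)."""
    return sum(f * len(code[sym]) for sym, f in freq.items())

FREQ = {"A": 5, "B": 2, "C": 1, "D": 1, "R": 2}   # counts in ABRACADABRA
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
print(tree_cost(FREQ, CODE))   # 5*1 + 2*3 + 1*4 + 1*4 + 2*2 = 23
```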
  10. Huffman’s Idea: from the definition of the cost of a tree, it is clear that the two symbols with the smallest frequencies must sit at the bottom of the optimal tree, as children of the deepest internal node. This is a strong hint that we should build the optimal code bottom-up, and Huffman’s idea is a greedy approach based on exactly this observation. Repeat until all nodes are merged into one tree: remove the two nodes with the lowest frequencies; create a new internal node with the two just-removed nodes as children (either node can be either child) and the sum of their frequencies as its frequency.
  11. Constructing a Huffman Code (1): assume the frequencies of the symbols are A: 40, B: 20, C: 10, D: 10, R: 20. The smallest values are 10 and 10 (C and D), so connect them.
  12. Constructing a Huffman Code (2): C and D have already been used, and the new node above them (call it C+D) has value 20. The smallest values are now B, C+D, and R, all of which are 20; connect any two of these. Clearly the algorithm does not construct a unique tree, but even if we had chosen the other possible connection, the resulting code would be optimal too!
  13. Constructing a Huffman Code (3): the smallest value is now R (20), while A and B+C+D both have value 40. Connect R to either of the others.
  14. Constructing a Huffman Code (4): connect the final two nodes, labeling each left branch 0 and each right branch 1.
  15. Algorithm: X is the set of n symbols, whose frequencies are known in advance; Q is a min-priority queue keyed on frequency, implemented as a binary heap. Insert all symbols into Q; then, n − 1 times, extract the two minimum-frequency nodes from Q, merge them under a new internal node whose frequency is their sum, and insert that node back into Q. The single node left in Q is the root of the Huffman tree.
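The algorithm sketched above maps naturally onto Python's `heapq` module (a sketch; the counter entry is only a tie-breaker so the heap never compares trees, and ties mean the exact tree shape may differ from the slides while the cost stays optimal):

```python
import heapq
from itertools import count

def huffman_codes(freq: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code with a min-heap of (frequency, id, tree)
    entries. A tree is a symbol (leaf) or a [left, right] pair."""
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):            # n - 1 merges
        f1, _, left = heapq.heappop(heap)     # two lowest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), [left, right]))
    root = heap[0][2]
    codes: dict[str, str] = {}
    def walk(node, path):
        if isinstance(node, str):             # leaf: record its codeword
            codes[node] = path or "0"         # degenerate 1-symbol case
            return
        walk(node[0], path + "0")             # left branch means 0
        walk(node[1], path + "1")             # right branch means 1
    walk(root, "")
    return codes

freq = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
codes = huffman_codes(freq)
print(sum(f * len(codes[s]) for s, f in freq.items()))   # optimal cost: 220
```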
  16. What about Complexity? Building the initial binary heap on n symbols needs O(n log n). The loop runs n − 1 times, and each iteration performs two EXTRACT-MIN operations and one INSERT, each needing O(log n), so the loop needs O(n log n). Thus, the algorithm needs O(n log n) overall.
  17. Algorithm’s Correctness: it is proven that the greedy algorithm HUFFMAN is correct, as the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties. Greedy choice: let C be an alphabet in which each character c ∈ C has frequency f[c], and let x and y be the two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit. Optimal substructure: let C be a given alphabet with frequency f[c] defined for each character c ∈ C, and let x and y be two characters in C with minimum frequency. Let C′ be the alphabet C with characters x, y removed and a new character z added, so that C′ = (C − {x, y}) ∪ {z}; define f for C′ as for C, except that f[z] = f[x] + f[y]. Let T′ be any tree representing an optimal prefix code for the alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C.
  18. Last Remarks: Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, and computer networks. Huffman encoding is practical if the encoded string is large relative to the code table (because the code table has to be included with the message if it is not widely known), or if we agree on the code table in advance; for example, it is easy to find a table of letter frequencies for English (or any other alphabet-based language).
  19. Ευχαριστώ! (Thank you!)