Background
• When we encode characters in computers, we assign each an 8-bit code based on an ASCII chart.
• But in most files, some characters appear more often than others.
• So wouldn't it make more sense to assign shorter codes to characters that appear more often, and longer codes to characters that appear less often?
Data Compression
• Lossless compression:
– No information is lost: we can reconstruct the original text
– Less space is used, or less energy to transmit it
– Often: we want to be able to quickly reconstruct the original text
• Lossy compression
Data Compression
• Text representation:
• ASCII: 1 byte per character (fixed-length code)
– "Marc":
  1st byte: 'M' = 77 = 01001101
  2nd byte: 'a' = 97 = 01100001
  3rd byte: 'r' = 114 = 01110010
  4th byte: 'c' = 99 = 01100011
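The fixed-length encoding above can be checked directly. A minimal Python sketch (the string "Marc" and the 8-bit codes are from the slide):

```python
# Fixed-length ASCII: every character costs exactly 8 bits,
# regardless of how common it is.
text = "Marc"
for ch in text:
    code = ord(ch)                        # ASCII code point
    print(ch, code, format(code, "08b"))  # e.g. M 77 01001101
```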
• In the English language, the letter 'e' is quite common, while 'x' is less common.
– Why should all characters be represented using the same number of bits?
Huffman
• In 1951, David Huffman found the "most efficient method of representing numbers, letters, and other symbols using binary code".
• It is now a standard method used for data compression.
The Concept
• Huffman coding has the following properties:
– Codes for more probable characters are shorter than those for less probable characters.
– Each code can be uniquely decoded.
The Concept
• To read the codes from a Huffman tree, start from the root, add a 0 every time you go left to a child, and add a 1 every time you go right. So in this example, the code for the character b is 01 and the code for d is 110.
• As you can see, a has a shorter code than d.
• Notice that since all the characters are at the leaves of the tree (the ends), one code can never be the prefix of another (e.g., codes like 0 for a and 01 for b can never both occur).
• Hence, this unique prefix property ensures that each code can be uniquely decoded.
Huffman Codes
• How do we construct Huffman codes?
• Build a binary tree!
Binary Tree
• A binary tree is a tree where each internal node has at most 2 children.
Binary Tree

          A          <- the root node (an internal vertex)
         / \
        B   C        <- B: left node, C: right node (sibling nodes)
       / \ / \
      D  E F  G      <- leaf vertices
Algorithm
1. Take the characters and their frequencies, and sort this list by increasing frequency.
2. All the characters are vertices of the tree.
3. Take the first 2 vertices from the list and make them children of a new vertex having the sum of their frequencies.
4. Insert the new vertex into the sorted list of vertices waiting to be put into the tree.
5. If there are at least 2 vertices in the list, go to step 3.
6. Read the Huffman code from the tree.
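The six steps above can be sketched in Python. A min-heap stands in for the sorted list; the nested-tuple tree shape (leaves are (freq, char), internal nodes are (freq, left, right)) is a representation chosen here for illustration, not something the slides prescribe:

```python
import heapq
from itertools import count

def build_huffman_tree(freqs):
    """Build a Huffman tree from {char: frequency}.
    Leaves are (freq, char); internal nodes are (freq, left, right)."""
    tiebreak = count()  # keeps heap comparisons off the tree tuples
    # Steps 1-2: all characters start as vertices, kept sorted by frequency.
    heap = [(f, next(tiebreak), (f, c)) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) >= 2:                    # step 5: repeat while possible
        f1, _, left = heapq.heappop(heap)    # step 3: take the two vertices
        f2, _, right = heapq.heappop(heap)   #   of smallest frequency
        merged = (f1 + f2, left, right)      #   parent holds the sum
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))  # step 4
    return heap[0][2]

# The slide's example frequencies:
tree = build_huffman_tree({"A": 3, "O": 5, "T": 7, "E": 10})
print(tree[0])  # root frequency: 3 + 5 + 7 + 10 = 25
```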
Huffman's Code
1. Take the characters and their frequencies, and sort this list by increasing frequency:
   A: 3, O: 5, T: 7, E: 10
Huffman's Code
2. All the characters are vertices of the tree:
   A:3   O:5   T:7   E:10
Huffman's Code
3. Take the first two vertices from the list (A:3 and O:5) and make them children of a new vertex having the sum of their frequencies:

        *:8
       /   \
     A:3   O:5
Huffman's Code
4. Insert the new vertex into the sorted list of vertices waiting to be put into the tree:
   • List of remaining vertices: T:7, E:10
   • New list, with the new vertex inserted: T:7, *:8, E:10
Huffman's Code
5. Take the first 2 vertices from the list (T:7 and *:8) and make them children of a new vertex having the sum of their frequencies:

        *:15
       /    \
     T:7    *:8
           /   \
         A:3   O:5
Huffman's Code
6. Insert the new vertex into the sorted list of vertices waiting to be put into the tree:
   • List of remaining vertices: E:10
   • New list, with the new vertex inserted: E:10, *:15
Huffman's Code
7. Take the first two vertices from the list (E:10 and *:15) and make them children of a new vertex with frequency 10 + 15 = 25:

          *:25
         /    \
      E:10    *:15
             /    \
           T:7    *:8
                 /   \
               A:3   O:5
Huffman's Code
• Left branch = 0
• Right branch = 1

          *:25
        0/    \1
      E:10    *:15
            0/    \1
           T:7    *:8
                0/   \1
              A:3    O:5

• Huffman code:
   E: 0
   T: 10
   A: 110
   O: 111
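Reading the codes off the tree can be done mechanically. A small Python walker, assuming the tree is stored as nested tuples where leaves are (freq, char) and internal nodes are (freq, left, right) (a representation chosen for illustration):

```python
def read_codes(node, prefix=""):
    """Collect {char: code}: append '0' going left, '1' going right."""
    if len(node) == 2:                 # leaf: (freq, char)
        return {node[1]: prefix or "0"}
    _, left, right = node
    codes = read_codes(left, prefix + "0")
    codes.update(read_codes(right, prefix + "1"))
    return codes

# The tree from the example: root *:25, E:10 on the left, etc.
tree = (25, (10, "E"), (15, (7, "T"), (8, (3, "A"), (5, "O"))))
print(read_codes(tree))  # {'E': '0', 'T': '10', 'A': '110', 'O': '111'}
```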
Advantages of Huffman Codes
• Reduce the size of data by 20%-90% in general.
• If no characters occur more frequently than others, there is no advantage over ASCII.
• Encoding:
– Given the characters and their frequencies, perform the algorithm to generate a code, then write the characters using that code.
• Decoding:
– Given the Huffman tree, figure out what each character is (possible because of the prefix property).
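Decoding works exactly as the prefix property suggests: walk from the root following bits until a leaf is reached, emit its character, and restart at the root. A sketch, again assuming leaves are (freq, char) and internal nodes are (freq, left, right) (a representation chosen for illustration):

```python
def decode(bits, root):
    """Walk from the root: '0' = go left, '1' = go right; a leaf emits a char."""
    out, node = [], root
    for b in bits:
        node = node[1] if b == "0" else node[2]
        if len(node) == 2:           # leaf reached: (freq, char)
            out.append(node[1])
            node = root              # restart for the next character
    return "".join(out)

# Tree from the example; its codes are E:0, T:10, A:110, O:111.
tree = (25, (10, "E"), (15, (7, "T"), (8, (3, "A"), (5, "O"))))
print(decode("100110", tree))  # "10" + "0" + "110" decodes to "TEA"
```

Because no code is a prefix of another, the walk never has to guess where one character ends and the next begins.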
Applications
• Both the .mp3 and .jpg file formats use Huffman coding at one stage of the compression.
• An alternative method that achieves higher compression but is slower is patented by IBM, making Huffman codes attractive.