Huffman Data Compression
Joseph S. Lee
May 22, 2011
Abstract
This paper gives a guide for implementing binary compression algo-
rithms through the use of binary trees, and then examines the real-life
uses that binary compression offers.
1 Introduction
The Huffman Algorithm was created in the early 1950s by David Huffman when
he was a PhD student at MIT. At the time, Huffman was a student in an
applied mathematics course and was given the option of either writing a research
paper on finding the most efficient binary compression code or taking the final
exam. Though Huffman decided to research a compression algorithm, he nearly
gave up and began studying for his final exam before he discovered that a
binary-tree frequency-sorting method was extremely efficient in compressing
any given message sequence. Thus, Huffman invented binary compression, a
method of repackaging long messages into shorter encoded messages, which can
be reassembled into the original data without any loss of information. Huffman
published his findings the following year in the Proceedings of the I.R.E. journal
[1].
The Huffman Algorithm was revolutionary in producing an optimal prefix code
for compressing any text document. The Huffman Algorithm is a lossless
compression method, which means that it does not discard any data during
the compression process. Although the Huffman Algorithm is crucial for the
compression of important text files such as articles or bank records, it was not
incorporated into the compression of video and still images until the intro-
duction of HuffYUV and Adaptive Binary Optimization (ABO), which employ
the Huffman Algorithm in the lossless compression of video and image files.
In Section 2, we illustrate some basic concepts of how compression works
using a simple dictionary method. In Section 3, we show how to implement
the Huffman Algorithm, giving examples of encoding a string of characters into
compressed text. In Section 4, we give an example of a modification of the Huff-
man Algorithm, the Hu-Tucker Algorithm, as demonstrated by Lii [2]. Finally,
in Section 5, we explore how the Huffman Algorithm is implemented in modern
compression programs for videos, photos, and documents.
2 Basic Compression
Text compression is achieved by eliminating repetitive words in the text and
replacing them with a representative symbol or number. Since text-based data
becomes very large and cumbersome, compressing repeated words and redun-
dant material can greatly reduce the file size. File size is the measurement of
the space a file needs on a computer and is measured in bytes. In this section,
the compression method is a simple dictionary-based compression system. The
dictionary, the key to decoding the compressed file, holds the repeated words in
order of appearance. In the compressed, or encoded, text, each repeated word
is symbolized by its number in the dictionary. For example, consider the
following message m:
m = “Life is an opportunity, benefit from it. Life is beauty, admire it.”
This text contains 12 words, made up of 52 letters, 11 spaces, and 4 punc-
tuation marks. The total file size of this text is 67 characters. However,
compression can be achieved by grouping redundant words together. Repeated
words can be easily listed after a quick scan over the text.
1. “life” appears two times;
2. “is” appears two times;
3. “an” appears one time;
4. “opportunity” appears one time;
5. “benefit” appears one time;
6. “from” appears one time;
7. “it” appears two times;
8. “beauty” appears one time;
9. “admire” appears one time.
By using this list of words and their frequency of appearance in the message,
the text can be compressed by creating a dictionary out of these words.
The dictionary for the message is shown in Table 2-1. The file size of the
dictionary is the sum of the lengths of the words it contains.
The encoded message, C(m), reads as follows, with hyphens signifying spaces:
C(m) = “1-2-3-4,-5-6-7. 1-2-8,-9-7.”
Using the compression dictionary, the encoded message can be easily decoded
to reveal the original message. A decompressor utility simply recreates the
original message by substituting the original values given in the dictionary for
each encoded value.
The effectiveness of compression can be seen through the comparison of file
sizes. The encoded file size includes the dictionary and the encoded message
itself. Without the dictionary, the encoded message cannot be expanded into
the original message.
n   Value
1   life
2   is
3   an
4   opportunity
5   benefit
6   from
7   it
8   beauty
9   admire

Table 2-1. Compression dictionary.
• File size of C(m) = 27 units
• File size of dictionary = 44 units
• Total compressed file size = 71 units
• Original file size = 67 units
The compressed file size shows that the encoded message is actually larger
than the original message. This is because the dictionary in Table 2-1 even
lists words that are used only once in the text. For better compression, words
that appear only once should remain uncompressed in the message, as they
would demand unnecessary space in the dictionary.
It is possible to compress the message even further by finding redundant
patterns and not restricting dictionary entries to only words. The following
table is a modified dictionary containing repetitive text, with hyphens denoting
spaces:
n   Value
1   Life-is-
2   -it.

Table 2-2. Modified compression dictionary.
The new encoded message (C(m)) is:
C(m) = “1an-opportunity,-benefit-from2-1beauty,-admire2”
• File size of C(m) = 47 units
• File size of dictionary = 12 units
• Total compressed file size = 59 units
• File size of original message = 67 units
Thus, by compressing longer strings of words and omitting single occurrences
of words from the dictionary, the message is compressed further, saving a total
of 8 units of file space.
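To make the dictionary method concrete, the following minimal Python sketch carries out the substitution of Table 2-2. The helper names compress and decompress are hypothetical, not from the paper, and the sketch assumes the original text contains no digits, since digits serve as the dictionary indices.

def compress(message, dictionary):
    # Substitute longer entries first so shorter ones cannot break them up.
    for n, phrase in sorted(dictionary.items(), key=lambda kv: -len(kv[1])):
        message = message.replace(phrase, str(n))
    return message

def decompress(encoded, dictionary):
    # Reverse substitution: expand each index back into its phrase.
    for n, phrase in dictionary.items():
        encoded = encoded.replace(str(n), phrase)
    return encoded

m = "Life is an opportunity, benefit from it. Life is beauty, admire it."
d = {1: "Life is ", 2: " it."}  # the modified dictionary of Table 2-2
c = compress(m, d)
print(c)  # 1an opportunity, benefit from2 1beauty, admire2
assert decompress(c, d) == m

Running the sketch reproduces the 47-unit encoded message above (with real spaces in place of the hyphens used for display).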
However, compressing a message using the frequency of appearance of words
does have limitations. For instance, if many words are unrepeated, they will
remain uncompressed. Furthermore, if a message does not contain repeated
words, a dictionary of words would be unable to compress the message. As an
example, consider the following message:
m = “The quick brown fox jumps over a lazy dog.”
The message contains no repeated words and will be left uncompressed by
the compression method using a dictionary of words. Compressing entire words
proves to be difficult, since the dictionary would become large and cumbersome
for a lengthy text file, as there are nearly 200,000 words in current use in the
English language alone. Theoretically, compressing an electronic novel could
require a dictionary containing thousands of words, and still thousands more
words would be left uncompressed. A solution to these limitations is to
represent each character with a binary code whose length runs from shortest
to longest according to the character's frequency of usage in the text. This
allows for successful compression of any message, since the most frequently
used characters, which collectively occupy the most space, are denoted by the
shortest binary codes.
3 Huffman Algorithm
The Huffman Algorithm solves the limitations of the basic method in Section 2
by compressing every single character in a given text. Compression is achieved
by giving each character or letter in the message a binary code. Letters with a
high frequency of appearance are assigned the shortest binary codes, since they
use the most file space, and letters that appear infrequently are given longer
binary codes.
Let us begin with a simple example. Consider the following message:
m = “ababac”
The file size of the message is measured in binary; the message
“ababac” can be represented in binary code, as each character has a designated
8-bit binary number:
m in binary = 01100001 01100010 01100001 01100010 01100001 01100011
From the binary form of the message, we can see that the file size of the
original message is 48 bits (six 8-bit characters).
In order to compress the message, we can assign the most frequently used
character the shortest binary code, and inversely, assign the least used character
with the longest binary code. For example, the following dictionary in Table
3-1 can be used to compress the original message.
Character   Binary code
a           1
b           01
c           00

Table 3-1. Huffman dictionary.
As we can see in Table 3-1, the shortest binary code replaces the most fre-
quently used character, “a”. Also, progressively less frequently used characters
are given longer binary codes. Because frequently used letters receive shorter
codes, the expected message length is at most that of the uncompressed message.
In order to assign a binary code of the appropriate length to each letter, the
Huffman Algorithm uses a binary tree like that in Figure 3-1.
[Figure 3-1. A Huffman binary tree with labeled edges: the root (weight 6) has
an edge labeled 0 to an internal node of weight 3 and an edge labeled 1 to the
leaf a (weight 3); the internal node has an edge labeled 0 to the leaf c (weight 1)
and an edge labeled 1 to the leaf b (weight 2).]
The binary code for a character is read starting at the root node and record-
ing the binary number of each edge traversed to reach the leaf node of the
character. For example, for the least frequently appearing letter, “c”, combin-
ing the binary numbers of the edges along the path from the root node to the
leaf node of “c” gives us the binary code “00”. Conversely, “a”, the most
frequently used character, has the shortest path and the shortest code, “1”.
In order to construct the Huffman binary tree, we must first collect the
frequencies of each character appearing in the text. Each of these frequencies
are the leaf nodes in the Huffman binary tree. The Huffman Algorithm encodes
the message using the following steps:
1. Begin with a queue of the frequencies of the characters sorted from least
to greatest.
2. Begin creating a binary tree by removing the two smallest frequency values
from the queue. Add them to create a parent node.
3. Add the parent node back into the queue.
4. Repeat Steps 2–3, keeping the queue sorted, until only a single node, the
root, remains.
[Figure 3-2. A Huffman binary tree: the root (weight 6) combines an internal
node of weight 3 (over the leaves c: 1 and b: 2) with the leaf a: 3.]
Now, we can assign the left-sided edges as “0” and the right-sided edges as
“1”. Notice that letters appearing least frequently are combined into parent
nodes first, resulting in a longer path from the root node down to the respective
leaf node of the lowest frequency letter. The smaller the frequency of the letter,
the longer the path. This results in longer binary codes for less frequent letters.
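The construction steps above translate directly into a priority-queue sketch, shown below. The helper name huffman_codes is hypothetical, not from the paper; note also that tie-breaking between equal weights may mirror some 0/1 edge labels relative to Figure 3-1, which changes the codewords but not their lengths, and hence not the compressed size.

import heapq
from collections import Counter
from itertools import count

def huffman_codes(message):
    freqs = Counter(message)
    tie = count()  # tie-breaker so the heap never has to compare subtrees
    heap = [(w, next(tie), ch) for ch, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # remove the two smallest weights
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))  # parent node
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left_child, right_child)
            walk(node[0], prefix + "0")  # left edge labeled "0"
            walk(node[1], prefix + "1")  # right edge labeled "1"
        else:
            codes[node] = prefix or "0"  # a one-symbol message still needs 1 bit
    walk(heap[0][2], "")
    return codes

print(huffman_codes("ababac"))  # code lengths match Table 3-1: a gets 1 bit; b and c get 2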
Using the Huffman binary tree, we can find the binary codes for each char-
acter, as shown in the dictionary in Table 3-1. Replacing the characters in
the original message with the binary codes in the dictionary gives us a short,
compressed message C(m).
m = “01100001 01100010 01100001 01100010 01100001 01100011”
C(m) = “1 01 1 01 1 00”
• File size of C(m) = 9 bits
• File size of dictionary = 5 bits
• Total compressed file size = 14 bits
• File size of original message = 48 bits
Even for such a small and simple message, we can see that the Huffman
Algorithm is effective at compression, reducing the file to less than 30% of its
original size.
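These figures can be checked with a short sketch; encode and decode are hypothetical helper names. Decoding is unambiguous because the code is prefix-free: no codeword is a prefix of another, so the first match while scanning bits is always a complete symbol.

codes = {"a": "1", "b": "01", "c": "00"}  # the dictionary of Table 3-1

def encode(message):
    return "".join(codes[ch] for ch in message)

def decode(bits):
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # prefix-free code: the first match is a symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

bits = encode("ababac")
assert bits == "101101100" and decode(bits) == "ababac"
print(len("ababac") * 8, "bits ->", len(bits), "bits")  # 48 bits -> 9 bits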
In the following example, a longer and more complex message will be encoded
with the Huffman Algorithm.
Message = “AABBACCACACACADCFACACFCEFCCBBD”
First, the frequency of appearance of each letter is recorded:
A: 9 C: 11 E: 1 B: 4 D: 2 F: 3
1. Sort the frequencies into a queue from smallest to largest:
C: 11 A: 9 B: 4 F: 3 D: 2 E: 1
2. Remove the two smallest frequencies in the queue and add them to create
a parent node.
3. Add the parent node frequency to the queue and re-sort the frequencies
from smallest to largest. Repeat Steps 2–3 until the root node is formed.
[Figure 3-3. The completed binary tree: the root (30) combines C: 11 with a
node of weight 19; 19 combines A: 9 with 10; 10 combines B: 4 with 6; and 6
combines an internal node of weight 3 (over E: 1 and D: 2) with F: 3.]
Thus, using the Huffman binary tree, we can find the binary codes for each
letter, remembering that left and right edges are labeled “0” and “1”, respectively.
C: 0 A: 10 B: 110 F: 1111 D: 11101 E: 11100
The original message is a 240-bit string (30 characters at 8 bits each), and we
are able to compress the file to less than 40% of its original size with a simple
Huffman binary tree. Summing frequency times code length over the six letters
gives 11·1 + 9·2 + 4·3 + 3·4 + 1·5 + 2·5 = 68 bits.
• File size of C(m) = 68 bits
• File size of dictionary = 20 bits
• Total compressed file size = 88 bits
• File size of original message = 240 bits
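These totals are easy to verify mechanically with a minimal sketch built from the codes and frequencies above:

codes = {"C": "0", "A": "10", "B": "110", "F": "1111", "E": "11100", "D": "11101"}
freqs = {"A": 9, "B": 4, "C": 11, "D": 2, "E": 1, "F": 3}
compressed = sum(freqs[s] * len(code) for s, code in codes.items())  # message bits
dictionary = sum(len(code) for code in codes.values())               # dictionary bits
print(compressed, dictionary, compressed + dictionary)  # 68 20 88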
By simple observation of the Huffman binary tree in Figure 3-3, two
points can be made in order to generalize the structure of the binary tree:
1. The lowest-frequency symbol is assigned the longest code, and higher-
frequency symbols have shorter binary codes.
2. There are always two codes of the longest length, because the two lowest-
frequency symbols are merged first and therefore end up as sibling leaves
at the deepest level of the tree.
Although the Huffman Algorithm is very simple and efficient, its level of
compression can be improved upon. Higher-order Huffman trees can be
implemented in order to achieve more compact output. While the Huffman
Algorithm combines two frequencies at every step, a higher-order Huffman Algo-
rithm uses a tree data structure in which three or more frequencies are combined.
These higher-order Huffman trees are called ternary Huffman trees, quaternary
Huffman trees, and so on. An example of a ternary Huffman tree applied to
the previous example is shown in Figure 3-4.
[Figure 3-4. A ternary Huffman tree: the root (30) has children A: 9, an
internal node of weight 10, and C: 11; node 10 has children 3 (an internal node
over Null: 0, E: 1, and D: 2), F: 3, and B: 4.]
Ternary trees follow the same steps as Huffman binary trees, except three
frequencies are combined at each node. However, a ternary tree cannot be
constructed when the number of leaf nodes is even. This problem can be
solved by inserting a null node with zero frequency, as in the following steps
(a code sketch follows the list):
1. If the number of frequency values is even, then add a null node with a
value of “0”.
2. Sort the frequency values from smallest to greatest into a queue.
3. Remove the 3 smallest frequency values from the queue and add them to
create a parent node.
4. Add the parent node frequency back into the queue.
5. Re-sort the frequencies from smallest to largest.
6. Repeat steps 2–5 until the root node is created.
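The padding-and-merging steps generalize to any branching factor k; the following minimal sketch assumes that generalization, and the helper name kary_code_lengths is hypothetical. It returns only the code length of each symbol, which is enough to compute the compressed size.

import heapq

def kary_code_lengths(freqs, k=3):
    nodes = [(w, [s]) for s, w in freqs.items()]
    # Pad with zero-weight null leaves until (n - 1) % (k - 1) == 0, so that
    # every internal node, including the root, can take exactly k children.
    while (len(nodes) - 1) % (k - 1) != 0:
        nodes.append((0, []))  # null node: carries no symbols
    heapq.heapify(nodes)
    depth = {s: 0 for s in freqs}
    while len(nodes) > 1:
        merged_w, merged_syms = 0, []
        for _ in range(k):  # remove the k smallest weights
            w, syms = heapq.heappop(nodes)
            merged_w += w
            merged_syms += syms
            for s in syms:
                depth[s] += 1  # each merge pushes its symbols one level down
        heapq.heappush(nodes, (merged_w, merged_syms))
    return depth

print(kary_code_lengths({"A": 9, "B": 4, "C": 11, "D": 2, "E": 1, "F": 3}))
# {'A': 1, 'B': 2, 'C': 1, 'D': 3, 'E': 3, 'F': 2} -- matches the ternary codes below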
Each node in a ternary Huffman tree has three edges: left, middle, and right.
The left, middle, and right edges can be labeled as the symbols 0, 1, and 2. The
ternary code for each character is shown below:
A: 0 C: 2 F: 11 B: 12 E: 101 D: 102
The message can be compressed by replacing the characters with the ternary
codes. Counted in output symbols, the ternary Huffman tree is more compact
than the binary one: a total of 55 units versus 88 bits for the binary tree. Note,
however, that each unit here is a ternary digit (trit), which carries log2 3 ≈ 1.58
bits of information, so the two totals are not directly comparable.
• File size of C(m) = 43 trits (9·1 + 11·1 + 3·2 + 4·2 + 1·3 + 2·3)
• File size of dictionary = 12 trits
• Total compressed file size = 55 trits
• File size of original message = 240 bits
4 Preservation of Order: Hu-Tucker Algorithm
The Huffman Algorithm is a simple method which does not take into account
the order of the characters in the text, because the nodes are sorted with respect
to frequency. The Hu-Tucker Algorithm modifies the Huffman algorithm so
that the queue of frequency nodes does not have to be re-sorted at each step,
thus preserving the characters' original order of appearance.
The Hu–Tucker Algorithm minimizes the message length in almost the same
way that the Huffman Algorithm does, but while preserving the order of char-
acters, corresponding to the original ordering of the nodes. Like the Huffman
Algorithm, the Hu–Tucker method also merges the smallest frequency blocks
together, but with different rules based on the positioning of the nodes. This
keeps the characters in their order of appearance in the
original text.
There are two rules to follow when implementing the Hu-Tucker algorithm.
1. Two nodes in the queue can be merged only if there are no leaf (original
character) nodes between them; internal nodes created by earlier merges
do not block a merge.
2. Among all compatible pairs, the two nodes with the smallest combined
frequency are merged into a new node.
Figure 4-1 is a tree created by the Hu-Tucker algorithm. Note that the
smallest compatible nodes are always merged together before larger ones.
[Figure 4-1. A Hu-Tucker tree over the ordered leaves A: 1, B: 4, C: 2, D: 7,
E: 1, F: 1: the root (16) has children 7 and 9; node 7 combines node 5 (A + B)
with C; node 9 combines D with node 2 (E + F).]
The order of appearance is determined by the binary code given to each
character. As in the Huffman binary tree, the left and right edges leading out of
each parent node are labeled “0” and “1”, respectively. Because the leaves keep
their original left-to-right order, earlier characters receive lexicographically
smaller codewords. For instance, in our example, “A” is symbolized by “000”
and appears in the text before “E”, which is symbolized by the code “110”. As
a further example, a symbol with the code “01” appears before a symbol with
the code “100”.
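The combination rules can be tested with a naive sketch of the first (merging) phase. The helper name hu_tucker_code_lengths is hypothetical, and the sketch assumes the standard compatibility rule in which internal nodes created by earlier merges are transparent, with ties broken in favor of the leftmost pair. The merge depth of each leaf equals its code length in the final order-preserving tree; the full algorithm has a second phase that rebuilds the alphabetic tree from these lengths.

def hu_tucker_code_lengths(weights):
    # Work-list entry: (weight, is_leaf, {original index: depth below this node})
    nodes = [(w, True, {i: 0}) for i, w in enumerate(weights)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                # Compatible pair: no leaf strictly between them; internal
                # nodes from earlier merges do not block a merge.
                if any(nodes[k][1] for k in range(i + 1, j)):
                    continue
                s = nodes[i][0] + nodes[j][0]
                if best is None or s < best[0]:
                    best = (s, i, j)  # first hit on ties = leftmost pair
        s, i, j = best
        merged = {sym: d + 1 for sym, d in {**nodes[i][2], **nodes[j][2]}.items()}
        nodes[i] = (s, False, merged)
        del nodes[j]
    return [nodes[0][2][i] for i in range(len(weights))]

print(hu_tucker_code_lengths([1, 4, 2, 7, 1, 1]))
# [3, 3, 2, 2, 3, 3] -- the lengths of the codes 000, 001, 01, 10, 110, 111 in Figure 4-1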
5 New Technology using the Huffman Algorithm
Currently, lossy compression utilities such as JPEG and MPEG are widely used
by the general public. As the name “lossy” suggests, they compress
information while allowing a loss of information. However, new compression
programs for videos, images, and text have been designed using the Huffman
Algorithm, which does not allow any loss of information. The Huffman Algorithm
is easy to implement for text files, but requires modification to work with image
and video files. Two current compression programs using the Huffman Algo-
rithm are HuffYUV and ABO (Adaptive Binary Optimization). These programs
have gained much attention for the speed of their lossless compression, since lossless
compression is widely preferred to lossy compression.
HuffYUV is a video codec known for its speedy compression of videos. It
is somewhat incorrectly named, as it does not perform the Huffman algorithm
on YUV (a color scheme adapted from the traditional RGB, Red/Green/Blue
color space). HuffYUV compresses a color space called “YCbCr” which is used
in video systems, employing brightness components (Y, luma) coupled with red
and blue chroma components (CbCr).
This codec is one of the best in terms of speed and efficiency. Thus, people
watching a video compressed by HuffYUV (or any other lossless video codec)
can enjoy a high-quality video, while reveling in the fact that the video takes
up little space on the computer.
Among video codecs, there are much slower ones than HuffYUV, but they are
sometimes chosen over HuffYUV for their additional features or better
compression. One popular codec for videos is “Lagarith,” which has an encoding
speed comparable to most other codecs. Lagarith is favorable in terms of options
because it has the capability of supporting most types of color space, including
RGB, RGBA, YV12, and YUY2. Although its decoding is much slower than
that of most codecs, Lagarith makes up for the speed by allowing individual
video frames to be decoded, which lets the video be edited smoothly by cutting
and splicing frames. Huffman codecs are among the fastest encoders and
decoders to date, because of the overwhelming simplicity of binary coding.
Image compressors also utilize the Huffman Algorithm, although it is less
crucial that the speed of compression be extremely fast, since image files are
comparatively small next to video and music files. Adaptive Binary Optimiza-
tion (ABO) is a variation on the Huffman Algorithm, able to compress image
files without a loss of information.
ABO created waves in the technological world by challenging the JPEG com-
pression process. Created in Singapore, ABO is now a favored image compressor
for hospitals and document-processing companies, since it can reduce the file size
of an image without any loss of information in the final output of the decompressor.
MatrixView, the company that developed ABO, is working in partnership
with the KK Women’s and Children’s Hospital in Singapore, helping them to
archive their library of ultrasound images and videos. According to Arvind Thi-
agarajan, MatrixView’s founder, ABO can compress image files up to 32 times
the ratio of JPEG compression, and maintain a lossless stream of information
[3].
With the help of Huffman’s ideas, ABO compression could very well be the
premier technology for videos and images within the next few years. Lossless
compression is the most favorable filing method, and Huffman’s algorithm is
essential in allowing that to happen.
References
[1] Huffman, D.A., “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the I.R.E., 1952.
[2] Lii, J., “Finding Efficient Compressions: Huffman and Hu-Tucker Algorithms,” 18.310 Lecture Notes, MIT, 2004.
[3] Lui, J., “ABO aims to out-compress JPEG,” ZDNet, 2003.