Huffman Data Compression
Joseph S. Lee
May 22, 2011
Abstract
This paper gives a guide for implementing binary compression algo-
rithms through the use of binary trees, and then examines the real-life
uses that binary compression offers.
1 Introduction
The Huffman Algorithm was created in the early 1950s by David Huffman when
he was a PhD student at MIT. At the time, Huffman was a student in an
applied mathematics course and was given the option of either writing a research
paper on finding the most efficient binary compression code or taking the final
exam. Though Huffman decided to research a compression algorithm, he nearly
gave up and began studying for his final exam before he discovered that a
binary-tree frequency-sorting method was extremely efficient in compressing
any given message sequence. Thus, Huffman invented binary compression, a
method of repackaging long messages into shorter encoded messages, which can
be reassembled into the original data without any loss of information. Huffman
published his findings the following year in the Proceedings of the I.R.E. journal
[1].
The Huffman Algorithm was revolutionary in producing an optimal prefix code
for compressing any text document. The Huffman Algorithm is a lossless
compression method, which means that it does not discard any data during
the compression process. Although the Huffman Algorithm is crucial for the
compression of important text files such as articles or bank records, it was not
incorporated into the compression of video and still images until the intro-
duction of HuffYUV and Adaptive Binary Optimization (ABO), which employ
the Huffman Algorithm in the lossless compression of video and image files.
In Section 2, we illustrate some basic concepts of how compression works
using a simple dictionary method. In Section 3, we show how to implement
the Huffman Algorithm, giving examples of encoding a string of characters into
compressed text. In Section 4, we give an example of a modification of the Huff-
man Algorithm, the Hu-Tucker Algorithm, as demonstrated by Lii [2]. Finally,
in Section 5, we explore how the Huffman Algorithm is implemented in modern
compression programs for videos, photos, and documents.
2 Basic Compression
Text compression is achieved by eliminating repetitive words in the text and
replacing them with a representative symbol or number. Since text-based data
becomes very large and cumbersome, compressing repeated words and redun-
dant material can greatly reduce the file size. File size is the measurement of
the space a file needs on a computer and is measured in bytes. In this section,
the compression method is a simple dictionary-based compression system. The
dictionary, the key to decoding the compressed file, holds the repeated words in
order of appearance. In the compressed, or encoded, text, each repeated word
is symbolized by its number in the dictionary. For example, consider the
following message m:
m = “Life is an opportunity, benefit from it. Life is beauty, admire it.”
This text contains 12 words, made up of 52 letters, 11 spaces, and 4 punc-
tuation marks. The total file size of this text is 67 characters. However,
compression can be achieved by grouping redundant words together. Repeated
words can be easily listed after a quick scan over the text.
1. “life” appears two times;
2. “is” appears two times;
3. “an” appears one time;
4. “opportunity” appears one time;
5. “benefit” appears one time;
6. “from” appears one time;
7. “it” appears two times;
8. “beauty” appears one time;
9. “admire” appears one time.
By using this list of words and their frequency of appearance in the message,
the text can be compressed by creating a dictionary out of these words.
The dictionary for the message is shown in Table 2-1. The file size of the
dictionary is the sum of the lengths of the words it contains.
The encoded message, C(m), reads as follows, with hyphens signifying spaces:
C(m) = “1-2-3-4,-5-6-7. 1-2-8,-9-7.”
Using the compression dictionary, the encoded message can be easily decoded
to reveal the original message. A decompressor utility simply recreates the
original message by substituting the original values given in the dictionary for
each encoded value.
The effectiveness of compression can be seen through the comparison of file
sizes. The encoded file size includes the dictionary and the encoded message
itself. Without the dictionary, the encoded message cannot be expanded into
the original message.
n   Value
1   life
2   is
3   an
4   opportunity
5   benefit
6   from
7   it
8   beauty
9   admire

Table 2-1. Compression dictionary.
• File size of C(m) = 27 units
• File size of dictionary = 44 units
• Total compressed file size = 71 units
• Original file size = 67 units
The compressed file size shows that the encoded message is actually larger
than the original message. This is because the dictionary in Table 2-1 even
lists words that are used only once in the text. For better compression, words
that appear only once should remain uncompressed in the message, as they
would demand unnecessary space in the dictionary.
It is possible to compress the message even further by finding redundant
patterns and not restricting dictionary entries to only words. The following
table is a modified dictionary containing repetitive text, with hyphens denoting
spaces:
n   Value
1   Life-is-
2   -it.

Table 2-2. Modified compression dictionary.
The new encoded message (C(m)) is:
C(m) = “1an-opportunity,-benefit-from2-1beauty,-admire2”
• File size of C(m) = 47 units
• File size of dictionary = 12 units
• Total compressed file size = 59 units
• File size of original message = 67 units
Thus, by compressing longer strings of words and omitting single occurrences
of words from the dictionary, the message is compressed further, saving a total
of 8 units of file space.
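To make the dictionary method concrete, the following minimal Python sketch carries out the substitution of Table 2-2. The helper names compress and decompress are hypothetical, not from the paper, and the sketch assumes the original text contains no digits, since digits serve as the dictionary indices.

def compress(message, dictionary):
    # Substitute longer entries first so shorter ones cannot break them up.
    for n, phrase in sorted(dictionary.items(), key=lambda kv: -len(kv[1])):
        message = message.replace(phrase, str(n))
    return message

def decompress(encoded, dictionary):
    # Reverse substitution: expand each index back into its phrase.
    for n, phrase in dictionary.items():
        encoded = encoded.replace(str(n), phrase)
    return encoded

m = "Life is an opportunity, benefit from it. Life is beauty, admire it."
d = {1: "Life is ", 2: " it."}  # the modified dictionary of Table 2-2
c = compress(m, d)
print(c)  # 1an opportunity, benefit from2 1beauty, admire2
assert decompress(c, d) == m

Running the sketch reproduces the 47-unit encoded message above (with real spaces in place of the hyphens used for display).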
However, compressing a message using the frequency of appearance of words
does have limitations. For instance, if many words are unrepeated, they will
remain uncompressed. Furthermore, if a message does not contain repeated
words, a dictionary of words would be unable to compress the message. As an
example, consider the following message:
m = “The quick brown fox jumps over a lazy dog.”
The message contains no repeated words and will be left uncompressed by
the compression method using a dictionary of words. Compressing entire words
proves to be difficult, since the dictionary would become large and cumbersome
for a lengthy text file, as there are nearly 200,000 words in current use in the
English language alone. Theoretically, compressing an electronic novel could
require a dictionary containing thousands of words, and still thousands more
words would be left uncompressed. A solution to these limitations is to
represent each character with a binary code whose length runs from shortest
to longest according to the character's frequency of usage in the text. This
allows for successful compression of any message, since the most frequently
used characters, which collectively occupy the most space, are denoted by the
shortest binary codes.
3 Huffman Algorithm
The Huffman Algorithm solves the limitations of the basic method in Section 2
by compressing every single character in a given text. Compression is achieved
by giving each character or letter in the message a binary code. Letters with a
high frequency of appearance are assigned the shortest binary codes, since they
use the most file space, and letters that appear infrequently are given longer
binary codes.
Let us begin with a simple example. Consider the following message:
m = “ababac”
The file size of the message is measured in binary; the message
“ababac” can be represented in binary code, as each character has a designated
8-bit binary number:
m in binary = 01100001 01100010 01100001 01100010 01100001 01100011
From the binary form of the message, we can see that the file size of the
original message is 48 bits (six 8-bit characters).
In order to compress the message, we can assign the most frequently used
character the shortest binary code, and inversely, assign the least used character
with the longest binary code. For example, the following dictionary in Table
3-1 can be used to compress the original message.
Character   Binary code
a           1
b           01
c           00

Table 3-1. Huffman dictionary.
As we can see in Table 3-1, the shortest binary code replaces the most fre-
quently used character, “a”. Also, progressively less frequently used characters
are given longer binary codes. Because frequently used letters receive shorter
codes, the expected message length is at most that of the uncompressed message.
In order to assign a binary code of the appropriate length to each letter, the
Huffman Algorithm uses a binary tree like that in Figure 3-1.
[Figure 3-1. A Huffman binary tree with labeled edges: the root (weight 6) has
an edge labeled 0 to an internal node of weight 3 and an edge labeled 1 to the
leaf a (weight 3); the internal node has an edge labeled 0 to the leaf c (weight 1)
and an edge labeled 1 to the leaf b (weight 2).]
The binary code for a character is read starting at the root node and record-
ing the binary number of each edge traversed to reach the leaf node of the
character. For example, for the least frequently appearing letter, “c”, combin-
ing the binary numbers of the edges along the path from the root node to the
leaf node of “c” gives us the binary code “00”. Conversely, “a”, the most
frequently used character, has the shortest path and the shortest code, “1”.
In order to construct the Huffman binary tree, we must first collect the
frequencies of each character appearing in the text. Each of these frequencies
are the leaf nodes in the Huffman binary tree. The Huffman Algorithm encodes
the message using the following steps:
1. Begin with a queue of the frequencies of the characters sorted from least
to greatest.
2. Begin creating a binary tree by removing the two smallest frequency values
from the queue. Add them to create a parent node.
3. Add the parent node back into the queue.
4. Repeat Steps 2–3, keeping the queue sorted, until only a single node, the
root, remains.
[Figure 3-2. A Huffman binary tree: the root (weight 6) combines an internal
node of weight 3 (over the leaves c: 1 and b: 2) with the leaf a: 3.]
Now, we can assign the left-sided edges as “0” and the right-sided edges as
“1”. Notice that letters appearing least frequently are combined into parent
nodes first, resulting in a longer path from the root node down to the respective
leaf node of the lowest frequency letter. The smaller the frequency of the letter,
the longer the path. This results in longer binary codes for less frequent letters.
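The construction steps above translate directly into a priority-queue sketch, shown below. The helper name huffman_codes is hypothetical, not from the paper; note also that tie-breaking between equal weights may mirror some 0/1 edge labels relative to Figure 3-1, which changes the codewords but not their lengths, and hence not the compressed size.

import heapq
from collections import Counter
from itertools import count

def huffman_codes(message):
    freqs = Counter(message)
    tie = count()  # tie-breaker so the heap never has to compare subtrees
    heap = [(w, next(tie), ch) for ch, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # remove the two smallest weights
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))  # parent node
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left_child, right_child)
            walk(node[0], prefix + "0")  # left edge labeled "0"
            walk(node[1], prefix + "1")  # right edge labeled "1"
        else:
            codes[node] = prefix or "0"  # a one-symbol message still needs 1 bit
    walk(heap[0][2], "")
    return codes

print(huffman_codes("ababac"))  # code lengths match Table 3-1: a gets 1 bit; b and c get 2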
Using the Huffman binary tree, we can find the binary codes for each char-
acter, as shown in the dictionary in Table 3-1. Replacing the characters in
the original message with the binary codes in the dictionary gives us a short,
compressed message C(m).
m = “01100001 01100010 01100001 01100010 01100001 01100011”
C(m) = “1 01 1 01 1 00”
• File size of C(m) = 9 bits
• File size of dictionary = 5 bits
• Total compressed file size = 14 bits
• File size of original message = 48 bits
Even for such a small and simple message, we can see that the Huffman
Algorithm is effective at compression, reducing the file to less than 30% of its
original size.
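These figures can be checked with a short sketch; encode and decode are hypothetical helper names. Decoding is unambiguous because the code is prefix-free: no codeword is a prefix of another, so the first match while scanning bits is always a complete symbol.

codes = {"a": "1", "b": "01", "c": "00"}  # the dictionary of Table 3-1

def encode(message):
    return "".join(codes[ch] for ch in message)

def decode(bits):
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # prefix-free code: the first match is a symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

bits = encode("ababac")
assert bits == "101101100" and decode(bits) == "ababac"
print(len("ababac") * 8, "bits ->", len(bits), "bits")  # 48 bits -> 9 bits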
In the following example, a longer and more complex message will be encoded
with the Huffman Algorithm.
Message = “AABBACCACACACADCFACACFCEFCCBBD”
First, the frequency of appearance of each letter is recorded:
A: 9 C: 11 E: 1 B: 4 D: 2 F: 3
1. Sort the frequencies into a queue from smallest to largest:
C: 11 A: 9 B: 4 F: 3 D: 2 E: 1
2. Remove the two smallest frequencies in the queue and add them to create
a parent node.
3. Add the parent node frequency to the queue and re-sort the frequencies
from smallest to largest. Repeat Steps 2–3 until the root node is formed.
[Figure 3-3. The completed binary tree: the root (30) combines C: 11 with a
node of weight 19; 19 combines A: 9 with 10; 10 combines B: 4 with 6; and 6
combines an internal node of weight 3 (over E: 1 and D: 2) with F: 3.]
Thus, using the Huffman binary tree, we can find the binary codes for each
letter, remembering that left and right edges are labeled “0” and “1”, respectively.
C: 0 A: 10 B: 110 F: 1111 D: 11101 E: 11100
The original message is a 240-bit string (30 characters at 8 bits each), and we
are able to compress the file to less than 40% of its original size with a simple
Huffman binary tree. Summing frequency times code length over the six letters
gives 11·1 + 9·2 + 4·3 + 3·4 + 1·5 + 2·5 = 68 bits.
• File size of C(m) = 68 bits
• File size of dictionary = 20 bits
• Total compressed file size = 88 bits
• File size of original message = 240 bits
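These totals are easy to verify mechanically with a minimal sketch built from the codes and frequencies above:

codes = {"C": "0", "A": "10", "B": "110", "F": "1111", "E": "11100", "D": "11101"}
freqs = {"A": 9, "B": 4, "C": 11, "D": 2, "E": 1, "F": 3}
compressed = sum(freqs[s] * len(code) for s, code in codes.items())  # message bits
dictionary = sum(len(code) for code in codes.values())               # dictionary bits
print(compressed, dictionary, compressed + dictionary)  # 68 20 88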
By simple observation of the Huffman binary tree in Figure 3-3, two
points can be made in order to generalize the structure of the binary tree:
1. The lowest-frequency symbol is assigned the longest code, and higher-
frequency symbols have shorter binary codes.
2. There are always two codes of the longest length, because the two lowest-
frequency symbols are merged first and therefore end up as sibling leaves
at the deepest level of the tree.
Although the Huffman Algorithm is very simple and efficient, its level of
compression can be improved upon. Higher-order Huffman trees can be
implemented in order to achieve more compact output. While the Huffman
Algorithm combines two frequencies at every step, a higher-order Huffman Algo-
rithm uses a tree data structure in which three or more frequencies are combined.
These higher-order Huffman trees are called ternary Huffman trees, quaternary
Huffman trees, and so on. An example of a ternary Huffman tree applied to
the previous example is shown in Figure 3-4.
[Figure 3-4. A ternary Huffman tree: the root (30) has children A: 9, an
internal node of weight 10, and C: 11; node 10 has children 3 (an internal node
over Null: 0, E: 1, and D: 2), F: 3, and B: 4.]
Ternary trees follow the same steps as Huffman binary trees, except three
frequencies are combined at each node. However, a ternary tree cannot be
constructed when the number of leaf nodes is even. This problem can be
solved by inserting a null node with zero frequency, as in the following steps
(a code sketch follows the list):
1. If the number of frequency values is even, then add a null node with a
value of “0”.
2. Sort the frequency values from smallest to greatest into a queue.
3. Remove the 3 smallest frequency values from the queue and add them to
create a parent node.
4. Add the parent node frequency back into the queue.
5. Re-sort the frequencies from smallest to largest.
6. Repeat steps 2–5 until the root node is created.
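The padding-and-merging steps generalize to any branching factor k; the following minimal sketch assumes that generalization, and the helper name kary_code_lengths is hypothetical. It returns only the code length of each symbol, which is enough to compute the compressed size.

import heapq

def kary_code_lengths(freqs, k=3):
    nodes = [(w, [s]) for s, w in freqs.items()]
    # Pad with zero-weight null leaves until (n - 1) % (k - 1) == 0, so that
    # every internal node, including the root, can take exactly k children.
    while (len(nodes) - 1) % (k - 1) != 0:
        nodes.append((0, []))  # null node: carries no symbols
    heapq.heapify(nodes)
    depth = {s: 0 for s in freqs}
    while len(nodes) > 1:
        merged_w, merged_syms = 0, []
        for _ in range(k):  # remove the k smallest weights
            w, syms = heapq.heappop(nodes)
            merged_w += w
            merged_syms += syms
            for s in syms:
                depth[s] += 1  # each merge pushes its symbols one level down
        heapq.heappush(nodes, (merged_w, merged_syms))
    return depth

print(kary_code_lengths({"A": 9, "B": 4, "C": 11, "D": 2, "E": 1, "F": 3}))
# {'A': 1, 'B': 2, 'C': 1, 'D': 3, 'E': 3, 'F': 2} -- matches the ternary codes below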
Each node in a ternary Huffman tree has three edges: left, middle, and right.
The left, middle, and right edges can be labeled as the symbols 0, 1, and 2. The
ternary code for each character is shown below:
A: 0 C: 2 F: 11 B: 12 E: 101 D: 102
The message can be compressed by replacing the characters with the ternary
codes. Counted in output symbols, the ternary Huffman tree is more compact
than the binary one: a total of 55 units versus 88 bits for the binary tree. Note,
however, that each unit here is a ternary digit (trit), which carries log2 3 ≈ 1.58
bits of information, so the two totals are not directly comparable.
• File size of C(m) = 43 trits (9·1 + 11·1 + 3·2 + 4·2 + 1·3 + 2·3)
• File size of dictionary = 12 trits
• Total compressed file size = 55 trits
• File size of original message = 240 bits
4 Preservation of Order: Hu-Tucker Algorithm
The Huffman Algorithm is a simple method which does not take into account
the order of the characters in the text, because the nodes are sorted with respect
to frequency. The Hu-Tucker Algorithm modifies the Huffman algorithm so
that the queue of frequency nodes does not have to be re-sorted at each step,
thus preserving the characters' original order of appearance.
The Hu–Tucker Algorithm minimizes the message length in almost the same
way that the Huffman Algorithm does, but while preserving the order of char-
acters, corresponding to the original ordering of the nodes. Like the Huffman
Algorithm, the Hu–Tucker method also merges the smallest frequency blocks
together, but with different rules based on the positioning of the nodes. This
keeps the characters in their order of appearance in the
original text.
There are two rules to follow when implementing the Hu-Tucker algorithm.
1. Two nodes in the queue can be merged only if there are no leaf (original
character) nodes between them; internal nodes created by earlier merges
do not block a merge.
2. Among all compatible pairs, the two nodes with the smallest combined
frequency are merged into a new node.
Figure 4-1 is a tree created by the Hu-Tucker algorithm. Note that the
smallest compatible nodes are always merged together before larger ones.
[Figure 4-1. A Hu-Tucker tree over the ordered leaves A: 1, B: 4, C: 2, D: 7,
E: 1, F: 1: the root (16) has children 7 and 9; node 7 combines node 5 (A + B)
with C; node 9 combines D with node 2 (E + F).]
The order of appearance is determined by the binary code given to each
character. As in the Huffman binary tree, the left and right edges leading out of
each parent node are labeled “0” and “1”, respectively. Because the leaves keep
their original left-to-right order, earlier characters receive lexicographically
smaller codewords. For instance, in our example, “A” is symbolized by “000”
and appears in the text before “E”, which is symbolized by the code “110”. As
a further example, a symbol with the code “01” appears before a symbol with
the code “100”.
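The combination rules can be tested with a naive sketch of the first (merging) phase. The helper name hu_tucker_code_lengths is hypothetical, and the sketch assumes the standard compatibility rule in which internal nodes created by earlier merges are transparent, with ties broken in favor of the leftmost pair. The merge depth of each leaf equals its code length in the final order-preserving tree; the full algorithm has a second phase that rebuilds the alphabetic tree from these lengths.

def hu_tucker_code_lengths(weights):
    # Work-list entry: (weight, is_leaf, {original index: depth below this node})
    nodes = [(w, True, {i: 0}) for i, w in enumerate(weights)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                # Compatible pair: no leaf strictly between them; internal
                # nodes from earlier merges do not block a merge.
                if any(nodes[k][1] for k in range(i + 1, j)):
                    continue
                s = nodes[i][0] + nodes[j][0]
                if best is None or s < best[0]:
                    best = (s, i, j)  # first hit on ties = leftmost pair
        s, i, j = best
        merged = {sym: d + 1 for sym, d in {**nodes[i][2], **nodes[j][2]}.items()}
        nodes[i] = (s, False, merged)
        del nodes[j]
    return [nodes[0][2][i] for i in range(len(weights))]

print(hu_tucker_code_lengths([1, 4, 2, 7, 1, 1]))
# [3, 3, 2, 2, 3, 3] -- the lengths of the codes 000, 001, 01, 10, 110, 111 in Figure 4-1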
5 New Technology using the Huffman Algorithm
Currently, lossy compression utilities such as JPEG and MPEG are widely used
by the general public. As the name “lossy” suggests, they compress
information while allowing a loss of information. However, new compression
programs for videos, images, and text have been designed using the Huffman
Algorithm, which does not allow any loss of information. The Huffman Algorithm
is easy to implement for text files, but requires modification to work with image
and video files. Two current compression programs using the Huffman Algo-
rithm are HuffYUV and ABO (Adaptive Binary Optimization). These programs
have gained much attention for the speed of their lossless compression, since lossless
compression is widely preferred to lossy compression.
HuffYUV is a video codec known for its speedy compression of videos. It
is somewhat incorrectly named, as it does not perform the Huffman algorithm
on YUV (a color scheme adapted from the traditional RGB, Red/Green/Blue
color space). HuffYUV compresses a color space called “YCbCr” which is used
in video systems, employing brightness components (Y, luma) coupled with red
and blue chroma components (CbCr).
This codec is one of the best in terms of speed and efficiency. Thus, people
watching a video compressed by HuffYUV (or any other lossless video codec)
can enjoy a high-quality video, while reveling in the fact that the video takes
up little space on the computer.
Among video codecs, there are much slower ones than HuffYUV, but they are
sometimes chosen over HuffYUV for their additional features or better
compression. One popular codec for videos is “Lagarith,” which has an encoding
speed comparable to most other codecs. Lagarith is favorable in terms of options
because it has the capability of supporting most types of color space, including
RGB, RGBA, YV12, and YUY2. Although its decoding is much slower than
that of most codecs, Lagarith makes up for the speed by allowing individual
video frames to be decoded, which lets the video be edited smoothly by cutting
and splicing frames. Huffman codecs are among the fastest encoders and
decoders to date, because of the overwhelming simplicity of binary coding.
Image compressors also utilize the Huffman Algorithm, although it is less
crucial that the speed of compression be extremely fast, since image files are
comparatively small next to video and music files. Adaptive Binary Optimiza-
tion (ABO) is a variation on the Huffman Algorithm, able to compress image
files without a loss of information.
ABO created waves in the technological world by challenging the JPEG com-
pression process. Created in Singapore, ABO is now a favored image compressor
for hospitals and document-processing companies, since it can reduce the file size
of an image without any loss of information in the final output of the decompressor.
MatrixView, the company that developed ABO, is working in partnership
with the KK Women’s and Children’s Hospital in Singapore, helping them to
archive their library of ultrasound images and videos. According to Arvind Thi-
agarajan, MatrixView’s founder, ABO can compress image files up to 32 times
the ratio of JPEG compression, and maintain a lossless stream of information
[3].
With the help of Huffman’s ideas, ABO compression could very well be the
premier technology for videos and images within the next few years. Lossless
compression is the most favorable filing method, and Huffman’s algorithm is
essential in allowing that to happen.
References
[1] Huffman, D.A., “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the I.R.E., 1952.
[2] Lii, J., “Finding Efficient Compressions: Huffman and Hu-Tucker Algorithms,” 18.310 Lecture Notes, MIT, 2004.
[3] Lui, J., “ABO aims to out-compress JPEG,” ZDNet, 2003.