2. Text Compression
On a computer: changing the representation
of a file so that it takes less space to store
and/or less time to transmit.
– original file can be reconstructed exactly from the
compressed representation.
different from data compression in general
– text compression has to be lossless.
– compare with sound and images: small changes
and noise are tolerated.
3. First Approach
Consider the word ABRACADABRA.
What is the most economical way to write this
string in a binary representation?
Generally speaking, if a text consists of N
different characters, we need ⌈log₂ N⌉ bits to
represent each one using a fixed-length
encoding.
Thus, it would require 3 bits for each of the 5
different letters, or 33 bits for the 11 letters.
Can we do better?
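The fixed-length count can be checked with a short Python sketch (the helper name is mine, not the slides'):

```python
import math

def fixed_length_bits(text: str) -> int:
    """Bits needed to store text with a fixed-length code."""
    n_symbols = len(set(text))                   # N distinct characters
    bits_per_symbol = math.ceil(math.log2(n_symbols))
    return bits_per_symbol * len(text)

print(fixed_length_bits("ABRACADABRA"))          # 5 letters -> 3 bits each, 11 * 3 = 33
```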
4. Yes!!!!
We can do better, provided:
– Some characters are more frequent than others.
– Codewords may have different bit lengths, so that, for
example, in the English alphabet the letter a may use
only one or two bits, while the letter y may use
several.
– We have a unique way of decoding the bit stream.
5. Using Variable-Length Encoding (1)
Magic word: ABRACADABRA
LET A = 0
B = 100
C = 1010
D = 1011
R = 11
Thus, ABRACADABRA = 01001101010010110100110
So 11 letters demand 23 bits < 33 bits, an
improvement of about 30%.
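A quick Python check of this encoding (the code table is transcribed from the slide; the helper name is mine):

```python
# Variable-length code table from the slide
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text: str, code: dict) -> str:
    """Concatenate the codeword of each character."""
    return "".join(code[ch] for ch in text)

bits = encode("ABRACADABRA", CODE)
print(bits)       # 01001101010010110100110
print(len(bits))  # 23 bits, versus 33 with fixed-length codewords
```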
6. Using Variable-Length Encoding (2)
However, there is a serious danger: How to ensure
unique reconstruction?
Let A = 01 and B = 0101
How to decode 010101?
AB?
BA?
AAA?
No problem…
if we use prefix codes: no codeword is a prefix of
another codeword.
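The prefix property is easy to test mechanically; a minimal Python sketch (the function name is mine):

```python
def is_prefix_code(codewords) -> bool:
    """True iff no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "100", "1010", "1011", "11"]))  # True
print(is_prefix_code(["01", "0101"]))                      # False: "01" is a prefix of "0101"
```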
7. Prefix Codes (1)
Any prefix code can be represented by a full
binary tree.
Each leaf stores a symbol.
Each internal node has two children – the left
branch means 0, the right means 1.
codeword = path from the root to the leaf,
reading the left and right branches as
0 and 1.
8. Prefix Codes (2)
ABRACADABRA
A=0
B = 100
C = 1010
D = 1011
R = 11
Decoding is unique and simple!
Read the bit stream from left to
right, starting from the root;
whenever a leaf is reached,
write down its symbol and
return to the root.
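This decoding rule can be sketched in Python. Instead of an explicit tree, the sketch matches the growing bit path against the inverted code table, which is equivalent for a prefix code:

```python
CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
# Inverting the table is safe because the code is prefix-free.
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(stream: str, table: dict) -> str:
    out, path = [], ""
    for bit in stream:       # mimic walking the tree: extend the path...
        path += bit
        if path in table:    # ...until a leaf (a complete codeword) is reached
            out.append(table[path])
            path = ""        # return to the root
    return "".join(out)

print(decode("01001101010010110100110", DECODE))  # ABRACADABRA
```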
9. Prefix Codes (3)
Let fᵢ be the frequency of the i-th symbol and
dᵢ the number of bits required for the i-th
symbol (= the depth of this symbol in the tree), 1 ≤ i ≤ n.
How do we find the optimal coding tree, which
minimizes the cost of the tree C = ∑ fᵢdᵢ (sum over i = 1, …, n)?
– Frequent characters should have short
codewords
– Rare characters should have long codewords
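For the ABRACADABRA code, the cost works out as follows (frequencies counted from the word, depths taken from the codewords on slide 8):

```python
freq  = {"A": 5, "B": 2, "C": 1, "D": 1, "R": 2}   # symbol counts in ABRACADABRA
depth = {"A": 1, "B": 3, "C": 4, "D": 4, "R": 2}   # codeword lengths = depths in tree

cost = sum(freq[s] * depth[s] for s in freq)       # C = sum of f_i * d_i
print(cost)  # 23 -- the encoded length found earlier
```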
10. Huffman’s Idea
From the definition of the cost of the tree, it is clear that
the two symbols with the smallest frequencies must be at the
bottom of the optimal tree, as children of the lowest internal
node.
This suggests building the optimal code in a bottom-up
manner!
Huffman’s idea is a greedy approach based on the
observations above.
Repeat until all nodes merged into one tree:
– Remove two nodes with the lowest frequencies.
– Create a new internal node, with the two just-removed nodes as
children (either node can be either child) and the sum of their
frequencies as the new frequency.
11. Constructing a Huffman Code (1)
Assume that frequencies of symbols are:
– A: 40 B: 20 C: 10 D: 10 R: 20
Smallest numbers are 10 and 10 (C and D), so
connect them
12. Constructing a Huffman Code (2)
C and D have already been
used, and the new node
above them (call it C+D) has
value 20
The smallest values are B,
C+D, and R, all of which
have value 20
– Connect any two of these
Note that the algorithm
does not construct a unique
tree; but even if we had
chosen the other possible
connection, the resulting
code would be optimal too!
13. Constructing a Huffman Code (3)
The smallest value is R, while A and B+C+D have
value 40.
Connect R to either of the others.
14. Constructing a Huffman Code(4)
Connect the final two nodes, labelling every left
branch 0 and every right branch 1.
15. Algorithm
X is the set of symbols, whose
frequencies are known in advance.
Q is a min-priority queue,
implemented as a binary heap.
[Pseudocode figure: HUFFMAN performs n − 1 iterations, each
extracting the two minimum-frequency nodes from Q and
inserting their merged parent back into Q.]
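A runnable sketch of HUFFMAN using Python's heapq as the min-priority queue Q (the tie-breaking counter and all names are my additions, not the slides'):

```python
import heapq
from itertools import count

def huffman_codes(freq: dict) -> dict:
    """Build a Huffman code table from a {symbol: frequency} dict."""
    ids = count()  # tie-breaker so heap tuples stay comparable
    # Q holds (frequency, id, tree); a leaf is a bare symbol,
    # an internal node is a (left, right) pair.
    q = [(f, next(ids), sym) for sym, f in freq.items()]
    heapq.heapify(q)
    for _ in range(len(freq) - 1):       # n - 1 merge steps
        f1, _, left = heapq.heappop(q)   # remove the two nodes with
        f2, _, right = heapq.heappop(q)  # the lowest frequencies...
        heapq.heappush(q, (f1 + f2, next(ids), (left, right)))  # ...and merge them
    _, _, tree = q[0]
    codes = {}
    def walk(node, path):
        if isinstance(node, str):        # leaf: record its codeword
            codes[node] = path or "0"    # degenerate one-symbol alphabet
        else:                            # left branch means 0, right means 1
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(tree, "")
    return codes

freq = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
codes = huffman_codes(freq)
cost = sum(freq[s] * len(codes[s]) for s in freq)
print(cost)  # 220: ties can give different trees, but the cost is always optimal
```

Ties in the heap mean the exact tree (and hence the codewords) may differ from the one on the slides, but the total cost is the same.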
16. What about Complexity?
Building the heap Q needs O(n).
The loop runs n − 1 times; each iteration performs
two EXTRACT-MIN operations and one INSERT,
each needing O(log n),
so the loop needs O(n log n).
Thus, the algorithm needs O(n log n).
17. Algorithm’s Correctness
It is proven that the greedy algorithm HUFFMAN is correct, as the
problem of determining an optimal prefix code exhibits the greedy-
choice and optimal-substructure properties.
Greedy Choice: Let C be an alphabet in which each character c ∈ C has
frequency f[c]. Let x and y be two characters in C having the lowest
frequencies. Then there exists an optimal prefix code for C in which
the codewords for x and y have the same length and differ only in the
last bit.
Optimal Substructure: Let C be a given alphabet with frequency f[c]
defined for each character c ∈ C. Let x and y be two characters in C with
minimum frequency. Let C′ be the alphabet C with characters x, y
removed and a (new) character z added, so that C′ = C – {x, y} ∪ {z};
define f for C′ as for C, except that f[z] = f[x] + f[y]. Let T′ be any tree
representing an optimal prefix code for the alphabet C′. Then the tree
T, obtained from T′ by replacing the leaf node for z with an internal
node having x and y as children, represents an optimal prefix code for
the alphabet C.
18. Last Remarks
• Huffman codes are widely used in applications that
involve the compression and transmission of digital
data, such as fax machines, modems, and computer
networks.
• Huffman encoding is practical if:
– The encoded string is large relative to the code table
(because the code table has to be included with the
message, unless it is widely known), or
– We agree on the code table in advance.
• For example, it’s easy to find a table of letter frequencies for
English (or any other alphabet-based language).