2. Huffman Coding
• Developed by David Huffman in 1952
• Published in a research paper he wrote as a student at MIT
• Used in digital data transmission
– Fax machines
– Modems
– Computer networks
– Video compression
3. “This is gopher”
• In 8-bit ASCII binary it would look like this:
– "0101010001101000011010010111001100100000011010010111001100100000
011001110110111101110000011010000110010101110010"
– That is 14 characters at 8 bits each, or
– 14 * 8 = 112 bits to send the message
4. Introduction
• Huffman coding is a compression technique.
• In normal text, not all characters occur with the
same frequency!
• Yet all characters are allocated the same amount
of space
– 1 char = 1 byte, be it e or x
5. The Basic Algorithm
• Code word lengths are no longer fixed, as they are in
ASCII
• Code word lengths vary and are shorter for the
more frequently used characters
6. The Basic Algorithm
1. Scan text to be compressed and tally
occurrence of all characters.
2. Sort or prioritize characters based on
number of occurrences in text.
3. Build Huffman code tree based on
prioritized list.
4. Perform a traversal of tree to determine
all code words.
5. Scan text again and create the new file
using the Huffman codes (a runnable sketch
of all five steps follows below).
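A minimal sketch of these five steps in Python (illustrative code, not from the slides; heapq serves as the priority queue):

    import heapq
    from collections import Counter

    def build_huffman_codes(text):
        # 1. Tally occurrences of all characters.
        freq = Counter(text)

        # 2. Prioritize: a min-heap keyed on frequency
        #    (lowest count = highest priority). The integer
        #    tie-breaker keeps heapq from comparing nodes.
        heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
        heapq.heapify(heap)

        # 3. Build the tree: repeatedly merge the two
        #    lowest-frequency nodes into one.
        count = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            count += 1
            heapq.heappush(heap, (f1 + f2, count, (left, right)))
        _, _, root = heap[0]

        # 4. Traverse the tree: left edge = 0, right edge = 1;
        #    a code word is complete only at a leaf.
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):      # internal node
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                            # leaf: a character
                codes[node] = prefix or "0"  # lone-symbol edge case
        walk(root, "")

        # 5. Rescan the text and emit the new code words.
        encoded = "".join(codes[ch] for ch in text)
        return codes, encoded

    codes, bits = build_huffman_codes("Eerie eyes seen near lake.")
    print(len(bits), "bits")   # 84 bits, vs. 8 * 26 = 208 in ASCII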
7. Building a Tree
Scan the original text
• Consider the following short text
Eerie eyes seen near lake.
• Count up the occurrences of all characters in the
text
8. Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What characters are present?
E e r i space
y s n a l k .
9. Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What is the frequency of each character in the
text?
Char   Freq.   Char   Freq.   Char   Freq.
E      1       y      1       k      1
e      8       s      2       .      1
r      2       n      2
i      1       a      2
space  4       l      1
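These counts are easy to verify, for instance with Python's collections.Counter:

    from collections import Counter

    freq = Counter("Eerie eyes seen near lake.")
    print(freq["e"], freq[" "], freq["r"])   # 8 4 2, matching the table
    print(sum(freq.values()))                # 26 characters in total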
10. Building a Tree
Prioritize characters
• Create binary tree nodes with character and
frequency of each character
• Place nodes in a priority queue
– The lower the occurrence, the higher the priority
in the queue
11. Building a Tree
• The queue after inserting all nodes
E  i  y  l  k  .  r  s  n  a  sp  e
1  1  1  1  1  1  2  2  2  2  4   8
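A heapq-based sketch of this queue (ties among equal frequencies may pop in a different order than shown on the slide):

    import heapq
    from collections import Counter

    # Build (frequency, character) pairs and heapify them.
    heap = [(f, ch) for ch, f in Counter("Eerie eyes seen near lake.").items()]
    heapq.heapify(heap)

    # Popping returns lowest frequency first (highest priority).
    while heap:
        f, ch = heapq.heappop(heap)
        print(repr(ch), f)   # the six 1s, then the four 2s, then space (4), then 'e' (8)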
33. Building a Tree
[Tree diagram: the subtrees of weight 10 and 16 are combined into a single tree whose root has weight 26.]
• After enqueueing this node there is only
one node left in the priority queue.
34. Building a Tree
• Dequeue the single node left in the queue.
• This tree contains the new code words for each
character.
[Tree diagram: the complete Huffman tree, root weight 26]
• The frequency of the root node should equal the
number of characters in the text:
"Eerie eyes seen near lake." has 26 characters.
35. Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree
to obtain the new code words
• Going left adds a 0; going right adds a 1
• A code word is complete only
when a leaf node is reached
[Tree diagram: the complete Huffman tree, with left branches labeled 0 and right branches labeled 1]
36. Encoding the File
Traverse Tree for Codes
Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
[Tree diagram: same tree as on the previous slide]
37. Encoding the File
• Rescan text and encode file
using new code words
Eerie eyes seen near lake.
000010110000011001110
001010110101111011010
111001111101011111100
011001111110100100101
• Why is there no need for a
separator character? (Because no code word
is a prefix of any other, the bit stream
decodes unambiguously.)
38. Encoding the File
Results
• ASCII would take 8 * 26 =
208 bits
• Huffman coding takes 84 bits to encode the text:
000010110000011001110
001010110101111011010
111001111101011111100
011001111110100100101
• If a fixed-length code of 5 bits per character were
used instead, the total would be 5 * 26 = 130 bits;
the savings over that are not as great.
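The 84-bit total is simply the frequency-weighted sum of code word lengths, which a couple of lines can confirm (the dictionaries restate the tables above):

    freq  = {'E': 1, 'e': 8, 'r': 2, 'i': 1, ' ': 4, 'y': 1,
             's': 2, 'n': 2, 'a': 2, 'l': 1, 'k': 1, '.': 1}
    codes = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
             'k': '0100', '.': '0101', ' ': '011',  'e': '10',
             'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}
    # Total bits = sum over characters of (frequency * code length).
    print(sum(freq[ch] * len(codes[ch]) for ch in freq))   # 84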
39. Decoding the File
• How does the receiver know what the
codes are?
• Once the receiver has the tree, it scans the
incoming bit stream
• 0: go left
• 1: go right
[Tree diagram: the complete Huffman tree, as before]
000010110000011001110
001010110101111011010
111001111101011111100
011001111110100100101
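Because no code word is a prefix of another, a decoder can simulate this tree walk with just the code table; a minimal illustrative sketch:

    def decode(bits, codes):
        """Walk the bit stream; emit a character whenever a full code word is seen."""
        inverse = {code: ch for ch, code in codes.items()}
        out, word = [], ""
        for b in bits:            # 0 = go left, 1 = go right
            word += b
            if word in inverse:   # a leaf: the code word is complete
                out.append(inverse[word])
                word = ""         # restart at the root
        return "".join(out)

    codes = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
             'k': '0100', '.': '0101', ' ': '011',  'e': '10',
             'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}
    bits = ("000010110000011001110" "001010110101111011010"
            "111001111101011111100" "011001111110100100101")
    print(decode(bits, codes))   # Eerie eyes seen near lake.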
44. Huffman Coding Example-1
• 4. Repeat this step until there is only one
tree:
Choose the two trees with the smallest weights; call these
trees T1 and T2. Create a new tree whose root has a weight
equal to the sum of the weights T1 + T2, whose left
subtree is T1, and whose right subtree is T2.
• 5. The single tree left after the previous step is an
optimal encoding tree.
51. Huffman Coding Example-2
• Character (or symbol) frequencies
– A: 20% (.20)
• e.g., ‘A’ occurs 20 times in a 100-character document, 1000
times in a 5000-character document, etc.
– B: 9% (.09)
– C: 15% (.15)
– D: 11% (.11)
– E: 40% (.40)
– F: 5% (.05)
• Also works if you use character counts
• Must know the frequency of every character in the
document
52. Huffman Coding Example-2
[Diagram: the six symbols with their frequencies: A .20, B .09, C .15, D .11, E .40, F .05]
• Here are the symbols and their associated frequencies.
• Now we combine the two least frequent symbols (those
with the smallest frequencies) into a new symbol
string with the combined frequency.
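A few lines of Python trace the same sequence of merges (labels may come out in a different order than on the slides, but the weights match):

    import heapq

    heap = [(0.20, 'A'), (0.09, 'B'), (0.15, 'C'),
            (0.11, 'D'), (0.40, 'E'), (0.05, 'F')]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, s1 = heapq.heappop(heap)   # the two smallest weights
        f2, s2 = heapq.heappop(heap)
        f = round(f1 + f2, 2)          # round away float noise
        print(f"combine {s1} ({f1}) + {s2} ({f2}) -> {s1 + s2} ({f})")
        heapq.heappush(heap, (f, s1 + s2))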
55. Huffman Coding Example-2
[Tree diagram: root ABCDEF (1.0) with left child ABCDF (.6) and right child E (.4); ABCDF splits into BFD (.25) and AC (.35); BFD into BF (.14) and D (.11); BF into B (.09) and F (.05); AC into A (.20) and C (.15); each left branch is labeled 0 and each right branch 1]
• Now we assign 0s/1s to each branch
• Codes (reading from top to bottom)
– A: 010
– B: 0000
– C: 011
– D: 001
– E: 1
– F: 0001
• Note
– None are prefixes of another
• Decode this.
– 0100111100010000
– What’s the first character? The second?
• Try decoding right to left
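A tiny sketch using the codes above makes the exercise mechanical to check:

    codes = {'A': '010', 'B': '0000', 'C': '011',
             'D': '001', 'E': '1',    'F': '0001'}
    inverse = {c: ch for ch, c in codes.items()}
    out, word = [], ""
    for b in "0100111100010000":
        word += b
        if word in inverse:   # a full code word: emit it and restart
            out.append(inverse[word])
            word = ""
    print("".join(out))       # the decoded message; its first character answers the question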
56. Huffman Coding : Limitations
• Knowledge of source statistics is rarely available
in practice:
– In file compression, files can be from a wide variety
of applications
– Each source exhibits different statistics
• We need a source coding algorithm that does not
depend on source statistics!
58. Last words
• The best algorithms compress text to 75% of its
original size, but humans can compress it to 10%
• Humans have far better modeling algorithms
because they have better pattern recognition and
higher-level patterns to recognize
• Intelligence = pattern recognition = data
compression?