3. Huffman Coding
• A technique to compress data effectively
• Typically achieves 20%–90% compression
• Lossless compression
  • No information is lost
  • When you decompress, you get back the original file
[Diagram: Original file → Huffman coding → Compressed file]
4. Huffman Coding: Applications
• Saving space
  • Store compressed files instead of the original files
• Transmitting files or data
  • Send compressed data to save transmission time and power
• Encryption and decryption
  • The compressed file cannot be read without knowing the "key"
[Diagram: Original file → Huffman coding → Compressed file]
5. Main Idea: Frequency-Based Encoding
• Assume only 6 characters appear in this file: E, A, C, T, K, N
• The frequencies are:

  Character | Frequency
  E         | 10,000
  A         | 4,000
  C         | 300
  T         | 200
  K         | 100
  N         | 100

• Option 1 (No compression)
  – Each character = 1 byte (8 bits)
  – Total file size = 14,700 × 8 = 117,600 bits
• Option 2 (Fixed-length compression)
  – We have 6 characters, so 3 bits are enough to encode them
  – Total file size = 14,700 × 3 = 44,100 bits

  Character | Fixed Encoding
  E         | 000
  A         | 001
  C         | 010
  T         | 100
  K         | 110
  N         | 111
6. Main Idea: Frequency-Based Encoding (Cont'd)
• Same file, same 6 characters (E, A, C, T, K, N) with the frequencies above
• Option 3 (Huffman compression)
  – Variable-length compression
  – Assign shorter codes to more frequent characters and longer codes to less frequent characters

  Character | Huffman Encoding
  E         | 0
  A         | 10
  C         | 110
  T         | 1110
  K         | 11110
  N         | 11111

  – Total file size:
    (10,000 × 1) + (4,000 × 2) + (300 × 3) + (200 × 4) + (100 × 5) + (100 × 5) = 20,700 bits
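The size comparison above can be sketched in a few lines of Python (the frequencies and codes are taken directly from the tables on these slides):

```python
# Sketch: compare the three encoding options from the slides.
freqs = {"E": 10_000, "A": 4_000, "C": 300, "T": 200, "K": 100, "N": 100}
total_chars = sum(freqs.values())  # 14,700

# Option 1: 8 bits per character (no compression)
option1 = total_chars * 8  # 117,600 bits

# Option 2: fixed 3-bit codes (3 bits are enough for 6 symbols)
option2 = total_chars * 3  # 44,100 bits

# Option 3: the variable-length Huffman codes from the slide
huffman = {"E": "0", "A": "10", "C": "110", "T": "1110", "K": "11110", "N": "11111"}
option3 = sum(freqs[c] * len(code) for c, code in huffman.items())  # 20,700 bits

print(option1, option2, option3)
```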
7. Huffman Coding
• A variable-length coding for characters
  • More frequent characters → shorter codes
  • Less frequent characters → longer codes
• Unlike ASCII coding, where all characters have the same code length (8 bits)
• Two main questions:
  • How to assign codes (the encoding process)?
  • How to decode, i.e., generate the original file from the compressed file (the decoding process)?
8. Decoding Fixed-Length Codes Is Much Easier

  Character | Fixed-Length Encoding
  E         | 000
  A         | 001
  C         | 010
  T         | 100
  K         | 110
  N         | 111

010001100110111000
→ Divide into groups of 3: 010 001 100 110 111 000
→ Decode: C A T K N E
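The fixed-length decoding above is just chunking: split the bit string into equal 3-bit groups and look each group up. A minimal sketch:

```python
# Sketch: decode a fixed-length code by splitting the bit string
# into equal-size chunks (3 bits each for this 6-symbol alphabet).
codes = {"E": "000", "A": "001", "C": "010", "T": "100", "K": "110", "N": "111"}
decode_table = {bits: char for char, bits in codes.items()}

bitstring = "010001100110111000"
chunks = [bitstring[i:i + 3] for i in range(0, len(bitstring), 3)]
message = "".join(decode_table[c] for c in chunks)
print(message)  # CATKNE
```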
9. Decoding Variable-Length Codes Is Not That Easy…

  Character | Variable-Length Encoding
  E         | 0
  A         | 00
  C         | 001
  …         | …

000001
What does it mean? It could be EEEC, EAC, or AEC…
Huffman encoding guarantees to avoid this ambiguity: there is always a single decoding.
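The ambiguity can be demonstrated by enumerating every way the bit string splits into codewords. With the non-prefix-free code from this slide, "000001" has three parses; a prefix-free (Huffman) code would always yield exactly one:

```python
def parses(bits, codebook, decoded=""):
    """Return every possible decoding of `bits` under `codebook`."""
    if not bits:
        return [decoded]
    results = []
    for char, code in codebook.items():
        if bits.startswith(code):  # this codeword could come next
            results += parses(bits[len(code):], codebook, decoded + char)
    return results

# The problematic code from the slide: "0" is a prefix of "00" and "001".
ambiguous = {"E": "0", "A": "00", "C": "001"}
print(sorted(parses("000001", ambiguous)))  # ['AEC', 'EAC', 'EEEC']
```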
10. Huffman Algorithm
• Step 1: Get Frequencies
  • Scan the file to be compressed and count the occurrences of each character
  • Sort the characters based on their frequencies
• Step 2: Build Tree & Assign Codes
  • Build a Huffman-code tree (a binary tree)
  • Traverse the tree to assign codes
• Step 3: Encode (Compress)
  • Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
  • The Huffman tree is the key to decompressing the file
11. Step 1: Get Frequencies
Input file: Eerie eyes seen near lake.

  Char  | Freq.
  E     | 1
  e     | 8
  r     | 2
  i     | 1
  space | 4
  y     | 1
  s     | 2
  n     | 2
  a     | 2
  l     | 1
  k     | 1
  .     | 1
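Step 1 is a single counting pass; in Python, `collections.Counter` does exactly this:

```python
# Sketch of Step 1: count character occurrences in the input file.
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)

# Sort by frequency (ascending), as a starting point for tree building.
for char, count in sorted(freq.items(), key=lambda kv: kv[1]):
    print(repr(char), count)
```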
12. Step 2: Build Huffman Tree & Assign Codes
• It is a binary tree in which each character is a leaf node
• Initially, each node is a separate root
• At each step:
  • Select the two roots with the smallest frequencies and connect them to a new parent, breaking ties arbitrarily [the greedy choice]
  • The parent gets the sum of the frequencies of its two children
• Repeat until only one root remains
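The repeated "take the two smallest roots" step maps naturally onto a binary min-heap. A minimal sketch, assuming nodes are represented as characters (leaves) or `(left, right)` pairs (internal nodes); the tie-breaking counter is just one way to keep heap entries comparable:

```python
# Sketch of Step 2: build the Huffman tree with a binary min-heap (heapq).
import heapq
from collections import Counter
from itertools import count

def build_tree(text):
    freq = Counter(text)
    tie = count()  # breaks ties arbitrarily when frequencies are equal
    heap = [(f, next(tie), char) for char, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # smallest frequency
        f2, _, right = heapq.heappop(heap)  # second smallest
        # The new parent gets the sum of its children's frequencies.
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    return heap[0]

total, _, root = build_tree("Eerie eyes seen near lake.")
print(total)  # 26: the root holds the number of characters in the file
```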
13. Example
• Start from the frequency table: each character gets its own leaf node labeled with its frequency
[Diagram: twelve separate leaf nodes — E:1, i:1, y:1, l:1, k:1, .:1, r:2, s:2, n:2, a:2, space:4, e:8]
14. Find the two smallest frequencies… Replace them with their parent
[Diagram: the leaves E:1 and i:1 are merged under a new parent with frequency 2; the remaining roots are y:1, l:1, k:1, .:1, r:2, s:2, n:2, a:2, space:4, e:8]
26. Let's Analyze the Huffman Tree
• All characters are at the leaf nodes
• The number at the root = the number of characters in the file (26)
• High-frequency characters (e.g., 'e') are near the root
• Low-frequency characters are far from the root
[Diagram: the complete Huffman tree with root 26 and leaves E:1, i:1, y:1, l:1, k:1, .:1, r:2, s:2, n:2, a:2, space:4, e:8]
27. Let's Assign Codes
• Traverse the tree
  • Each left edge → label 0
  • Each right edge → label 1
• The code for each character is its root-to-leaf label sequence
[Diagram: the same Huffman tree, before edge labels are added]
28. Let's Assign Codes (Cont'd)
• Traverse the tree
  • Each left edge → label 0
  • Each right edge → label 1
• The code for each character is its root-to-leaf label sequence
[Diagram: the same Huffman tree with every left edge labeled 0 and every right edge labeled 1]
29. Let's Assign Codes (Cont'd)
• Traverse the tree
  • Each left edge → label 0
  • Each right edge → label 1
• The code for each character is its root-to-leaf label sequence

Coding Table:
  Char  | Code
  E     | 0000
  i     | 0001
  y     | 0010
  l     | 0011
  k     | 0100
  .     | 0101
  space | 011
  e     | 10
  r     | 1100
  s     | 1101
  n     | 1110
  a     | 1111
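The traversal itself can be sketched as a short recursive function. The toy tree below is hand-built for illustration (it is not the 26-character tree from the slides); nodes are characters (leaves) or `(left, right)` pairs:

```python
# Sketch: assign codes by walking the tree; a left edge appends "0",
# a right edge appends "1"; a leaf records its root-to-leaf label sequence.
def assign_codes(node, prefix="", table=None):
    if table is None:
        table = {}
    if isinstance(node, tuple):          # internal node
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    else:                                # leaf character
        table[node] = prefix
    return table

# Toy tree: E and A share a subtree; e hangs directly off the root.
tree = (("E", "A"), "e")
print(assign_codes(tree))  # {'E': '00', 'A': '01', 'e': '1'}
```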
30. Huffman Algorithm
• Step 1: Get Frequencies
  • Scan the file to be compressed and count the occurrences of each character
  • Sort the characters based on their frequencies
• Step 2: Build Tree & Assign Codes
  • Build a Huffman-code tree (a binary tree)
  • Traverse the tree to assign codes
• Step 3: Encode (Compress)
  • Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
  • The Huffman tree is the key to decompressing the file
31. Step 3: Encode (Compress) the File
Input file: Eerie eyes seen near lake.
Apply the coding table from Slide 29 to generate the encoded file:
0000 10 1100 0001 10 …
Notice that no code is a prefix of any other code → this ensures the decoding will be unique (unlike the code on Slide 9).
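Encoding is then a single substitution pass, using the coding table from Slide 29:

```python
# Sketch of Step 3: replace each character by its code from the coding table.
codes = {"E": "0000", "i": "0001", "y": "0010", "l": "0011",
         "k": "0100", ".": "0101", " ": "011", "e": "10",
         "r": "1100", "s": "1101", "n": "1110", "a": "1111"}

text = "Eerie eyes seen near lake."
encoded = "".join(codes[c] for c in text)
print(encoded[:16])  # 0000101100000110 -> "Eerie" (0000 10 1100 0001 10)
print(len(encoded))  # 84 bits, down from 26 * 8 = 208 bits uncompressed
```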
32. Step 4: Decode (Decompress)
• You must have the encoded file plus the coding tree
• Scan the encoded file:
  • For each 0 → move left in the tree
  • For each 1 → move right
  • When you reach a leaf node → emit that character and go back to the root
0000 10 1100 0001 10 … → generate the original file: Eerie …
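The bit-by-bit tree walk can be sketched as follows. Here the tree is rebuilt from the coding table, with internal nodes as dicts and leaves as characters; that representation is an assumption of this sketch, not the only possibility:

```python
# Sketch of Step 4: decode by walking the tree bit by bit.
def tree_from_codes(codes):
    root = {}
    for char, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = char            # last bit points at the leaf
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]                 # 0 -> left child, 1 -> right child
        if isinstance(node, str):        # reached a leaf
            out.append(node)
            node = root                  # go back to the root
    return "".join(out)

codes = {"E": "0000", "e": "10", "r": "1100", "i": "0001"}
print(decode("0000101100000110", tree_from_codes(codes)))  # Eerie
```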
33. Huffman Algorithm
• Step 1: Get Frequencies
  • Scan the file to be compressed and count the occurrences of each character
  • Sort the characters based on their frequencies
• Step 2: Build Tree & Assign Codes
  • Build a Huffman-code tree (a binary tree)
  • Traverse the tree to assign codes
• Step 3: Encode (Compress)
  • Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
  • The Huffman tree is the key to decompressing the file
34. The Algorithm
• An appropriate data structure is a binary min-heap
• Each heap operation costs O(lg n), and n − 1 merge steps (extractions and insertions) are made, so the complexity is O(n lg n)
• The encoding is NOT unique: other encodings may work just as well (e.g., with ties broken differently), but none will work better