Greedy Algorithms
(Huffman Coding)
Prepared by
Md. Shafiuzzaman
Lecturer, Dept. of CSE, JUST
David Huffman’s idea
• A Term paper at MIT
• Build the tree (code) bottom-up in a greedy fashion
Huffman Coding
• A technique to compress data effectively
• Usually between 20%-90% compression
• Lossless compression
• No information is lost
• When decompress, you get the original file
Original file
Compressed file
Huffman coding
Huffman Coding: Applications
• Saving space
• Store compressed files instead of original files
• Transmitting files or data
• Send compressed data to save transmission time and power
• Encryption and decryption
• Cannot read the compressed file without knowing the “key”
Original file
Compressed file
Huffman coding
Main Idea: Frequency-Based Encoding
• Assume in this file only 6 characters appear
• E, A, C, T, K, N
• The frequencies are:
Character Frequency
E 10,000
A 4,000
C 300
T 200
K 100
N 100
• Option I (No Compression)
– Each character = 1 Byte (8 bits)
– Total file size = 14,700 * 8 = 117,600 bits
• Option 2 (Fixed size compression)
– We have 6 characters, so we need
3 bits to encode them
– Total file size = 14,700 * 3 = 44,100 bits
Character Fixed Encoding
E 000
A 001
C 010
T 100
K 110
N 111
Main Idea: Frequency-Based Encoding
(Cont’d)
• Assume in this file only 6 characters appear
• E, A, C, T, K, N
• The frequencies are:
Character Frequency
E 10,000
A 4,000
C 300
T 200
K 100
N 100• Option 3 (Huffman compression)
– Variable-length compression
– Assign shorter codes to more frequent characters and longer
codes to less frequent characters
– Total file size:
Char. HuffmanEncoding
E 0
A 10
C 110
T 1110
K 11110
N 11111
(10,000 x 1) + (4,000 x 2) + (300 x 3) + (200 x 4) + (100 x
5) + (100 x 5) = 20,700 bits
Huffman Coding
• A variable-length coding for characters
• More frequent characters  shorter codes
• Less frequent characters  longer codes
• It is not like ASCII coding where all characters have the same coding
length (8 bits)
• Two main questions
• How to assign codes (Encoding process)?
• How to decode (from the compressed file, generate the original file)
(Decoding process)?
Decoding for fixed-length codes is much
easier
Character Fixed-length
Encoding
E 000
A 001
C 010
T 100
K 110
N 111
010001100110111000
010 001 100 110 111 000
Divide into 3’s
C A T K N E
Decode
Decoding for variable-length codes is not
that easy…
Character Variable-length
Encoding
E 0
A 00
C 001
… …
… …
… …
000001
It means what???
EEEC EAC AEC
Huffman encoding guarantees to avoid this
uncertainty …Always have a single decoding
Huffman Algorithm
• Step 1: Get Frequencies
• Scan the file to be compressed and count the occurrence of each character
• Sort the characters based on their frequency
• Step 2: Build Tree & Assign Codes
• Build a Huffman-code tree (binary tree)
• Traverse the tree to assign codes
• Step 3: Encode (Compress)
• Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
• Huffman tree is the key to decompress the file
Step 1: Get Frequencies
Eerie eyes seen near lake.
Char Freq. Char Freq. Char Freq.
E 1 y 1 k 1
e 8 s 2 . 1
r 2 n 2
i 1 a 2
space 4 l 1
Input File:
Step 2: Build Huffman Tree & Assign Codes
• It is a binary tree in which each character is a leaf node
• Initially each node is a separate root
• At each step
• Select two roots with smallest frequency and connect them to a new
parent (Break ties arbitrary) [The greedy choice]
• The parent will get the sum of frequencies of the two child nodes
• Repeat until you have one root
Example
Char Freq. Char Freq. Char Freq.
E 1 y 1 k 1
e 8 s 2 . 1
r 2 n 2
i 1 a 2
space 4 l 1
E
1
i
1
y
1
l
1
k
1
.
1
r
2
s
2
n
2
a
2
☐
4
e
8
Each char. has a
leaf node with its
frequency
Find the smallest two frequencies…Replace them with their parent
E
1
i
1
y
1
l
1
k
1
.
1
r
2
s
2
n
2
a
2
☐
4
e
8
E
1
i
1
2
E
1
i
1
y
1
l
1
k
1
.
1
r
2
s
2
n
2
a
2
☐
4
e
8
2
Find the smallest two frequencies…Replace them
with their parent
y
1
l
1
2
E
1
i
1
k
1
.
1
r
2
s
2
n
2
a
2
☐
4
e
8
2
y
1
l
1
2
Find the smallest two frequencies…Replace them
with their parent
k
1
.
1
2
E
1
i
1
r
2
s
2
n
2
a
2
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
Find the smallest two frequencies…Replace them
with their parent
r
2
s
2
4
E
1
i
1
n
2
a
2
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
Find the smallest two frequencies…Replace them
with their parent
n
2
a
2
4
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
Find the smallest two frequencies…Replace them
with their parent
E
1
i
1
2
y
1
l
1
2
4
E
1
i
1
☐
4
e
82
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4 4
Find the smallest two frequencies…Replace them
with their parent
☐
4
k
1
.
1
2
6
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4 4 6
Find the smallest two frequencies…Replace them
with their parent
r
2
s
2
4
n
2
a
2
4
8
E
1
i
1
☐
4
e
82
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6 8
Find the smallest two frequencies…Replace them
with their parent
E
1
i
1
☐
4
2
y
1
l
1
2
k
1
.
1
2
4
6
10
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2r
2
s
2
4
n
2
a
2
4 4
6
8 10
Find the smallest two frequencies…Replace them
with their parent
e
8
r
2
s
2
4
n
2
a
2
4
8
16
E
1
i
1
☐
4
e
82
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6
8
10 16
Find the smallest two frequencies…Replace them
with their parent
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6
8
10
16
26
Now we have a single root…This is the Huffman Tree
Lets Analyze Huffman Tree
• All characters are at the leaf nodes
• The number at the root = # of characters in the file
• High-frequency chars (E.g., “e”) are near the root
• Low-frequency chars are far from the root
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6
8
10
16
26
Lets Assign Codes
• Traverse the tree
• Any left edge  add label 0
• As right edge  add label 1
• The code for each character is its root-to-leaf label sequence
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6
8
10
16
26
• Traverse the tree
• Any left edge  add label 0
• As right edge  add label 1
• The code for each character is its root-to-leaf label sequence
E
1
i
1
☐
4
e
8
2
y
1
l
1
2
k
1
.
1
2
r
2
s
2
4
n
2
a
2
4
4
6
8
10
16
26
0
1
0
0
0
0
0
0 0
1
1
11
1
1
1
10
001 1
Lets Assign Codes
• Traverse the tree
• Any left edge  add label 0
• As right edge  add label 1
• The code for each character is its root-to-leaf label sequence
Lets Assign Codes
Char Code
E 0000
i 0001
y 0010
l 0011
k 0100
. 0101
space☐ 011
e 10
r 1100
s 1101
n 1110
a 1111
Coding Table
Huffman Algorithm
• Step 1: Get Frequencies
• Scan the file to be compressed and count the occurrence of each character
• Sort the characters based on their frequency
• Step 2: Build Tree & Assign Codes
• Build a Huffman-code tree (binary tree)
• Traverse the tree to assign codes
• Step 3: Encode (Compress)
• Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
• Huffman tree is the key to decompess the file
Generate the
encoded file
Step 3: Encode (Compress) The File
Eerie eyes seen near lake.
Input File:
Char Code
E 0000
i 0001
y 0010
l 0011
k 0100
. 0101
space☐ 011
e 10
r 1100
s 1101
n 1110
a 1111
Coding Table
+
000010 1100 000110 ….
Notice that no code is prefix to any other code 
Ensures the decoding will be unique (Unlike Slide 8)
Step 4: Decode (Decompress)
• Must have the encoded file + the coding tree
• Scan the encoded file
• For each 0  move left in the tree
• For each 1  move right
• Until reach a leaf node  Emit that character and go back to the root
000010 1100 000110 ….
+
Eerie …
Generate the
original file
Huffman Algorithm
• Step 1: Get Frequencies
• Scan the file to be compressed and count the occurrence of each character
• Sort the characters based on their frequency
• Step 2: Build Tree & Assign Codes
• Build a Huffman-code tree (binary tree)
• Traverse the tree to assign codes
• Step 3: Encode (Compress)
• Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
• Huffman tree is the key to decompess the file
The Algorithm
• An appropriate data structure is a binary min-heap
• Rebuilding the heap is lgn and n-1 extractions are made, so the complexity is
O( nlgn)
• The encoding is NOT unique, other encoding may work just as well, but none
will work better
Lab Assignment
• Example Input: Huffman coding is a data compression algorithm.
• Output:
Huffman Codes

Huffman Codes

  • 1.
    Greedy Algorithms (Huffman Coding) Preparedby Md. Shafiuzzaman Lecturer, Dept. of CSE, JUST
  • 2.
    David Huffman’s idea •A Term paper at MIT • Build the tree (code) bottom-up in a greedy fashion
  • 3.
    Huffman Coding • Atechnique to compress data effectively • Usually between 20%-90% compression • Lossless compression • No information is lost • When decompress, you get the original file Original file Compressed file Huffman coding
  • 4.
    Huffman Coding: Applications •Saving space • Store compressed files instead of original files • Transmitting files or data • Send compressed data to save transmission time and power • Encryption and decryption • Cannot read the compressed file without knowing the “key” Original file Compressed file Huffman coding
  • 5.
    Main Idea: Frequency-BasedEncoding • Assume in this file only 6 characters appear • E, A, C, T, K, N • The frequencies are: Character Frequency E 10,000 A 4,000 C 300 T 200 K 100 N 100 • Option I (No Compression) – Each character = 1 Byte (8 bits) – Total file size = 14,700 * 8 = 117,600 bits • Option 2 (Fixed size compression) – We have 6 characters, so we need 3 bits to encode them – Total file size = 14,700 * 3 = 44,100 bits Character Fixed Encoding E 000 A 001 C 010 T 100 K 110 N 111
  • 6.
    Main Idea: Frequency-BasedEncoding (Cont’d) • Assume in this file only 6 characters appear • E, A, C, T, K, N • The frequencies are: Character Frequency E 10,000 A 4,000 C 300 T 200 K 100 N 100• Option 3 (Huffman compression) – Variable-length compression – Assign shorter codes to more frequent characters and longer codes to less frequent characters – Total file size: Char. HuffmanEncoding E 0 A 10 C 110 T 1110 K 11110 N 11111 (10,000 x 1) + (4,000 x 2) + (300 x 3) + (200 x 4) + (100 x 5) + (100 x 5) = 20,700 bits
  • 7.
    Huffman Coding • Avariable-length coding for characters • More frequent characters  shorter codes • Less frequent characters  longer codes • It is not like ASCII coding where all characters have the same coding length (8 bits) • Two main questions • How to assign codes (Encoding process)? • How to decode (from the compressed file, generate the original file) (Decoding process)?
  • 8.
    Decoding for fixed-lengthcodes is much easier Character Fixed-length Encoding E 000 A 001 C 010 T 100 K 110 N 111 010001100110111000 010 001 100 110 111 000 Divide into 3’s C A T K N E Decode
  • 9.
    Decoding for variable-lengthcodes is not that easy… Character Variable-length Encoding E 0 A 00 C 001 … … … … … … 000001 It means what??? EEEC EAC AEC Huffman encoding guarantees to avoid this uncertainty …Always have a single decoding
  • 10.
    Huffman Algorithm • Step1: Get Frequencies • Scan the file to be compressed and count the occurrence of each character • Sort the characters based on their frequency • Step 2: Build Tree & Assign Codes • Build a Huffman-code tree (binary tree) • Traverse the tree to assign codes • Step 3: Encode (Compress) • Scan the file again and replace each character by its code • Step 4: Decode (Decompress) • Huffman tree is the key to decompress the file
  • 11.
    Step 1: GetFrequencies Eerie eyes seen near lake. Char Freq. Char Freq. Char Freq. E 1 y 1 k 1 e 8 s 2 . 1 r 2 n 2 i 1 a 2 space 4 l 1 Input File:
  • 12.
    Step 2: BuildHuffman Tree & Assign Codes • It is a binary tree in which each character is a leaf node • Initially each node is a separate root • At each step • Select two roots with smallest frequency and connect them to a new parent (Break ties arbitrary) [The greedy choice] • The parent will get the sum of frequencies of the two child nodes • Repeat until you have one root
  • 13.
    Example Char Freq. CharFreq. Char Freq. E 1 y 1 k 1 e 8 s 2 . 1 r 2 n 2 i 1 a 2 space 4 l 1 E 1 i 1 y 1 l 1 k 1 . 1 r 2 s 2 n 2 a 2 ☐ 4 e 8 Each char. has a leaf node with its frequency
  • 14.
    Find the smallesttwo frequencies…Replace them with their parent E 1 i 1 y 1 l 1 k 1 . 1 r 2 s 2 n 2 a 2 ☐ 4 e 8 E 1 i 1 2
  • 15.
    E 1 i 1 y 1 l 1 k 1 . 1 r 2 s 2 n 2 a 2 ☐ 4 e 8 2 Find the smallesttwo frequencies…Replace them with their parent y 1 l 1 2
  • 16.
    E 1 i 1 k 1 . 1 r 2 s 2 n 2 a 2 ☐ 4 e 8 2 y 1 l 1 2 Find the smallesttwo frequencies…Replace them with their parent k 1 . 1 2
  • 17.
    E 1 i 1 r 2 s 2 n 2 a 2 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 Find the smallesttwo frequencies…Replace them with their parent r 2 s 2 4
  • 18.
    E 1 i 1 n 2 a 2 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 Find the smallesttwo frequencies…Replace them with their parent n 2 a 2 4
  • 19.
    E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 Find the smallesttwo frequencies…Replace them with their parent E 1 i 1 2 y 1 l 1 2 4
  • 20.
    E 1 i 1 ☐ 4 e 82 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 Find thesmallest two frequencies…Replace them with their parent ☐ 4 k 1 . 1 2 6
  • 21.
    E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 Findthe smallest two frequencies…Replace them with their parent r 2 s 2 4 n 2 a 2 4 8
  • 22.
    E 1 i 1 ☐ 4 e 82 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 8 Find thesmallest two frequencies…Replace them with their parent E 1 i 1 ☐ 4 2 y 1 l 1 2 k 1 . 1 2 4 6 10
  • 23.
    E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2r 2 s 2 4 n 2 a 2 4 4 6 8 10 Findthe smallest two frequencies…Replace them with their parent e 8 r 2 s 2 4 n 2 a 2 4 8 16
  • 24.
    E 1 i 1 ☐ 4 e 82 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 8 10 16 Find thesmallest two frequencies…Replace them with their parent
  • 25.
  • 26.
    Lets Analyze HuffmanTree • All characters are at the leaf nodes • The number at the root = # of characters in the file • High-frequency chars (E.g., “e”) are near the root • Low-frequency chars are far from the root E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 8 10 16 26
  • 27.
    Lets Assign Codes •Traverse the tree • Any left edge  add label 0 • As right edge  add label 1 • The code for each character is its root-to-leaf label sequence E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 8 10 16 26
  • 28.
    • Traverse thetree • Any left edge  add label 0 • As right edge  add label 1 • The code for each character is its root-to-leaf label sequence E 1 i 1 ☐ 4 e 8 2 y 1 l 1 2 k 1 . 1 2 r 2 s 2 4 n 2 a 2 4 4 6 8 10 16 26 0 1 0 0 0 0 0 0 0 1 1 11 1 1 1 10 001 1 Lets Assign Codes
  • 29.
    • Traverse thetree • Any left edge  add label 0 • As right edge  add label 1 • The code for each character is its root-to-leaf label sequence Lets Assign Codes Char Code E 0000 i 0001 y 0010 l 0011 k 0100 . 0101 space☐ 011 e 10 r 1100 s 1101 n 1110 a 1111 Coding Table
  • 30.
    Huffman Algorithm • Step1: Get Frequencies • Scan the file to be compressed and count the occurrence of each character • Sort the characters based on their frequency • Step 2: Build Tree & Assign Codes • Build a Huffman-code tree (binary tree) • Traverse the tree to assign codes • Step 3: Encode (Compress) • Scan the file again and replace each character by its code • Step 4: Decode (Decompress) • Huffman tree is the key to decompess the file
  • 31.
    Generate the encoded file Step3: Encode (Compress) The File Eerie eyes seen near lake. Input File: Char Code E 0000 i 0001 y 0010 l 0011 k 0100 . 0101 space☐ 011 e 10 r 1100 s 1101 n 1110 a 1111 Coding Table + 000010 1100 000110 …. Notice that no code is prefix to any other code  Ensures the decoding will be unique (Unlike Slide 8)
  • 32.
    Step 4: Decode(Decompress) • Must have the encoded file + the coding tree • Scan the encoded file • For each 0  move left in the tree • For each 1  move right • Until reach a leaf node  Emit that character and go back to the root 000010 1100 000110 …. + Eerie … Generate the original file
  • 33.
    Huffman Algorithm • Step1: Get Frequencies • Scan the file to be compressed and count the occurrence of each character • Sort the characters based on their frequency • Step 2: Build Tree & Assign Codes • Build a Huffman-code tree (binary tree) • Traverse the tree to assign codes • Step 3: Encode (Compress) • Scan the file again and replace each character by its code • Step 4: Decode (Decompress) • Huffman tree is the key to decompess the file
  • 34.
    The Algorithm • Anappropriate data structure is a binary min-heap • Rebuilding the heap is lgn and n-1 extractions are made, so the complexity is O( nlgn) • The encoding is NOT unique, other encoding may work just as well, but none will work better
  • 35.
    Lab Assignment • ExampleInput: Huffman coding is a data compression algorithm. • Output: