Greedy Algorithms
Huffman Coding
Huffman coding is a compression technique used to reduce the size of data or a message.
(It is a lossless data compression technique.)
Computer Data Encoding:
How do we represent data in binary?
Fixed-length codes: encode every symbol by a unique binary string of a fixed length.
Example: ASCII (an 8-bit code), the American Standard Code for Information Interchange.
ASCII Example:
ABCA
A        B        C        A
01000001 01000010 01000011 01000001
ASCII Example:
A message sent using ASCII codes; each ASCII code is 8 bits.
A = 65, B = 66, C = 67, D = 68, E = 69
Suppose we have the message BCCABBDDAECCBBAEDDCC.
Total space usage in bits:
Assume an l-bit fixed-length code. For a file of n characters, we need nl bits.
ASCII Example:
A        B        C        D        E
01000001 01000010 01000011 01000100 01000101
For the 20-character message BCCABBDDAECCBBAEDDCC, the total is 8 × 20 = 160 bits (8 bits per character).
Fixed-Length Codes
Idea: to save space, use fewer bits.
There are 20 characters, so the message costs 20 × 3 = 60 bits.
We must also send the code table so the receiver can decode the message. The total cost of the message is:

Character  Frequency  Code
A          3          000
B          5          001
C          6          010
D          4          011
E          2          100
Fixed-Length Codes
Idea: to save space, use fewer bits.
Encoded message: 20 × 3 = 60 bits
Table, original characters: 5 × 8 = 40 bits
Table, new codes for the 5 characters: 5 × 3 = 15 bits
Total: 60 + 40 + 15 = 115 bits (message plus table)

Character  Frequency  Code
A          3          000
B          5          001
C          6          010
D          4          011
E          2          100
Variable-Length Codes
Idea: to save space, use fewer bits for frequent characters and more bits for rare characters.
The variable-length codes assigned to the input characters are prefix codes: the codes (bit sequences) are assigned in such a way that the code assigned to one character is not a prefix of the code assigned to any other character.
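The prefix property can be checked mechanically. A minimal sketch in Python (the function name is our own):

```python
def is_prefix_free(codes):
    """Return True if no code in the collection is a prefix of another."""
    codes = sorted(codes)  # any prefix of a string sorts immediately before it
    return all(not codes[i + 1].startswith(codes[i])
               for i in range(len(codes) - 1))

print(is_prefix_free(["0", "10", "11"]))       # True: prefix-free
print(is_prefix_free(["00", "01", "0", "1"]))  # False: "0" is a prefix of "00"
```

Sorting makes the check linear in the number of codes: only adjacent pairs in sorted order can be in a prefix relationship.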
Variable-Length Codes
Idea: to save space, use fewer bits for frequent characters and more bits for rare characters.
Example: suppose an alphabet of 3 symbols, { A, B, C }, and a file of 1,000,000 characters.
A fixed-length code needs 2 bits per character, for a total of 2,000,000 bits.
Variable-Length Codes: Example
Suppose the frequency distribution of the characters is:

Character  A        B    C
Frequency  999,000  500  500
Code       0        10   11

Note that the code of A is of length 1, and the codes for B and C are of length 2.
Encode:
Fixed code: 1,000,000 × 2 = 2,000,000 bits
Variable code: 999,000 × 1 + 500 × 2 + 500 × 2 = 1,001,000 bits
Total space usage in bits: a savings of almost 50%.
How do we decode?
In the fixed-length case, we know where every character starts, since they all have the same number of bits.
Example: A = 00, B = 01, C = 10
000000010110101001100100001010
A  A  A  B  B  C  C  C  B  C  B  A  A  C  C
How do we decode?
In the variable-length case, we use an idea called a prefix code, where no code is a prefix of another.
Example: A = 0, B = 10, C = 11
None of the above codes is a prefix of another.
Prefix Code
Let us understand prefix codes with a counterexample. Let there be four characters a, b, c, and d, with variable-length codes 00, 01, 0, and 1 respectively.
This coding leads to ambiguity because the code assigned to c is a prefix of the codes assigned to a and b. If the compressed bit stream is 0001, the decompressed output may be cccd, ccb, acd, or ab.
How do we decode?
Example: A = 0, B = 10, C = 11
So, for the string A A A B B C C C B C B A A C C, the encoding is:
0 0 0 10 10 11 11 11 10 11 10 0 0 11 11
Prefix Code
Example: A = 0, B = 10, C = 11
Decode the string 0001010111111101110001111:
A A A B B C C C B C B A A C C
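This greedy decoding can be sketched in a few lines of Python (the function name is our own):

```python
def decode(bits, code_table):
    """Decode a bit string using a prefix-free code table {symbol: code}."""
    inverse = {code: sym for sym, code in code_table.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # prefix-freeness makes this match unambiguous
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

codes = {"A": "0", "B": "10", "C": "11"}
print(decode("0001010111111101110001111", codes))  # AAABBCCCBCBAACC
```

Because no code is a prefix of another, the accumulated buffer matches at most one code at each step, so the greedy scan never has to backtrack.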
Requirement:
Construct a variable-length code for a given file with the following properties:
1. It is a prefix code.
2. It uses the shortest possible codes.
3. It is efficient.
There are two major parts in Huffman coding:
1. Build a Huffman tree from the input characters.
2. Traverse the Huffman tree and assign codes to the characters.
Huffman Tree
Steps to Build a Huffman Tree:
1. Create a leaf node for each unique character and build a min-heap of all leaf nodes.
2. Extract the two nodes with the minimum frequency from the min-heap.
3. Create a new internal node with frequency equal to the sum of the two nodes' frequencies. Make the first extracted node its left child and the other its right child, and add the new node to the min-heap.
4. Repeat steps 2 and 3 until the heap contains only one node.
After the tree is complete, assign 0 to each left edge and 1 to each right edge throughout the tree.
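The steps above can be sketched with Python's heapq module as the min-heap (the structure and names are our own; ties between equal frequencies may be broken differently than in the worked examples, so individual codes can differ while the total cost stays optimal):

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from {symbol: frequency} and return {symbol: code}."""
    tiebreak = count()  # unique counter so heap never compares tree nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two minimum-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):         # internal node: 0 left, 1 right
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"     # single-symbol alphabet edge case
        return codes
    return assign(heap[0][2], "")

print(huffman_codes({"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}))
```

With these frequencies, the weighted cost sum(freq[s] * len(code[s])) comes to 510 bits, matching the worked example below, even though the particular codes may differ from the slides' tree.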
Idea
Consider a binary tree, with:
0 meaning a left turn
1 meaning a right turn.
(Tree diagram: each left edge is labeled 0 and each right edge 1; the leaves are A, B, C, and D.)
Huffman Tree Example:
Alphabet: A, B, C, D, E, F
Frequency table:

Character  A   B   C   D   E   F
Frequency  10  20  30  40  50  60

Total file length: 210
Algorithm Run:
Start: A 10, B 20, C 30, D 40, E 50, F 60
Step 1: extract A (10) and B (20); merge into X (30). Heap: C 30, X 30, D 40, E 50, F 60
Step 2: extract X (30) and C (30); merge into Y (60). Heap: D 40, E 50, F 60, Y 60
Step 3: extract D (40) and E (50); merge into Z (90). Heap: F 60, Y 60, Z 90
Step 4: extract Y (60) and F (60); merge into W (120). Heap: Z 90, W 120
Step 5: extract Z (90) and W (120); merge into the root V (210).
Finally, label each left edge 0 and each right edge 1 throughout the tree.
The Huffman Encoding:
A: 1000
B: 1001
C: 101
D: 00
E: 01
F: 11
File size: 10×4 + 20×4 + 30×3 + 40×2 + 50×2 + 60×2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits
Note the savings:
The Huffman code requires 510 bits for the file.
A fixed-length code needs 3 bits for 6 characters; the file has 210 characters, for a total of 630 bits.
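The 510-versus-630 comparison is just a weighted sum of code lengths; a quick check (variable names are our own):

```python
freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
huff = {"A": "1000", "B": "1001", "C": "101", "D": "00", "E": "01", "F": "11"}

# Huffman cost: each character contributes (frequency x code length) bits.
huffman_bits = sum(freq[s] * len(huff[s]) for s in freq)
# Fixed-length cost: 3 bits per character suffice for 6 symbols.
fixed_bits = sum(freq.values()) * 3

print(huffman_bits)  # 510
print(fixed_bits)    # 630
```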
Example: Construct a Huffman code for the following data and calculate the cost of the tree.

Character  A   B  C   D   E
Frequency  12  4  45  16  23
The Huffman Encoding:
Merging the two smallest nodes repeatedly: A (4) + B (12) → W (16); W (16) + D (16) → X (32); E (23) + X (32) → Y (55); Y (55) + C (45) → Z (100), the root.
A: 0001
B: 0000
C: 1
D: 001
E: 01
File size: 4×4 + 12×4 + 45×1 + 16×3 + 23×2 = 16 + 48 + 45 + 48 + 46 = 203 bits
Example: Construct a Huffman code for the following data, calculate the cost of the tree, and decode the string 1101000010001.

Character    A     B     C     D     E     F
Probability  0.35  0.12  0.21  0.05  0.18  0.09
Example: Construct a Huffman code for the following message, whose character frequencies are given below, and decode the string 001110001010000010, which was encoded using the Huffman code.

Character  A   B   C  D   E   F  G
Frequency  23  10  3  21  20  6  17
Example: Construct a Huffman code for the following message and decode the string 100010111001010, which was encoded using the Huffman code.

Character    A    B    C    D     E
Probability  0.4  0.1  0.2  0.15  0.15
The Huffman Encoding:
Merging the two smallest nodes repeatedly: B (0.1) + D (0.15) → X (0.25); E (0.15) + C (0.2) → Y (0.35); X (0.25) + Y (0.35) → W (0.6); W (0.6) + A (0.4) → Z (1), the root.
A: 1
B: 000
C: 011
D: 001
E: 010
How do we decode the string 100010111001010?
With the codes A = 1, B = 000, C = 011, D = 001, E = 010, read left to right and match each complete code: 1 → A, 000 → B, 1 → A, 011 → C, 1 → A, 001 → D, 010 → E, giving the message ABACADE.
Huffman Tree Complexity:
Since extractMin( ) calls minHeapify( ), each extraction takes O(log n) time.
Each iteration leaves one less subtree; initially there are n subtrees.
Total: O(n log n) time.
Advantages of Huffman Encoding
1) This encoding scheme saves a lot of storage space, since the binary codes generated are variable in length.
2) It generates shorter binary codes for symbols/characters that appear more frequently in the input string.
3) The binary codes generated are prefix-free.
Disadvantages of Huffman Encoding
1) Lossless techniques like Huffman encoding are suitable only for encoding text and program files, and are unsuitable for encoding digital images.
2) Huffman encoding is a relatively slow process, since it uses two passes: one for building the statistical model and another for encoding. Thus, lossless techniques that use Huffman encoding are considerably slower than others.
3) Since the binary codes vary in length, it is difficult for the decoding software to detect whether the encoded data is corrupt. This can result in incorrect decoding and, consequently, wrong output.
Real-Life Applications of Huffman Encoding
1) Huffman encoding is widely used in compression formats such as GZIP, PKZIP (WinZip), and BZIP2.
2) Multimedia formats such as PNG and MP3 use Huffman encoding (more precisely, prefix codes).
