1. Lecture Notes on Huffman Coding
for
Open Educational Resource
on
Data Compression (CA209)
by
Dr. Piyush Charan
Assistant Professor
Department of Electronics and Communication Engg.
Integral University, Lucknow
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
2. Coding or Encoding
• All information in computer science is encoded as strings
of 1s and 0s. A central objective of information theory is to
transmit information using the fewest possible bits, in such a
way that every encoding is unambiguous. This tutorial
discusses fixed-length and variable-length encoding,
along with Huffman encoding, which underlies many data
compression schemes.
• Encoding, in computing, is the process of representing a
sequence of characters for efficient transmission or storage.
Fixed-length and variable-length codes are two types of encoding
schemes, explained as follows:
02 February 2021 Dr. Piyush Charan, Dept. of ECE, Integral University, Lucknow 2
3. Fixed and Variable Length Codes
• Fixed-length encoding - Every character is assigned a binary code using the same
number of bits. Thus, a string like “aabacdad” requires 64 bits (8 bytes) for
storage or transmission, assuming each character uses 8 bits.
• Variable-length encoding - In contrast to fixed-length encoding, this scheme
uses a variable number of bits to encode each character, depending on its
frequency in the given text. In the string “aabacdad”, the frequencies of the
characters ‘a’, ‘b’, ‘c’ and ‘d’ are 4, 1, 1 and 2 respectively. Since ‘a’ occurs most
frequently, it gets the fewest bits, followed by ‘d’, then ‘b’ and
‘c’. Suppose we assign binary codes to each character as follows:
• a = 0
• b = 011
• c = 111
• d = 11
• Thus, the string “aabacdad” gets encoded to 00011011111011 (0 | 0 | 011 | 0 | 111 |
11 | 0 | 11), using far fewer bits (14) than the 64 of the fixed-length scheme.
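The encoding above can be sketched in a few lines of Python (the code table is taken directly from the example; the helper name encode is ours):

```python
# Variable-length code table from the example above (an assumption-free copy).
codes = {"a": "0", "b": "011", "c": "111", "d": "11"}

def encode(text, table):
    """Concatenate the codeword of each character."""
    return "".join(table[ch] for ch in text)

bits = encode("aabacdad", codes)
print(bits, len(bits))  # 00011011111011 14
```

That is 14 bits against the 64 bits of the fixed-length scheme. Note that this particular table is not prefix-free (0 is a prefix of 011, and 11 is a prefix of 111), so decoding it unambiguously is a separate concern, which later slides address.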
15. Huffman Algorithm
• It is a lossless data compression algorithm.
• We assign variable-length codes to input characters, whose
lengths depend on the frequencies of the characters.
• The variable-length codes assigned to input characters are
prefix codes (no codeword is a prefix of another).
16. Types of Coding
• Different types of codes:
– fixed-length code: each codeword uses the same number
of bits.
– variable-length code: codewords may use different
numbers of bits.
17. Information and Entropy
• Information theory is concerned with data compression and
transmission; it builds on probability theory and underpins
parts of machine learning.
• Information provides a way to quantify the amount of surprise
for an event measured in bits.
• Entropy provides a measure of the average amount of
information needed to represent an event drawn from a
probability distribution for a random variable.
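These two definitions can be put into code directly (a sketch of our own, reusing the character frequencies of “aabacdad” from the earlier slide as the example distribution):

```python
import math

def information(p):
    """Surprise of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Average information over a probability distribution, in bits."""
    return sum(p * information(p) for p in probs if p > 0)

# Frequencies of 'a', 'b', 'c', 'd' in "aabacdad" are 4, 1, 1, 2 out of 8.
print(entropy([4/8, 1/8, 1/8, 2/8]))  # 1.75 bits per symbol
```

An entropy of 1.75 bits per symbol is the theoretical lower bound on the average code length for this distribution, which variable-length codes try to approach.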
18. What Is Information Theory?
• Information theory is a field of study concerned with quantifying
information for communication.
• It is a subfield of mathematics and is concerned with topics like data
compression and the limits of signal processing. The field was
proposed and developed by Claude Shannon while working at
Bell Labs in the United States.
• Information theory is concerned with representing data in a compact
fashion (a task known as data compression or source coding), as
well as with transmitting and storing it in a way that is robust to
errors (a task known as error correction or channel coding).
19. • A foundational concept from information theory is the
quantification of the amount of information in things like events,
random variables, and distributions.
• Quantifying the amount of information requires the use of
probabilities, hence the relationship of information theory to
probability.
• Measurements of information are widely used in artificial
intelligence and machine learning, such as in the construction
of decision trees and the optimization of classifier models.
20. Huffman Algorithm
• Step 1: Create a leaf node for each unique character and build a min heap
of all leaf nodes.
• Step 2: Extract the two nodes with minimum frequency from the min heap.
• Step 3: Create a new internal node with frequency equal to the sum of the
two nodes' frequencies. Make the first extracted node its left child and
the other extracted node its right child. Add this node back to the min heap.
• Step 4: Repeat Steps 2 and 3 until the heap contains only one node. The
remaining node is the root, and the tree is complete.
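The four steps above can be sketched with Python's heapq module (a minimal sketch of our own; the integer tie-breaker in each heap entry keeps comparisons away from the tree nodes when frequencies are equal):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Steps 1-4: build a min heap of leaves, repeatedly merge the two
    minimum-frequency nodes, then read codes off the finished tree."""
    freq = Counter(text)                       # Step 1: one leaf per unique character
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)                            # tie-breaker so tuples never compare nodes
    while len(heap) > 1:                       # Step 4: repeat until one node remains
        f1, _, left = heapq.heappop(heap)      # Step 2: two minimum-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))  # Step 3: new internal node
        tie += 1
    codes = {}
    def walk(node, prefix):                    # left edge = 0, right edge = 1
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"        # degenerate single-character input
    walk(heap[0][2], "")
    return codes

print(huffman_codes("ABRACADABRA"))
```

The exact codewords depend on how ties are broken, but any correct Huffman code for "ABRACADABRA" encodes it in 23 bits in total and is prefix-free.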
21. Huffman Algorithm
• Let's see how “ABRACADABRA” translates into this sequence of 0s
and 1s:
• 01011011010000101001011011010
Example 2: encoding trie (figure). The leaves hold the characters C, A, R, D and B; each left edge is labeled 0 and each right edge 1, giving the codes A = 010, B = 11, C = 00, D = 10, R = 011.
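The bit string above can be checked mechanically against the codes read off the trie (A = 010, B = 11, C = 00, D = 10, R = 011, as listed on a later slide):

```python
# Codes read off the example encoding trie.
codes = {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}

encoded = "".join(codes[ch] for ch in "ABRACADABRA")
print(encoded)  # 01011011010000101001011011010
```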
22. File Compression
• Text files are usually stored by representing each character with an 8-bit
ASCII code (type man ascii in a Unix shell to see the ASCII encoding)
• The ASCII encoding is an example of fixed-length encoding, where each
character is represented with the same number of bits.
• In order to reduce the space required to store a text file, we can exploit the
fact that some characters are more likely to occur than others.
• Variable-length encoding uses binary codes of different lengths for different
characters. Thus, we can assign fewer bits to frequently used characters
and more bits to rarely used characters.
23. File Compression: Example
• An Encoding Example
– Text: java
– encoding: a=“0”, j=“11”, v=“10”
– encoded text: 110100 (6 bits)
• How do we decode? (the problem of ambiguity)
– Encoding: a=“0”, j=“01”, v=“00”
– encoded text: 010000 (6 bits)
– could be "java", or "jvv", or "jaaaa"
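A short brute-force enumerator (our own sketch) makes the ambiguity concrete by listing every way "010000" can be split under this code:

```python
# The ambiguous code from the example above.
codes = {"a": "0", "j": "01", "v": "00"}

def parses(bits):
    """Return every possible decoding of bits, trying each codeword at the front."""
    if not bits:
        return [""]
    results = []
    for ch, word in codes.items():
        if bits.startswith(word):
            results += [ch + rest for rest in parses(bits[len(word):])]
    return results

print(sorted(parses("010000")))  # ['jaaaa', 'jaav', 'java', 'jvaa', 'jvv']
```

Five different strings encode to the same six bits, so this code cannot be decoded unambiguously.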
24. Encoding Tree
• To prevent ambiguities in decoding, we require that the encoding
satisfies the prefix rule: no code is a prefix of another.
• a=“0”, j=“11”, v=“10” satisfies the prefix rule
• a=“0”, j=“01”, v=“00” does not satisfy the prefix rule (the code of
‘a’ is a prefix of the codes of ‘j’ and ‘v’)
• Note: if the codes satisfy the prefix rule, decoding is
unambiguous; otherwise, decoding can be ambiguous.
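Checking the prefix rule takes only a few lines (the helper name is_prefix_free is our own):

```python
def is_prefix_free(codes):
    """True if no codeword is a prefix of another codeword."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(is_prefix_free({"a": "0", "j": "11", "v": "10"}))  # True
print(is_prefix_free({"a": "0", "j": "01", "v": "00"}))  # False: "0" prefixes "01" and "00"
```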
25. Encoding Tree contd…
• We use an encoding trie to satisfy this prefix rule
– the characters are stored at the external nodes.
– a left child (edge) means 0
– a right child (edge) means 1
Encoding trie (figure): the leaves hold C, A, R, D and B; each left edge descending from the root is labeled 0 and each right edge 1. The resulting codes are:
A = 010
B = 11
C = 00
D = 10
R = 011
26. Example of Decoding
• To decode each character, we trace a path from the root of the trie to a leaf, following 0 for a left edge and 1 for a right edge; the leaf gives the character.
Encoding trie (figure, same as before): A = 010, B = 11, C = 00, D = 10, R = 011.
• Encoded text: 01011011010000101001011011010
• Text: ABRACADABRA
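The decoding can be sketched by accumulating bits until they match a codeword, which is equivalent to the root-to-leaf trace through the trie; it only works because the code is prefix-free (a sketch of ours, reusing the codes above):

```python
# Codes from the example encoding trie.
codes = {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}

def decode(bits, table):
    """Accumulate bits until they form a full codeword, then emit that character."""
    inverse = {v: k for k, v in table.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # full codeword matched: a leaf was reached
            out.append(inverse[buf])
            buf = ""                # restart from the root of the trie
    return "".join(out)

print(decode("01011011010000101001011011010", codes))  # ABRACADABRA
```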