2. An optimization problem is one in which
you want to find, not just a solution, but
the best solution
A “greedy algorithm” sometimes works
well for optimization problems
A greedy algorithm works in phases. At
each phase:
◦ You take the best you can get right
now, without regard for future
consequences
◦ You hope that by choosing a local
optimum at each step, you will end up
at a global optimum
3. Suppose you want to count out a certain
amount of money, using the fewest possible
bills and coins
A greedy algorithm to do this would be:
At each step, take the largest possible bill or
coin that does not overshoot
◦ Example: To make $6.39, you can choose:
a $5 bill
a $1 bill, to make $6
a 25¢ coin, to make $6.25
A 10¢ coin, to make $6.35
four 1¢ coins, to make $6.39
For US money, the greedy algorithm always
gives the optimum solution
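This greedy rule can be sketched in Python (a minimal illustration; denominations are in cents and the function name is ours, not from the slides):

```python
# Greedy change-making: at each step, take the largest bill or coin
# that does not overshoot the remaining amount.
def greedy_change(amount_cents, denominations=(500, 100, 25, 10, 5, 1)):
    result = []
    for d in denominations:          # denominations in decreasing order
        while amount_cents >= d:     # take d as long as it fits
            result.append(d)
            amount_cents -= d
    return result

# $6.39 -> a $5 bill, a $1 bill, 25 cents, 10 cents, four 1-cent coins
print(greedy_change(639))  # -> [500, 100, 25, 10, 1, 1, 1, 1]
```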
4. In some (fictional) monetary system, “krons”
come in 1 kron, 7 kron, and 10 kron coins
Using a greedy algorithm to count out 15
krons, you would get
◦ A 10 kron piece
◦ Five 1 kron pieces, for a total of 15 krons
◦ This requires six coins
A better solution would be to use two 7
kron pieces and one 1 kron piece
◦ This only requires three coins
The greedy algorithm results in a solution,
but not in an optimal solution
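Both coin counts can be checked with a short Python sketch that compares the greedy rule against an exhaustive dynamic-programming answer (the helper names are illustrative):

```python
# Greedy: always take the largest kron coin that fits.
def greedy_coins(amount, denoms=(10, 7, 1)):
    coins = []
    for d in denoms:
        while amount >= d:
            coins.append(d)
            amount -= d
    return coins

# Dynamic programming: best[v] = the fewest coins summing to v.
def min_coins(amount, denoms=(10, 7, 1)):
    best = [0] + [None] * amount
    for v in range(1, amount + 1):
        best[v] = min(best[v - d] + 1
                      for d in denoms
                      if d <= v and best[v - d] is not None)
    return best[amount]

print(len(greedy_coins(15)))  # -> 6   (10 + 1 + 1 + 1 + 1 + 1)
print(min_coins(15))          # -> 3   (7 + 7 + 1)
```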
5. In general, greedy algorithms have five
components:
A candidate set, from which a solution is
created
A selection function, which chooses the
best candidate to be added to the solution
A feasibility function, that is used to
determine if a candidate can be used to
contribute to a solution
An objective function, which assigns a
value to a solution, or a partial solution,
and
A solution function, which will indicate
when we have discovered a complete solution
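These five components can be sketched as a generic loop (a hypothetical Python helper; the names `select`, `feasible`, and `is_solution` are our own, and the objective function here is simply the running sum):

```python
def greedy(candidates, select, feasible, is_solution):
    """Generic greedy skeleton: repeatedly pick the best remaining
    candidate, keep it only if it is feasible, stop when complete."""
    solution = []
    candidates = list(candidates)
    while candidates and not is_solution(solution):
        best = select(candidates)      # selection function
        candidates.remove(best)
        if feasible(solution, best):   # feasibility function
            solution.append(best)
    return solution

# Coin counting as an instance: the candidate set holds enough copies
# of each denomination, and the objective value of a partial solution
# is sum(solution).
target = 639
sol = greedy(
    sorted([500, 100, 25, 10, 5, 1] * 10, reverse=True),
    select=max,
    feasible=lambda s, c: sum(s) + c <= target,
    is_solution=lambda s: sum(s) == target,
)
print(sol)  # -> [500, 100, 25, 10, 1, 1, 1, 1]
```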
6. Huffman coding is a technique for
compressing data. Huffman's greedy
algorithm looks at the frequency of each
character and encodes it as a binary string in
an optimal way.
7. Suppose we have a file consisting of 100,000
characters that we want to compress. The
characters in the file occur with the following
frequencies.
8. Consider the problem of designing a "binary character code"
in which each character is represented by a unique binary
string.
This method requires 300,000 bits to code the entire file.
How do we get 300,000?
9. A fixed-length code needs 3 bits to represent
each of the six (6) characters.
The total number of characters is 45,000 + 13,000
+ 12,000 + 16,000 + 9,000 + 5,000 =
100,000.
Since each character is assigned a 3-bit codeword,
=> 3 * 100,000 = 300,000 bits.
10. A fixed-length code requires 300,000 bits
while a variable-length code requires 224,000 bits.
=> A saving of approximately 25%.
11. A prefix code is one in which no codeword is a prefix of any
other codeword. The reason prefix codes are desirable is that they
simplify encoding (compression) and decoding.
Can we do better?
A variable-length code can do better by giving frequent
characters short codewords and infrequent characters long
codewords.
12. Character 'a' occurs 45,000 times.
Each occurrence of 'a' is assigned a 1-bit codeword.
1 * 45,000 = 45,000 bits.
13. Characters (b, c, d) occur 13,000 + 12,000 +
16,000 = 41,000 times.
Each is assigned a 3-bit codeword.
3 * 41,000 = 123,000 bits.
Characters (e, f) occur 9,000 + 5,000 = 14,000 times.
Each is assigned a 4-bit codeword.
4 * 14,000 = 56,000 bits.
This implies that the total is: 45,000 + 123,000
+ 56,000 = 224,000 bits.
14. Concatenate the codewords representing each character of the file.
Example
From the variable-length code
table, we code the 3-character
file abc as the concatenation of
the codewords for a, b, and c.
15. Since no codeword is a prefix of any other, the codeword that
begins an encoded file is unambiguous.
To decode (translate back to the original characters), identify the
initial codeword, remove it from the encoded file, and repeat.
For example, with the "variable-length codeword" table, the string
001011101 parses uniquely as 0.0.101.1101, which decodes to
aabe.
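This decoding loop can be sketched in Python, using only the codewords that appear in the example above (a = 0, b = 101, e = 1101; the remaining codewords are omitted here as in the text):

```python
# Decode a prefix code by repeatedly matching the unique codeword
# at the front of the bit string.
def decode(bits, codebook):
    out = []
    while bits:
        for ch, code in codebook.items():
            if bits.startswith(code):   # unambiguous: prefix property
                out.append(ch)
                bits = bits[len(code):]
                break
        else:
            raise ValueError("undecodable prefix: " + bits)
    return "".join(out)

codebook = {"a": "0", "b": "101", "e": "1101"}
print(decode("001011101", codebook))  # -> "aabe"
```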
16. The decoding process can be represented by a binary tree whose
leaves are characters.
We interpret the binary codeword for a character as the path from
the root to that character, where 0 means "go to the left child"
and 1 means "go to the right child".
Note that an optimal code for a file is always represented by a
full binary tree, in which every internal node has two children.
17. A binary tree that is not full cannot correspond to an optimal
prefix code.
Proof: Let T be a binary tree corresponding to a prefix code such
that T is not full. Then there must exist an internal node, say x,
such that x has only one child, y. To obtain another binary tree T`,
simply merge x and y into a single node z, where z is a child of
the parent of x (if a parent exists) and z is a parent to any
children of y. Then T` has the same leaves as T, and each leaf has
the same depth as in T, except for the leaves in the subtree rooted
at y in T: these leaves have depth in T` strictly less (by one) than
their depth in T. Hence T` corresponds to a code on the same
alphabet with strictly smaller cost, which implies T cannot
correspond to an optimal prefix code.
18. A fixed-length code is not optimal, since its binary tree is not full.
An optimal prefix code corresponds to a full binary tree.
19. If C is the alphabet from which the characters are drawn, then the tree for an
optimal prefix code has exactly |C| leaves (one for each letter) and exactly |C| - 1
internal nodes.
Given a tree T corresponding to a prefix code, we can compute the number of bits
required to encode a file.
For each character c in C, let f(c) be the frequency of c and let dT(c) denote the
depth of c's leaf. Note that dT(c) is also the length of c's codeword. The number of
bits required to encode the file is
20. B(T) = Σ f(c) dT(c), summed over all c in C,
which we define as the cost of the tree T.
For example, the cost of the above tree is
B(T) = Σ f(c) dT(c)
= 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4
= 224
Therefore, the cost of the tree corresponding to the optimal prefix code is 224
thousand bits (224 * 1,000 = 224,000).
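The cost formula can be checked directly against the frequencies and codeword lengths above (a small Python sketch):

```python
# B(T) = sum over characters c of f(c) * dT(c), with frequencies
# in thousands and depths equal to the codeword lengths above.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
depth = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

cost = sum(freq[ch] * depth[ch] for ch in freq)
print(cost)  # -> 224, i.e. 224,000 bits for the 100,000-character file
```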
21. A greedy algorithm that constructs an optimal prefix code is called a Huffman
code. The algorithm builds the tree T corresponding to the optimal code in a
bottom-up manner. It begins with a set of |C| leaves and performs |C| - 1 "merging"
operations to create the final tree.
Data structure used: priority queue Q
HUFFMAN(C)
n = |C|
Q = C
for i = 1 to n - 1
    z = ALLOCATE-NODE()
    x = left[z] = EXTRACT-MIN(Q)
    y = right[z] = EXTRACT-MIN(Q)
    f[z] = f[x] + f[y]
    INSERT(Q, z)
return EXTRACT-MIN(Q)
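One possible Python rendering of this pseudocode uses `heapq` as the priority queue Q (the nested-tuple tree representation and the tie-breaking counter are assumptions of this sketch, not part of the original algorithm):

```python
import heapq

def huffman(freqs):
    """freqs: dict mapping character -> frequency. Returns the tree root."""
    # Heap entries are (frequency, tiebreak, tree) so ordering is by frequency.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)                   # Q = C, built in O(n)
    count = len(heap)
    for _ in range(len(freqs) - 1):       # |C| - 1 merging operations
        fx, _, x = heapq.heappop(heap)    # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(heap)    # y = EXTRACT-MIN(Q)
        heapq.heappush(heap, (fx + fy, count, (x, y)))  # f[z] = f[x] + f[y]
        count += 1
    return heap[0][2]                     # the single remaining node: the root

def codewords(tree, prefix=""):
    """Read codewords off the tree: 0 = left child, 1 = right child."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    codes = codewords(left, prefix + "0")
    codes.update(codewords(right, prefix + "1"))
    return codes

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = codewords(huffman(freqs))
print(sum(freqs[ch] * len(codes[ch]) for ch in freqs))  # -> 224
```

On the example frequencies this reproduces the optimal cost of 224 (thousand bits) computed earlier, though the particular codewords may differ from the slides' table when ties are broken differently.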
22. Q is implemented as a binary heap.
Line 2 can be performed using BUILD-HEAP in O(n) time.
The FOR loop executes n - 1 times, and each heap operation requires O(lg n)
time
=> the FOR loop contributes (n - 1) * O(lg n)
=> O(n lg n)
Thus the total running time of HUFFMAN on a set of n characters is O(n lg n).
23. Proof Idea
Step 1: Show that this problem satisfies the greedy choice property, that is, if a
greedy choice is made by Huffman's algorithm, an optimal solution remains
possible.
Step 2: Show that this problem has an optimal substructure property, that is, an
optimal solution for Huffman's algorithm contains optimal solutions to
subproblems.
Step 3: Conclude correctness of Huffman's algorithm using step 1 and step 2.
24. Let C be an alphabet in which each character c has frequency f[c]. Let x and
y be two characters in C having the lowest frequencies. Then there exists an
optimal prefix code for C in which the codewords for x and y have the same
length and differ only in the last bit.
Proof:
Let b and c be sibling leaves of maximum depth in an optimal tree T.
Without loss of generality assume that f[b] ≤ f[c] and f[x] ≤ f[y].
Since f[x] and f[y] are the two lowest leaf frequencies, in order, and f[b] and
f[c] are arbitrary leaf frequencies, in order, we have f[x] ≤ f[b] and f[y] ≤ f[c].
Exchange the positions of the leaves to get first T` and then T``. By the formula
B(T) = Σ f(c) dT(c), the difference in cost between T and T` is