Huffman coding
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 2
Optimal codes - I
- A code is optimal if it has the shortest average codeword length L
  $$L = \sum_{i=1}^{m} p_i l_i$$
- This can be seen as an optimization problem
  $$\min_{l_1, \dots, l_m} \sum_{i=1}^{m} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{m} D^{-l_i} \le 1$$

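The two quantities in this optimization problem can be checked numerically. The sketch below is only an illustration: the function names and the four-symbol source are assumptions, not part of the slides.

```python
def kraft_sum(lengths, D=2):
    """Return sum_i D**(-l_i); the Kraft inequality requires this to be <= 1."""
    return sum(D ** (-l) for l in lengths)


def average_length(probs, lengths):
    """Return the average codeword length L = sum_i p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))


# Hypothetical example source (not taken from the slides).
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

print(kraft_sum(lengths))              # 1.0
print(average_length(probs, lengths))  # 1.75
```

A set of lengths whose Kraft sum exceeds 1, such as [1, 1, 2] for a binary code, cannot belong to any instantaneous code.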
Optimal codes - II
- Let's make two simplifying assumptions
  - no integer constraint on the codelengths
  - the Kraft inequality holds with equality
- Lagrange-multiplier problem
  $$J = \sum_{i=1}^{m} p_i l_i + \lambda \left( \sum_{i=1}^{m} D^{-l_i} - 1 \right)$$
  $$\frac{\partial J}{\partial l_j} = 0 \;\Rightarrow\; p_j - \lambda D^{-l_j} \log D = 0 \;\Rightarrow\; D^{-l_j} = \frac{p_j}{\lambda \log D}$$
  (here $\log D$ is the natural logarithm of D)

Optimal codes - III
- Substitute $D^{-l_j} = \frac{p_j}{\lambda \log D}$ into the Kraft inequality
  $$\sum_{i=1}^{m} \frac{p_i}{\lambda \log D} = 1 \;\Rightarrow\; \lambda = \frac{1}{\log D} \;\Rightarrow\; p_i = D^{-l_i}$$
  that is
  $$l_i^* = -\log_D p_i$$
- Note that
  $$L^* = \sum_{i=1}^{m} p_i l_i^* = -\sum_{i=1}^{m} p_i \log_D p_i = H_D(X)$$
  which is the entropy, when we use base D for logarithms

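As a numerical check of the result above, the following sketch computes the ideal (non-integer) lengths $l_i^* = -\log_D p_i$ and verifies that their average equals the entropy. The function names and the example source are assumptions made for illustration.

```python
import math


def ideal_lengths(probs, D=2):
    """Non-integer optimal lengths l_i* = -log_D(p_i)."""
    return [-math.log(p, D) for p in probs]


def entropy(probs, D=2):
    """Entropy H_D(X) = -sum_i p_i * log_D(p_i); zero-probability symbols are skipped."""
    return -sum(p * math.log(p, D) for p in probs if p > 0)


# Hypothetical example source (not taken from the slides).
probs = [0.5, 0.25, 0.125, 0.125]
lengths = ideal_lengths(probs)
avg = sum(p * l for p, l in zip(probs, lengths))

# With the ideal lengths the average length coincides with the entropy.
print(math.isclose(avg, entropy(probs)))
```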
Optimal codes - IV
- In practice the codeword lengths must be integers, so the result just obtained is a lower bound
- Theorem
  The expected length L of any instantaneous D-ary code for a r.v. X satisfies
  $$L \ge H_D(X)$$
  This fundamental result derives from the work of Shannon

Optimal codes - V
- What about the upper bound?
- Theorem
  Given a source alphabet (i.e. a r.v.) of entropy $H(X)$, it is possible to find an instantaneous binary code whose average length L satisfies
  $$H(X) \le L < H(X) + 1$$
- A similar theorem can be stated if we use the wrong probabilities $\{q_i\}$ instead of the true ones $\{p_i\}$; the only difference is a term which accounts for the relative entropy

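One standard way to see that the upper bound is reachable is to round the ideal lengths up, $l_i = \lceil -\log_2 p_i \rceil$ (often called Shannon code lengths). The slides do not state this construction, so treat the sketch below, including its function name and example probabilities, as an assumption used for illustration.

```python
import math


def shannon_lengths(probs):
    """Integer lengths l_i = ceil(-log2(p_i)); they always satisfy the Kraft inequality."""
    return [math.ceil(-math.log2(p)) for p in probs]


# Example probabilities chosen for illustration.
probs = [0.35, 0.17, 0.17, 0.16, 0.15]
lengths = shannon_lengths(probs)

H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2 ** (-l) for l in lengths)

print(lengths)         # [2, 3, 3, 3, 3]
print(H <= L < H + 1)  # True
print(kraft <= 1)      # True
```

These lengths are not optimal in general; they only show that a code within one bit of the entropy exists.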
The redundancy
- It is defined as the average codeword length minus the entropy
  $$\text{Redundancy} = L - \left( -\sum_{i} p_i \log p_i \right)$$
- Note that, for a code satisfying the bounds of the previous theorems,
  $$0 \le \text{redundancy} < 1$$
  (why?)

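The definition translates directly into a few lines of code. This is a minimal sketch with an assumed function name; the dyadic example source is hypothetical and is chosen so that the integer lengths match $-\log_2 p_i$ exactly.

```python
import math


def redundancy(probs, lengths):
    """Average codeword length minus the entropy, both in bits."""
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return avg_len - entropy


# Hypothetical dyadic source: the code is as short as the entropy allows.
print(redundancy([0.5, 0.25, 0.25], [1, 2, 2]))  # 0.0
```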
Compression ratio
- It is the ratio between the average number of bits/symbol in the original message and the same quantity for the coded message, i.e.
  $$C = \frac{\text{average original symbol length}}{\text{average compressed symbol length}}$$
  where the denominator is the average codeword length $L(X)$!!

Uniquely decodable codes
- The instantaneous codes are a small subset of the uniquely decodable codes.
- Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO
- So we use instantaneous codes, which are easier to decode

Summary
- Average codeword length L
  $$L \ge H(X)$$
  for uniquely decodable codes (and for instantaneous codes)
- In practice, for each r.v. X with entropy $H(X)$ we can build a code with average codeword length that satisfies
  $$H(X) \le L < H(X) + 1$$

Shannon-Fano coding
- The main advantage of the Shannon-Fano technique is its simplicity
  - Source symbols are listed in order of nonincreasing probability.
  - The list is divided so as to form two groups of as nearly equal total probability as possible
  - Each symbol in the first group receives a 0 as the first digit of its codeword, while the others receive a 1
  - Each of these groups is then divided according to the same criterion and additional code digits are appended
  - The process is continued until each group contains only one symbol

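The steps listed above can be sketched as a short recursive routine. This is one possible reading of the procedure, not a reference implementation: the function name, the input format and the rule for breaking ties between equally good splits (the first one found wins) are assumptions.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability). Returns a dict symbol -> codeword string."""
    # Step 1: list the symbols in order of nonincreasing probability.
    ordered = sorted(symbols, key=lambda sp: sp[1], reverse=True)
    codes = {s: "" for s, _ in ordered}

    def split(group):
        if len(group) < 2:
            return
        total = sum(p for _, p in group)
        running, best_cut, best_diff = 0.0, 1, float("inf")
        # Step 2: find the cut that makes the two groups as equiprobable as possible.
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_cut, best_diff = cut, diff
        first, second = group[:best_cut], group[best_cut:]
        # Step 3: append 0 to the first group and 1 to the second.
        for s, _ in first:
            codes[s] += "0"
        for s, _ in second:
            codes[s] += "1"
        # Step 4: repeat on each group until it holds a single symbol.
        split(first)
        split(second)

    split(ordered)
    return codes


source = [("a", 1/2), ("b", 1/4), ("c", 1/8), ("d", 1/16), ("e", 1/32), ("f", 1/32)]
print(shannon_fano(source))
```

A source with a single symbol gets the empty codeword in this sketch; a real encoder would need to treat that case separately.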
Shannon-Fano coding - example
Symb.  Prob.  Codeword
a      1/2    0
b      1/4    10
c      1/8    110
d      1/16   1110
e      1/32   11110
f      1/32   11111
H = 1.9375 bits
L = 1.9375 bits

Shannon-Fano coding - exercise
- Encode, using the Shannon-Fano algorithm
Symb.  Prob.
*      12%
?      5%
!      13%
&      2%
$      29%
€      13%
§      10%
°      6%
@      10%

Is Shannon-Fano coding optimal?
Symb.  Prob.  Shannon-Fano code  Second code
a      0.35   00                 0
b      0.17   01                 100
c      0.17   10                 101
d      0.16   110                110
e      0.15   111                111
H = 2.2328 bits
L = 2.31 bits (Shannon-Fano code)
L1 = 2.3 bits (second code)

Huffman coding - I
- There is another algorithm whose performance is slightly better than Shannon-Fano: the famous Huffman coding
- It works by constructing a tree bottom-up, with the symbols in the leaves
- The two leaves with the smallest probabilities become siblings under a parent node whose probability is equal to the sum of the two children's probabilities

Huffman coding - II
- At this point the operation is repeated, considering also the new parent node and ignoring its children
- The process continues until there is only one parent node with probability 1, which is the root of the tree
- Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important)

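The bottom-up construction can be sketched with a priority queue. This is a minimal illustration rather than the course's implementation: the function name is an assumption, the heap-based selection of the two smallest nodes is one common choice, and ties are broken by insertion order, so the codewords may differ from a hand-built tree while the average length stays the same.

```python
import heapq
from itertools import count


def huffman_code(probs):
    """probs: dict symbol -> probability. Returns a dict symbol -> codeword string."""
    tiebreak = count()  # unique counter so tuples never compare the dicts
    # Each heap entry is (probability, tiebreak, {symbol: partial codeword}).
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two nodes with the smallest probabilities...
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        # ...label the two branches 0 and 1...
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        # ...and replace them with a parent whose probability is their sum.
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)
```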
Huffman coding - example
Symbol  Prob.
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1
[Figure: Huffman tree built bottom-up over the leaves a-g; the internal nodes have probabilities 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0 (root), and the two branches of each internal node are labeled 0 and 1]

Huffman coding - example
Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111
Exercise: evaluate H(X) and L(X)
H(X) = 2.5464 bits
L(X) = 2.6 bits !!

Huffman coding - exercise
- Code the sequence aeebcddegfced with the Huffman code of the previous example (a = 0000, b = 0001, c = 001, d = 01, e = 10, f = 110, g = 111) and calculate the compression ratio
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Aver. orig. symb. length = 3 bits
Aver. compr. symb. length = 34/13 bits
C = .....

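The encoding step of this exercise can be reproduced with a lookup table. In the sketch below the function names are assumptions, and the 3 bits per original symbol come from the exercise statement; the ratio is returned as an exact fraction so that it can be compared with a hand calculation.

```python
from fractions import Fraction

# Codeword table from the example slides.
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}


def encode(message, code):
    """Concatenate the codewords of the symbols in message."""
    return "".join(code[s] for s in message)


def compression_ratio(message, code, original_bits_per_symbol):
    """Original size in bits divided by compressed size in bits."""
    compressed_bits = len(encode(message, code))
    return Fraction(original_bits_per_symbol * len(message), compressed_bits)


bits = encode("aeebcddegfced", CODE)
print(bits, len(bits))
print(compression_ratio("aeebcddegfced", CODE, 3))
```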
Huffman coding - exercise
- Decode the sequence 0111001001000001111110 with the same Huffman code (a = 0000, b = 0001, c = 001, d = 01, e = 10, f = 110, g = 111)
Sol: dfdcadgf

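Because the code is instantaneous, decoding can proceed bit by bit, emitting a symbol as soon as the bits read so far match a codeword. The sketch below uses a dictionary lookup instead of an explicit tree; the function name and the error handling are assumptions.

```python
# Codeword table from the example slides.
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}


def decode(bits, code):
    """Decode a string of '0'/'1' characters produced with a prefix code."""
    inverse = {cw: s for s, cw in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        # No codeword is a prefix of another, so the first match is the right one.
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


print(decode("0111001001000001111110", CODE))  # dfdcadgf
```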
Huffman coding - exercise
- Encode with Huffman the sequence 01$cc0a02ba10 and evaluate entropy, average codeword length and compression ratio
Symb.  Prob.
a      0.10
b      0.03
c      0.14
0      0.4
1      0.22
2      0.04
$      0.07

Huffman coding - exercise
Symb.  Prob.
0      0.16
1      0.02
2      0.15
3      0.29
4      0.17
5      0.04
%      0.17
- Decode (if possible) the Huffman-coded bit stream 01001011010011110101...

Huffman coding - notes
- In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen
- Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it provides a more efficient implementation

Huffman coding - notes
- There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities.
- For example, for a source with probabilities {0.4, 0.2, 0.2, 0.1, 0.1} it is possible to obtain codeword lengths of {1, 2, 3, 4, 4} and of {2, 2, 2, 3, 3}
- It would be better to have a code whose codeword lengths have the minimum variance, as this solution needs the minimum buffer space in the transmitter and in the receiver

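A quick computation shows why the second set of lengths is preferable: both sets give the same average length, but very different variances. The function name below is an assumption; the probabilities and lengths are the ones quoted in the note.

```python
def length_stats(probs, lengths):
    """Return (mean, variance) of the codeword length under the given probabilities."""
    mean = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - mean) ** 2 for p, l in zip(probs, lengths))
    return mean, var


probs = [0.4, 0.2, 0.2, 0.1, 0.1]
print(length_stats(probs, [1, 2, 3, 4, 4]))  # mean about 2.2, variance about 1.36
print(length_stats(probs, [2, 2, 2, 3, 3]))  # mean about 2.2, variance about 0.16
```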
Huffman coding - notes
- Schwarz defines a variant of the Huffman algorithm that allows building the code with minimum $l_{max}$.
- There are several other variants; we will explain the most important one in a while.

Optimality of Huffman coding - I
- It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal
- In other words, the Huffman code has minimum redundancy
- An upper bound for the redundancy has been found
  $$\text{redundancy} \le p_1 + 1 - \log_2 e + \log_2(\log_2 e) \approx p_1 + 0.086$$
  where $p_1$ is the probability of the most likely symbol

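The constant 0.086 in the bound can be verified directly. The function name below is an assumption; the formula is the one stated on the slide.

```python
import math


def redundancy_bound(p1):
    """Upper bound on Huffman redundancy given the largest symbol probability p1."""
    return p1 + 1 - math.log2(math.e) + math.log2(math.log2(math.e))


# With p1 = 0 only the constant term remains.
print(round(redundancy_bound(0.0), 3))  # 0.086
```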
Optimality of Huffman coding - II
- Why does the Huffman code "suffer" when there is one symbol with very high probability?
  - Remember the notion of uncertainty...
    $$p(x) \to 1 \;\Rightarrow\; -\log(p(x)) \to 0$$
    The main problem is the integer constraint on codelengths!!
- This consideration opens the way to a more powerful coding... we will see it later

Huffman coding - implementation
- A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...)
- Nevertheless, encoding is very fast

Huffman coding - implementation
- However, the space and time complexity of the decoding phase are far more important because, on average, decoding will happen more frequently.
- Consider a Huffman tree with n symbols
  - n leaves and n-1 internal nodes
  - each leaf has a pointer to a symbol and the information that it is a leaf
  - each internal node has two pointers
  $$2n + 2(n-1) \approx 4n \text{ words (32 bits each)}$$

Huffman coding - implementation
- 1 million symbols → 16 MB of memory! (about 4n = 4 million words of 4 bytes each)
- Moreover, traversing a tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes several page faults or cache misses.
- To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding

canonical Huffman coding - I
Symb.  Prob.  Code 1  Code 2  Code 3
a      0.11   000     111     000
b      0.12   001     110     001
c      0.13   100     011     010
d      0.14   101     010     011
e      0.24   01      10      10
f      0.26   11      00      11
[Figure: Huffman tree over the leaves a-f with internal nodes 0.23 (a, b), 0.27 (c, d), 0.47, 0.53 and 1.0 (root); the branch labels 0/1 give Code 1, the swapped labels (0)/(1) give Code 2, and Code 3 is marked with a question mark]

canonical Huffman coding - II
- This code (Code 3) cannot be obtained through a Huffman tree!
- We still call it a Huffman code because it is instantaneous and its codeword lengths are the same as those of a valid Huffman code
- numerical sequence property
  - codewords with the same length are ordered lexicographically
  - when the codewords are sorted in lexical order they are also in order from the longest to the shortest codeword
Symb.  Code 3
a      000
b      001
c      010
d      011
e      10
f      11

canonical Huffman coding - III
- The main advantage is that it is not necessary to store a tree in order to decode
- We need
  - a list of the symbols ordered according to the lexical order of the codewords
  - an array with the first codeword of each distinct length

canonical Huffman coding - IV
Encoding. Suppose there are n distinct symbols, that for symbol i we have calculated the Huffman codelength $l_i$, and that $l_i \le maxlength$ for every i

    for k ← 1 to maxlength { numl[k] ← 0; }
    for i ← 1 to n { numl[l_i] ← numl[l_i] + 1; }
    firstcode[maxlength] ← 0;
    for k ← maxlength - 1 downto 1 {
        firstcode[k] ← (firstcode[k+1] + numl[k+1]) / 2; }
    for k ← 1 to maxlength { nextcode[k] ← firstcode[k]; }
    for i ← 1 to n {
        codeword[i] ← nextcode[l_i];
        symbol[l_i, nextcode[l_i] - firstcode[l_i]] ← i;
        nextcode[l_i] ← nextcode[l_i] + 1; }

numl[k] = number of codewords with length k
firstcode[k] = integer for the first code of length k
nextcode[k] = integer for the next codeword of length k to be assigned
symbol[-,-] = used for decoding
codeword[i] = the rightmost $l_i$ bits of this integer are the code for symbol i

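The encoding pseudocode can be turned into a short runnable sketch. The function name, the use of dictionaries and lists in place of the arrays, and the integer division `//` (which assumes the lengths come from a complete Huffman code, so the sum being halved is even) are choices made for this illustration.

```python
def canonical_codes(lengths):
    """lengths: dict symbol -> codeword length, listed in symbol order.

    Returns (numl, firstcode, codewords, symbol_table); numl and firstcode
    are lists indexed from length 1 to maxlength.
    """
    maxlength = max(lengths.values())
    numl = [0] * (maxlength + 1)  # index 0 is unused
    for l in lengths.values():
        numl[l] += 1
    firstcode = [0] * (maxlength + 1)  # firstcode[maxlength] stays 0
    for k in range(maxlength - 1, 0, -1):
        firstcode[k] = (firstcode[k + 1] + numl[k + 1]) // 2
    nextcode = firstcode[:]
    codewords = {}
    symbol_table = {k: [] for k in range(1, maxlength + 1)}
    for s, l in lengths.items():
        # Keep only the rightmost l bits of the integer, as a bit string.
        codewords[s] = format(nextcode[l], "0{}b".format(l))
        # The position in the list equals nextcode[l] - firstcode[l].
        symbol_table[l].append(s)
        nextcode[l] += 1
    return numl[1:], firstcode[1:], codewords, symbol_table


lengths = {"a": 2, "b": 5, "c": 5, "d": 3, "e": 2, "f": 5, "g": 5, "h": 2}
numl, firstcode, codewords, symbol_table = canonical_codes(lengths)
print(numl)       # [0, 3, 1, 0, 4]
print(firstcode)  # [2, 1, 1, 2, 0]
print(codewords)
```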
canonical Huffman - example
Symb.  l_i
a      2
b      5
c      5
d      3
e      2
f      5
g      5
h      2
- 1. Evaluate array numl
  numl: [0 3 1 0 4]
- 2. Evaluate array firstcode

    for k ← maxlength - 1 downto 1 {
        firstcode[k] ← (firstcode[k+1] + numl[k+1]) / 2; }

  firstcode: [2 1 1 2 0]
- 3. Construct arrays codeword and symbol

    for k ← 1 to maxlength { nextcode[k] ← firstcode[k]; }
    for i ← 1 to n {
        codeword[i] ← nextcode[l_i];
        symbol[l_i, nextcode[l_i] - firstcode[l_i]] ← i;
        nextcode[l_i] ← nextcode[l_i] + 1; }

Symb.  codeword  bits
a      1         01
b      0         00000
c      1         00001
d      1         001
e      2         10
f      2         00010
g      3         00011
h      3         11

symbol  0  1  2  3
1       -  -  -  -
2       a  e  h  -
3       d  -  -  -
4       -  -  -  -
5       b  c  f  g

canonical Huffman coding - V
Decoding. We have the arrays firstcode and symbol

    v ← nextinputbit();
    k ← 1;
    while v < firstcode[k] {
        v ← 2*v + nextinputbit();
        k ← k + 1; }
    Return symbol[k, v - firstcode[k]];

nextinputbit() = function that returns the next input bit
firstcode[k] = integer for the first code of length k
symbol[k,n] = returns the symbol number n with codelength k

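The decoding loop can be sketched in the same way. Here the input is a string of '0' and '1' characters instead of a real bit stream, and `firstcode` and `symbol` are passed as dictionaries keyed by codeword length; these representations and the function name are assumptions made for the illustration.

```python
def canonical_decode(bits, firstcode, symbol):
    """Decode a string of '0'/'1' characters with a canonical Huffman code.

    firstcode: dict length -> integer of the first codeword of that length.
    symbol: dict length -> list of symbols, in codeword order.
    """
    stream = iter(bits)
    out = []
    for first in stream:
        v, k = int(first), 1
        # Keep reading bits until v reaches the first codeword of length k.
        while v < firstcode[k]:
            v = 2 * v + int(next(stream))
            k += 1
        out.append(symbol[k][v - firstcode[k]])
    return "".join(out)


# Arrays from the canonical Huffman example.
firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}
symbol = {2: ["a", "e", "h"], 3: ["d"], 5: ["b", "c", "f", "g"]}
print(canonical_decode("00111100000001001", firstcode, symbol))  # dhebad
```

In this sketch an input that ends in the middle of a codeword raises `StopIteration`; a production decoder would report that case explicitly.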
canonical Huffman - example

    v ← nextinputbit();
    k ← 1;
    while v < firstcode[k] {
        v ← 2*v + nextinputbit();
        k ← k + 1; }
    Return symbol[k, v - firstcode[k]];

firstcode: [2 1 1 2 0]

symbol  0  1  2  3
1       -  -  -  -
2       a  e  h  -
3       d  -  -  -
4       -  -  -  -
5       b  c  f  g

Input bits: 001 11 10 00000 01 001
symbol[3,0] = d
symbol[2,2] = h
symbol[2,1] = e
symbol[5,0] = b
symbol[2,0] = a
symbol[3,0] = d
Decoded: dhebad
