CS/EE 5590 / ENG 401 Special Topics
(Class Ids: 17804, 17815, 17803)
Lec 02
Entropy and Lossless Coding I
Zhu Li
Outline
 Lecture 01 ReCap
 Info Theory on Entropy
 Lossless Entropy Coding
Video Compression in Summary
Video Coding Standards: Rate-Distortion Performance
 Pre-HEVC
PSS over managed IP networks
 Managed mobile core IP networks
MPEG DASH – OTT
 HTTP Adaptive Streaming of Video
Outline
 Lecture 01 ReCap
 Info Theory on Entropy
 Self Info of an event
 Entropy of the source
 Relative Entropy
 Mutual Info
 Entropy Coding
Thanks to SFU’s Prof. Jie Liang for his slides!
Entropy and its Application
Entropy coding: the last part of a compression system
Losslessly represent symbols
Key idea:
 Assign short codes for common symbols
 Assign long codes for rare symbols
Question:
 How to evaluate a compression method?
o Need to know the lower bound we can achieve.
o  Entropy
[Encoder block diagram: Transform → Quantization → Entropy coding → bitstream 0100100101111]
Claude Shannon: 1916-2001
 A distant relative of Thomas Edison
 1932: Went to University of Michigan.
 1937: Master thesis at MIT became the foundation of
digital circuit design:
o “The most important, and also the most famous,
master's thesis of the century”
 1940: PhD, MIT
 1940-1956: Bell Lab (back to MIT after that)
 1948: The birth of Information Theory
o A mathematical theory of communication, Bell System
Technical Journal.
Axiom Definition of Information
Information is a measure of uncertainty or surprise
 Axiom 1:
 Information of an event is a function of its probability:
i(A) = f (P(A)). What’s the expression of f()?
 Axiom 2:
 Rare events have high information content
 Water found on Mars!!!
 Common events have low information content
 It’s raining in Vancouver.
Information should be a decreasing function of the probability.
Still, there are numerous choices of f().
 Axiom 3:
 Information of two independent events = sum of individual information:
If P(AB)=P(A)P(B)  i(AB) = i(A) + i(B).
 Only the logarithmic function satisfies these conditions.
Self-information
• Shannon’s Definition [1948]:
• X: discrete random variable with alphabet {A1, A2, …, AN}
• Probability mass function: p(x) = Pr{ X = x }
• Self-information of an event X = x:
   i(x) = log_b( 1/p(x) ) = -log_b p(x)
If b = 2, the unit of information is the bit.
Self-information indicates the number of bits needed to represent an event.
[Plot: -log_b P(x) versus P(x), a decreasing function of P(x) on (0, 1]]
Entropy of a Random Variable
 Recall: the mean of a function g(X):
   E_p( g(X) ) = ∑_x p(x) g(x)
Entropy is the expected self-information of the r.v. X:
   H(X) = ∑_x p(x) log( 1/p(x) ) = E_p[ log( 1/p(X) ) ] = -E_p[ log p(X) ]
 The entropy represents the minimal number of bits needed to losslessly represent one output of the source.
Also write as H(p): function of the distribution of X, not the value of X.
Example
P(X=0) = 1/2
P(X=1) = 1/4
P(X=2) = 1/8
P(X=3) = 1/8
Find the entropy of X.
Solution:
H(X) = ∑_x p(x) log( 1/p(x) )
     = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/8) log 8
     = 1/2 + 2/4 + 3/8 + 3/8 = 7/4 bits/sample.
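A quick numerical check of this example, as a minimal Python sketch (the `entropy` helper below is our own name, not a library function):

```python
import math

def entropy(probs, base=2):
    """H(X) = sum_x p(x) * log(1/p(x)); zero-probability symbols contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Distribution from the example: P(X) = [1/2, 1/4, 1/8, 1/8]
print(entropy([0.5, 0.25, 0.125, 0.125]))   # -> 1.75 bits/sample, i.e. 7/4
```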
Example
A binary source: only two possible outputs: 0, 1
 Source output example: 000101000101110101……
 p(X=0) = p, p(X=1)= 1 – p.
Entropy of X:
 H(p) = p (-log2(p) ) + (1-p) (-log2(1-p))
 H = 0 when p = 0 or p =1
oFixed output, no information
 H is largest when p = 1/2
oHighest uncertainty
oH = 1 bit in this case
Properties:
 H ≥ 0
 H is concave (proved later)
[Plot: binary entropy H(p) versus p on [0, 1]; the curve peaks at 1 bit for p = 1/2 (equal probabilities maximize entropy)]
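To see the shape of this curve numerically, here is a small plain-Python sketch (no plotting; the name `binary_entropy` is ours):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"p = {p:4.2f}   H(p) = {binary_entropy(p):.4f} bits")
# The maximum, 1 bit, occurs at p = 0.5.
```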
Joint entropy
• We can get a better understanding of the source S by looking at a block of output X1 X2 … Xn.
• The joint probability of a block of output:
   p(X1 = i1, X2 = i2, …, Xn = in)
 Joint entropy:
   H(X1, X2, …, Xn) = ∑_{i1, …, in} p(X1 = i1, …, Xn = in) log( 1 / p(X1 = i1, …, Xn = in) )
                    = -E[ log p(X1, …, Xn) ]
 Joint entropy is the number of bits required to represent the sequence X1 X2 … Xn.
 This is the lower bound for entropy coding.
Conditional Entropy
• Conditional self-information of an event X = x, given that event Y = y has occurred:
   i(x | y) = log( 1 / p(x | y) ) = log( p(y) / p(x, y) )
 Conditional entropy H(Y | X): the average conditional self-information,
   H(Y | X) = ∑_x p(x) H(Y | X = x) = -∑_x p(x) ∑_y p(y | x) log( p(y | x) )
            = -∑_{x, y} p(x, y) log( p(y | x) ) = -E[ log p(Y | X) ]
Remaining uncertainty about Y given the knowledge of X.
Note: p(x | y), p(x, y) and p(y) are three different distributions: p1(x | y), p2(x, y) and p3(y).
Conditional Entropy
Example: for the following joint distribution p(x, y), find H(X | Y).
  Y \ X    1      2      3      4
    1     1/8    1/16   1/32   1/32
    2     1/16   1/8    1/32   1/32
    3     1/16   1/16   1/16   1/16
    4     1/4    0      0      0
From the definition,
   H(X | Y) = ∑_y p(y) H(X | Y = y) = -∑_{x, y} p(x, y) log( p(x | y) )
Need the conditional prob p(x | y) = p(x, y) / p(y)
 Need the marginal probabilities first: p(x) by summing each column, p(y) by summing each row.
P(X) = [ 1/2, 1/4, 1/8, 1/8 ] >> H(X) = 7/4 bits
P(Y) = [ 1/4, 1/4, 1/4, 1/4 ] >> H(Y) = 2 bits
H(X | Y) = ∑_y p(Y = y) H(X | Y = y)
         = 1/4 H(1/2, 1/4, 1/8, 1/8) + 1/4 H(1/4, 1/2, 1/8, 1/8)
         + 1/4 H(1/4, 1/4, 1/4, 1/4) + 1/4 H(1, 0, 0, 0)
         = 11/8 bits
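A numerical check of this table in Python (a minimal sketch; the variable and helper names are ours). It reproduces H(X), H(Y), H(X, Y) and both conditional entropies, which also previews the chain rule on the next slide:

```python
import math

# Joint distribution p(x, y): rows indexed by y = 1..4, columns by x = 1..4
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]

def H(probs):
    """Entropy in bits of a probability vector (zero entries contribute 0)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p_x = [sum(row[j] for row in P) for j in range(4)]   # marginal of X (column sums)
p_y = [sum(row) for row in P]                        # marginal of Y (row sums)
H_xy = H([p for row in P for p in row])              # joint entropy H(X, Y)

print(H(p_x), H(p_y), H_xy)     # 1.75, 2.0, 3.375  (= 7/4, 2, 27/8 bits)
print(H_xy - H(p_y))            # H(X|Y) = 1.375 = 11/8 bits
print(H_xy - H(p_x))            # H(Y|X) = 1.625 = 13/8 bits
```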
Chain Rule
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
Proof:
H(X, Y) = -∑_x ∑_y p(x, y) log p(x, y)
        = -∑_x ∑_y p(x, y) log( p(x) p(y | x) )
        = -∑_x ∑_y p(x, y) log p(x) - ∑_x ∑_y p(x, y) log p(y | x)
        = -∑_x p(x) log p(x) + H(Y | X) = H(X) + H(Y | X).
Simpler notation:
H(X, Y) = -E( log p(X, Y) ) = -E( log p(X) + log p(Y | X) ) = H(X) + H(Y | X)
[Venn diagram: total area H(X, Y); circles H(X) and H(Y), with H(X | Y) and H(Y | X) as the non-overlapping parts]
Conditional Entropy
Example (continued): for the same joint distribution p(x, y) as before, the chain rule gives the same answer.
 Indeed, H(X | Y) = H(X, Y) – H(Y) = 27/8 – 2 = 11/8 bits,
matching the direct computation
   H(X | Y) = ∑_y p(Y = y) H(X | Y = y) = 11/8 bits,
with P(X) = [ 1/2, 1/4, 1/8, 1/8 ] >> H(X) = 7/4 bits and P(Y) = [ 1/4, 1/4, 1/4, 1/4 ] >> H(Y) = 2 bits.
Chain Rule
 H(X, Y) = H(X) + H(Y | X)
 Corollary: H(X, Y | Z) = H(X | Z) + H(Y | X, Z)
Note that p(x, y | z) = p(y | x, z) p(x | z).
(Multiplying both sides by p(z) gives p(x, y, z) = p(y | x, z) p(x, z).)
Proof:
H(X, Y | Z) = -∑_z p(z) ∑_x ∑_y p(x, y | z) log p(x, y | z)
            = -∑_x ∑_y ∑_z p(x, y, z) log p(x, y | z) = -E[ log p(X, Y | Z) ]
            = -E[ log p(X | Z) ] - E[ log p(Y | X, Z) ]
            = H(X | Z) + H(Y | X, Z)
General Chain Rule
General form of chain rule:
   H(X1, X2, …, Xn) = ∑_{i=1}^{n} H(Xi | X_{i-1}, …, X1)
 The joint encoding of a sequence can be broken into the
sequential encoding of each sample, e.g.
H(X1, X2, X3)=H(X1) + H(X2|X1) + H(X3|X2, X1)
 Advantages:
 Joint encoding needs joint probability: difficult
 Sequential encoding only needs conditional entropy,
can use local neighbors to approximate the conditional entropy
 context-adaptive arithmetic coding.
(Adding H(Z) to the corollary of the previous slide: H(X, Y | Z) + H(Z) = H(X, Y, Z) = H(Z) + H(X | Z) + H(Y | X, Z).)
General Chain Rule
Proof: p(x1, …, xn) = p(x1) p(x2 | x1) … p(xn | x_{n-1}, …, x1), so
H(X1, …, Xn) = -∑_{x1, …, xn} p(x1, …, xn) log p(x1, …, xn)
             = -∑_{x1, …, xn} p(x1, …, xn) ∑_{i=1}^{n} log p(xi | x_{i-1}, …, x1)
             = -∑_{i=1}^{n} ∑_{x1, …, xn} p(x1, …, xn) log p(xi | x_{i-1}, …, x1)
             = ∑_{i=1}^{n} H(Xi | X_{i-1}, …, X1).
General Chain Rule
The complexity of the conditional probability p(xi | x1, …, x_{i-1}) grows as i increases.
In many cases we can approximate the conditional probability with some nearest neighbors (contexts):
   p(xi | x1, …, x_{i-1}) ≈ p(xi | x_{i-L}, …, x_{i-1})
 The low-dimensional conditional prob is more manageable
 How to measure the quality of the approximation?
  Relative entropy
[Figure: causal context examples: previous bits "0 1 1 0 1 0 1", previous letters "a b c b c a b / c b a b c b a"]
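As a toy illustration of context modeling (a sketch under our own assumptions, with a made-up source string; this is not the context model of any particular standard), the snippet below estimates the order-0 entropy of a symbol stream and the conditional entropy given the previous symbol. Conditioning on a context lowers the estimated rate when the source has memory:

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Empirical entropy in bits from a Counter of symbol occurrences."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())

data = "abcbcab" * 60   # toy source with strong sequential structure

# Order-0 estimate: H(X)
h0 = entropy_from_counts(Counter(data))

# Order-1 context estimate: H(X_i | X_{i-1}) = sum_c p(context = c) * H(X_i | context = c)
contexts = {}
for prev, cur in zip(data, data[1:]):
    contexts.setdefault(prev, Counter())[cur] += 1
n_pairs = len(data) - 1
h1 = sum(sum(cnt.values()) / n_pairs * entropy_from_counts(cnt)
         for cnt in contexts.values())

print(f"order-0: {h0:.3f} bits/symbol, order-1: {h1:.3f} bits/symbol")
```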
Relative Entropy – Cost of Coding with the Wrong Distribution
Also known as the Kullback-Leibler (K-L) distance, information divergence, or information gain.
A measure of the “distance” between two distributions:
   D(p || q) = ∑_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
 In many applications, the true distribution p(X) is unknown, and we only know an estimated distribution q(X).
 What is the inefficiency in representing X?
o The true entropy:  R1 = -∑_x p(x) log p(x)
o The actual rate:   R2 = -∑_x p(x) log q(x)
o The difference:    R2 - R1 = D(p || q)
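A small numerical sketch of this coding-penalty interpretation (Python; all function names and the example distributions are ours):

```python
import math

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average rate -sum p(x) log2 q(x) when coding a p-source with a code designed for q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution

R1, R2 = entropy(p), cross_entropy(p, q)
print(R1, R2, R2 - R1, kl_divergence(p, q))   # 1.75, 2.0, 0.25, 0.25 bits
```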
Relative Entropy
Properties:
   D(p || q) = ∑_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
 D(p || q) ≥ 0.
 D(p || q) = 0 if and only if q = p.  (Proved later.)
 What if p(x) > 0, but q(x) = 0 for some x?  D(p || q) = ∞
 Caution: D(p || q) is not a true distance
 Not symmetric in general: D(p || q) ≠ D(q || p)
 Does not satisfy the triangular inequality.
Relative Entropy
How to make it symmetric?
 Many possibilities, for example:
   (1/2) [ D(p || q) + D(q || p) ]
or
   1 / ( 1/D(p || q) + 1/D(q || p) )  ( = D(p || q) D(q || p) / ( D(p || q) + D(q || p) ) )
 These can be useful for pattern classification.
Mutual Information
i(x | y): conditional self-information, i(x | y) = -log p(x | y)
 Mutual information between two events:
   i(x; y) = i(x) - i(x | y) = log( p(x | y) / p(x) ) = log( p(x, y) / ( p(x) p(y) ) )
 A measure of the amount of information that one event contains about another one,
 or the reduction in the uncertainty of one event due to the knowledge of the other.
Note: i(x; y) can be negative, if p(x | y) < p(x).
Mutual Information
I(X; Y): mutual information between two random variables:
   I(X; Y) = ∑_x ∑_y p(x, y) i(x; y) = ∑_x ∑_y p(x, y) log( p(x, y) / ( p(x) p(y) ) )
 Mutual information is a relative entropy:
   I(X; Y) = D( p(x, y) || p(x) p(y) ) = E[ log( p(X, Y) / ( p(X) p(Y) ) ) ]
Different from i(x; y), I(X; Y) ≥ 0 (due to averaging).
And it is symmetric: I(X; Y) = I(Y; X)
 If X, Y are independent: p(x, y) = p(x) p(y)
  I(X; Y) = 0
  Knowing X does not reduce the uncertainty of Y.
Entropy and Mutual Information
1. I(X; Y) = H(X) - H(X | Y)
Proof: expand the definition:
   I(X; Y) = ∑_x ∑_y p(x, y) log( p(x, y) / ( p(x) p(y) ) ) = ∑_x ∑_y p(x, y) log( p(x | y) / p(x) )
           = -∑_x ∑_y p(x, y) log p(x) + ∑_x ∑_y p(x, y) log p(x | y)
           = H(X) - H(X | Y)
2. Similarly: I(X; Y) = H(Y) - H(Y | X)
3. I(X; Y) = H(X) + H(Y) - H(X, Y)
Proof:
   I(X; Y) = ∑_x ∑_y p(x, y) [ log p(x, y) - log p(x) - log p(y) ]
           = H(X) + H(Y) - H(X, Y)
Entropy and Mutual Information
[Venn diagram: total area H(X, Y); H(X) and H(Y) overlap in I(X; Y), with H(X | Y) and H(Y | X) as the non-overlapping parts]
It can be seen from this figure that I(X; X) = H(X):
Proof:
Let X = Y in I(X; Y) = H(X) + H(Y) – H(X, Y),
or in I(X; Y) = H(X) – H(X | Y) (and use H(X|X)=0).
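Continuing the joint-distribution example from the conditional-entropy slides, a quick Python check of identities 1 and 3 (the helper H and the table P are ours, repeated here so the block is self-contained):

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]
p_x = [sum(row[j] for row in P) for j in range(4)]
p_y = [sum(row) for row in P]
H_xy = H([p for row in P for p in row])

print(H(p_x) + H(p_y) - H_xy)     # identity 3: I(X;Y) = 0.375 bits = 3/8
print(H(p_x) - (H_xy - H(p_y)))   # identity 1: H(X) - H(X|Y), also 3/8
```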
Application of Mutual Information
[Figure: text context example with previous letters "a b c b c a b / c b a b c b a"]
Mutual information can be used in the optimization of context quantization.
Example: if each neighbor has 26 possible values (a to z), then 5 neighbors have 26^5 combinations:  too many conditional probs to estimate.
To reduce the number, we can group similar data patterns together  context quantization:
   p(xi | x1, …, x_{i-1}) ≈ p(xi | f(x1, …, x_{i-1}))
Application of Mutual Information
We need to design the function f( ) in
   p(xi | x1, …, x_{i-1}) ≈ p(xi | f(x1, …, x_{i-1}))
to minimize the conditional entropy H(Xi | f(X1, …, X_{i-1})) appearing in the chain-rule bound
   H(X1, X2, …, Xn) = ∑_{i=1}^{n} H(Xi | X_{i-1}, …, X1).
But H(X | Y) = H(X) - I(X; Y), so the problem is equivalent to maximizing the mutual information between Xi and f(x1, …, x_{i-1}).
For further info: Liu and Karam, “Mutual Information-Based Analysis of JPEG2000 Contexts,” IEEE Trans. Image Processing, vol. 14, no. 4, April 2005, pp. 411-422.
Outline
 Lecture 01 ReCap
 Info Theory on Entropy
 Entropy Coding
 Prefix Coding
 Kraft-McMillan Inequality
 Shannon Codes
Variable Length Coding
Design the mapping from source symbols to codewords
Lossless mapping
Different codewords may have different lengths
Goal: minimizing the average codeword length
The entropy is the lower bound.
Classes of Codes
Non-singular code: Different inputs are mapped to different
codewords (invertible).
Uniquely decodable code: any encoded string has only one possible
source string, but may need delay to decode.
Prefix-free code (or simply prefix, or instantaneous):
No codeword is a prefix of any other codeword.
 The focus of our studies.
 Questions:
o Characteristic?
o How to design?
o Is it optimal?
[Diagram: nested code classes: all codes ⊃ non-singular codes ⊃ uniquely decodable codes ⊃ prefix-free codes]
Prefix Code
 Examples:
  X   Singular   Non-singular, but not   Uniquely decodable,   Prefix-free
                 uniquely decodable      but not prefix-free
  1   0          0                       0                     0
  2   0          010                     01                    10
  3   0          01                      011                   110
  4   0          10                      0111                  111
The non-singular code needs punctuation: ……01011…
The uniquely decodable code needs to look at the next bit(s) to decode the previous codeword.
Carter-Gill’s Conjecture [1974]
 Every uniquely decodable code can be replaced by a prefix-free code
with the same set of codeword compositions.
 So we only need to study prefix-free code.
Prefix-free Code
Can be uniquely decoded.
No codeword is a prefix of another one.
Also called prefix code
Goal: construct prefix code with minimal expected length.
Can put all codewords in a binary tree:
[Figure: binary code tree with a root node, internal nodes, and leaf nodes; the codewords 0, 10, 110, 111 sit at the leaves]
 Prefix-free code contains leaves only.
 How to express the requirement mathematically?
Kraft-McMillan Inequality
 The characteristic of prefix-free codes:
• The codeword lengths li, i = 1, …, N of a prefix code over an alphabet of size D (= 2 here) satisfy the inequality
   ∑_{i=1}^{N} D^(-li) ≤ 1   (for D = 2:  ∑_{i=1}^{N} 2^(-li) ≤ 1)
• Conversely, if a set of {li} satisfies the inequality above, then there exists a prefix code with codeword lengths li, i = 1, …, N.
Kraft-McMillan Inequality
 Consider D = 2: expand the binary code tree to full depth L = max(li).
 Number of nodes in the last level: 2^L
 Each codeword corresponds to a sub-tree; the number of its offspring in the last level: 2^(L - li)
 K-M inequality: the total number of L-th level offspring of all codewords is at most 2^L:
   ∑_{i=1}^{N} 2^(L - li) ≤ 2^L   ⇒   ∑_{i=1}^{N} 2^(-li) ≤ 1
Example: {0, 10, 110, 111} with L = 3: the full tree has 2^3 = 8 leaves, and the codewords claim 2^(L - li) = {4, 2, 1, 1} of them, with 4 + 2 + 1 + 1 = 8 ≤ 8.
Kraft-McMillan Inequality
Invalid code: {0, 10, 11, 110, 111}
[Figure: code tree for {0, 10, 11, 110, 111}; the sub-trees of 11, 110 and 111 overlap]
It would need more L-th level offspring than the tree has: with L = 3, ∑ 2^(L - li) = 4 + 2 + 2 + 1 + 1 = 10 > 2^3 = 8.
 K-M inequality is violated:  ∑_i 2^(-li) = 1/2 + 1/4 + 1/4 + 1/8 + 1/8 = 1.25 > 1
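A tiny Kraft-inequality checker for these two examples (Python sketch; the helper name is ours):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^(-li); a prefix code with these codeword lengths exists iff this is <= 1."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))      # {0, 10, 110, 111}      -> 1.0   (valid, and tight)
print(kraft_sum([1, 2, 2, 3, 3]))   # {0, 10, 11, 110, 111}  -> 1.25  (> 1, no prefix code)
```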
Extended Kraft Inequality
A countably infinite prefix code also satisfies the Kraft inequality:
   ∑_{i=1}^{∞} D^(-li) ≤ 1
 It has an infinite number of codewords.
Example:
 0, 10, 110, 1110, 11110, 111……10, ……
(Golomb-Rice code, next lecture)
 Each codeword can be mapped to a subinterval of [0, 1] that is disjoint from the others (revisited in arithmetic coding):
[Figure: 0 → [0, 0.5), 10 → [0.5, 0.75), 110 → [0.75, 0.875), ……]
Optimal Codes (Advanced Topic)
How to design the prefix code with the minimal expected length?
Optimization problem: find {li} to
   min ∑_i pi li   s.t.   ∑_i D^(-li) ≤ 1
 Lagrangian solution:
 Ignore the integer codeword length constraint for now
 Assume equality holds in the Kraft inequality
 Minimize  J = ∑_i pi li + λ ∑_i D^(-li)
Optimal Codes
   J = ∑_i pi li + λ ∑_i D^(-li)
Let  ∂J/∂li = pi - λ (ln D) D^(-li) = 0,  which gives  D^(-li) = pi / (λ ln D).
Substituting into ∑_i D^(-li) = 1 gives λ = 1 / ln D, so
   D^(-li) = pi ,  or  li* = -log_D pi
The optimal codeword length is the self-information of an event.
Expected codeword length:
   L* = ∑_i pi li* = -∑_i pi log_D pi = H_D(X)    Entropy of X !
Optimal Code
Theorem: the expected length L of any prefix code is greater than or equal to the entropy:
   L ≥ H_D(X),
with equality iff D^(-li) = pi , i.e., li* = -log_D pi , which is not an integer in general (unless pi is dyadic: 1/2, 1/4, 1/8, 1/16, …).
 Proof:
   L - H_D(X) = ∑_i pi li + ∑_i pi log_D pi
              = -∑_i pi log_D D^(-li) + ∑_i pi log_D pi
              = ∑_i pi log_D ( pi / D^(-li) )
This reminds us of the definition of the relative entropy D(p || q), but we need to normalize D^(-li).
Optimal Code
 Define the normalized distribution  ri = D^(-li) / ∑_j D^(-lj).  Then
   L - H_D(X) = ∑_i pi log_D ( pi / D^(-li) )
              = ∑_i pi log_D ( pi / ri ) + log_D ( 1 / ∑_j D^(-lj) )
              = D(p || r) + log_D ( 1 / ∑_j D^(-lj) )  ≥ 0,
because D(p || r) ≥ 0, and ∑_i D^(-li) ≤ 1 for a prefix code.
The equality holds iff both terms are 0:  D^(-li) = pi , i.e., -log_D pi is an integer.
Optimal Code
D-adic: a probability distribution is called D-adic with respect to D if each probability is equal to D^(-n) for some integer n.
 Example: {1/2, 1/4, 1/8, 1/8}
Therefore optimality can be achieved by a prefix code iff the distribution is D-adic.
 Previous example:  -log_D pi = {1, 2, 3, 3}
 Possible codewords:
o {0, 10, 110, 111}
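A quick check that this dyadic case achieves the entropy exactly (Python sketch; the names are ours):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]
lengths = [round(-math.log2(pi)) for pi in p]   # -log2(pi) = {1, 2, 3, 3}, all integers

avg_len = sum(pi * li for pi, li in zip(p, lengths))
entropy = sum(pi * math.log2(1 / pi) for pi in p)
print(lengths, avg_len, entropy)   # [1, 2, 3, 3], 1.75, 1.75 -> L = H(X)
```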
Shannon Code: Bounds on Optimal Code
li* = -log_D pi is not an integer in general, but practical codeword lengths have to be integers.
Shannon code:  li = ⌈ log_D( 1/pi ) ⌉
Is this a valid prefix code? Check the Kraft inequality:
   ∑_i D^(-li) = ∑_i D^(-⌈log_D(1/pi)⌉) ≤ ∑_i D^(-log_D(1/pi)) = ∑_i pi = 1.   Yes!
Since  log_D( 1/pi ) ≤ li < log_D( 1/pi ) + 1, taking the expectation gives
   H_D(X) ≤ L < H_D(X) + 1
This is just one choice. It may not be optimal (see the example later).
Optimal Code
The optimal code with integer lengths is at least as good as the Shannon code:
   H_D(X) ≤ L* < H_D(X) + 1
 To reduce the overhead per symbol:
 Encode a block of symbols {x1, x2, …, xn} together:
   Ln = (1/n) ∑ p(x1, x2, …, xn) l(x1, x2, …, xn) = (1/n) E[ l(x1, x2, …, xn) ]
   H(X1, X2, …, Xn) ≤ E[ l(X1, X2, …, Xn) ] < H(X1, X2, …, Xn) + 1
Assume i.i.d. samples: H(X1, X2, …, Xn) = n H(X), so
   H(X) ≤ Ln < H(X) + 1/n
More generally, Ln → H(X) if the source is stationary (entropy rate).
Optimal Code
Impact of the wrong pdf: what is the penalty if the pdf we use is different from the true pdf?
True pdf: p(x)        Codeword length: l(x)
Estimated pdf: q(x)   Expected length: E_p[ l(X) ]
   H(p) + D(p || q) ≤ E_p[ l(X) ] < H(p) + D(p || q) + 1
Proof: assume a Shannon code designed for q:  l(x) = ⌈ log( 1/q(x) ) ⌉.  Then
   E_p[ l(X) ] = ∑_x p(x) ⌈ log( 1/q(x) ) ⌉ < ∑_x p(x) ( log( 1/q(x) ) + 1 )
              = ∑_x p(x) log( p(x) / ( q(x) p(x) ) ) + 1
              = ∑_x p(x) log( p(x)/q(x) ) + ∑_x p(x) log( 1/p(x) ) + 1
              = D(p || q) + H(p) + 1.
The lower bound is derived similarly.
Shannon Code is not Optimal
Example:
 Binary r.v. X: p(0) = 0.9999, p(1) = 0.0001.
Entropy: 0.0015 bits/sample.
Assign binary codewords by the Shannon code  l(x) = ⌈ log2( 1/p(x) ) ⌉:
   ⌈ log2( 1/0.9999 ) ⌉ = 1,    ⌈ log2( 1/0.0001 ) ⌉ = 14.
 Expected length: 0.9999 x 1 + 0.0001 x 14 = 1.0013.
 Within the range [H(X), H(X) + 1].
 But we can easily beat this with the code {0, 1}:
 Expected length: 1.
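The numbers above are easy to reproduce (Python sketch; `shannon_lengths` is our own helper name):

```python
import math

def shannon_lengths(probs):
    """Shannon code lengths: l(x) = ceil(log2(1/p(x)))."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

p = [0.9999, 0.0001]
lengths = shannon_lengths(p)                            # [1, 14]
avg_len = sum(pi * li for pi, li in zip(p, lengths))
entropy = sum(pi * math.log2(1 / pi) for pi in p)
print(lengths, round(avg_len, 4), round(entropy, 4))    # [1, 14], 1.0013, 0.0015
# The trivial code {0, 1} has expected length 1, beating the Shannon code here.
```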
Q&A