1. UNIVERSITY OF MINES AND TECHNOLOGY (UMAT) - TARKWA
School of Railways and Infrastructure Development (SRID)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INFORMATION THEORY
CE 379
Course Lecturer: Engr Dr Albert K Kwansah Ansah PE MIETGH
2. ENTROPY, RELATIVE ENTROPY, AND MUTUAL
INFORMATION
3. ENTROPY, RELATIVE ENTROPY AND MUTUAL
INFORMATION
In this chapter:
We will introduce key measures of information that play crucial roles in theoretical and operational characterisations throughout the course: the entropy, the mutual information, and the relative entropy.
We will also examine some key properties of these information measures.
4. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Notation
Random variables (objects), used somewhat "loosely": X, Y
Alphabets (the sets of possible outcomes): A = {a1, a2, …, aI}, B = {b1, b2, …, bJ}
Specific values (outcomes): x, y
5. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
The information gained from observing an event of probability p is
I = log2(1/p) = − log2 p (1)
so that less probable events convey more information.
6. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
7. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Intuitively, the more improbable an event is, the more informative it is, so the monotonic behaviour of (1) seems appropriate.
But why the logarithm?
The log measure is justified by the desire for information to be additive, so that the algebra reflects the Rules of Probability.
Thus the total information received is the sum of the individual pieces, while the probabilities of independent events multiply to give their combined probability.
8. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Logs are taken in order for the joint probability of
independent events or messages to contribute
additively to the information gained.
NB: This principle can also be understood in terms
of the combinatorics of state spaces.
9. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Example:
Assume we have two independent problems, one with n possible solutions or states each having probability pn, and the other with m possible solutions or states each having probability pm.
Then the number of combined states is mn, and each of these has probability pm·pn. We want to say that the information gained by specifying the solution to both problems is the sum of that gained from each one.
This desired property is achieved:
log2(1/(pm·pn)) = log2(1/pm) + log2(1/pn) (2)
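To see the additivity concretely, here is a minimal Python sketch (the probabilities are arbitrary illustrative values):
```python
import math

def info_bits(p):
    """Information (surprisal) in bits of an event with probability p, per (1)."""
    return math.log2(1 / p)

p_m, p_n = 0.25, 0.10  # illustrative probabilities of two independent events

# The information from the joint event equals the sum of the pieces, per (2).
print(info_bits(p_m * p_n))             # 5.3219...
print(info_bits(p_m) + info_bits(p_n))  # 5.3219...
```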
10. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
A Note on Logarithms:
In information theory we wish to compute base-2 logarithms, but most calculators offer only Napierian (base e ≈ 2.718) and decimal (base 10) logarithms. So the following conversions are useful:
log2 X = ln X / ln 2 = log10 X / log10 2 ≈ 1.443 ln X ≈ 3.322 log10 X
Henceforth we omit the subscript; base 2 is always presumed.
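A one-line check of these conversions in Python (a minimal sketch using only the standard library):
```python
import math

x = 26.0
# All three expressions compute the base-2 logarithm of x.
print(math.log2(x), math.log(x) / math.log(2), math.log10(x) / math.log10(2))
# 4.700439718141092 in each case
```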
11. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Entropy of Ensembles:
An ensemble is the set of outcomes of one or more random variables, i.e. probabilities are attached to the outcomes.
The probabilities need not be uniform: event i has probability pi, and the probabilities sum to 1 because all possible outcomes are included.
Hence they form a probability distribution:
Σi pi = 1 (3)
12. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Entropy of Ensembles (cont'd):
The entropy H of an ensemble is simply the average information content of all the elements in it.
It can be computed by weighting each of the log(1/pi) contributions by its probability pi:
H = Σi pi log2(1/pi) = − Σi pi log2 pi (4)
Equation (4) allows us to talk of the information content or entropy of a random variable, from knowledge of the probability distribution that it obeys.
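Equation (4) translates directly into code. A minimal Python sketch (the example distributions are arbitrary):
```python
import math

def entropy(probs):
    """Entropy in bits of a probability distribution, per (4)."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1, per (3)"
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(entropy([0.25] * 4))         # 2.0 bits (uniform over four outcomes)
```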
13. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Entropy of Ensembles (cont'd):
NB: H does not depend upon the actual values taken by the random variable, only upon their relative probabilities.
Scenario:
Consider a random variable that takes on only two values, one with probability p and the other with probability (1 − p). Its entropy is
H = p log2(1/p) + (1 − p) log2(1/(1 − p))
H is a concave function of p, and equals 0 if p = 0 or p = 1:
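Evaluating this at a few points (a minimal sketch) shows the concave shape, zero at the endpoints and a maximum of 1 bit at p = 1/2:
```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-valued variable with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty at the endpoints
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))  # 0.0, 0.469, 1.0, 0.469, 0.0
```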
14. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Entropy of Ensembles (cont'd):
[Figure: the binary entropy H plotted against p; a concave curve that is zero at p = 0 and p = 1 and peaks at 1 bit when p = 1/2.]
15. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Example of entropy as average uncertainty:
The various letters of the English language have the following relative frequencies (probabilities), in descending order:
[Table of relative letter frequencies not recovered.]
If the letters were equiprobable, the entropy of the ensemble would be log2 26 ≈ 4.7 bits; weighted by the actual frequencies, the average entropy is lower, roughly 4 bits.
This means that as few as four 'Yes/No' questions are needed, in principle, to identify one of the 26 letters of the alphabet.
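A quick check of the equiprobable case using (4) (a minimal sketch; the true non-uniform frequencies give a lower value, roughly 4 bits):
```python
import math

uniform = [1 / 26] * 26
H_uniform = sum(p * math.log2(1 / p) for p in uniform)
print(H_uniform)  # 4.700439718141092, i.e. log2(26)
```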
16. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Example of entropy as average uncertainty (cont’d):
How can this be true?
That is the subject matter of Shannon's SOURCE CODING THEOREM.
Note the important assumption: the "source statistics" must be known, i.e. the a priori probabilities of the message generator, in order to construct an optimal code.
17. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Several further measures of entropy need to be defined, involving the marginal, joint, and conditional probabilities of random variables.
Some key relationships will emerge that we can apply to the analysis of communication channels.
Notation
Capital letters X and Y name random variables;
lower-case x and y refer to their respective outcomes.
18. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Notation (cont'd)
These are drawn from particular sets:
x ∈ A = {a1, a2, …, aI} and y ∈ B = {b1, b2, …, bJ}
The probability of a particular outcome, p(x = ai), is denoted pi, with 0 ≤ pi ≤ 1 and Σi pi = 1.
Joint ensemble
An ensemble is just a random variable X, whose entropy was defined in (4).
19. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Joint ensemble (cont'd)
A joint ensemble 'XY' is an ensemble whose outcomes are ordered pairs x, y with x ∈ A and y ∈ B.
The joint ensemble XY defines a probability distribution p(x, y) over all possible joint outcomes x, y.
Marginal probability
From the Sum Rule, the probability of X taking on the value x = ai is the sum of the joint probabilities of this outcome for X and all possible outcomes for Y:
p(x = ai) = Σy p(x = ai, y)
20. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Marginal probability (cont'd)
We can simplify this notation to:
p(x) = Σy p(x, y)
and similarly:
p(y) = Σx p(x, y)
Conditional probability:
From the Product Rule, we see that the conditional probability that x = ai, given that y = bj, is:
p(x = ai | y = bj) = p(x = ai, y = bj) / p(y = bj)
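These rules are easy to mirror in code. Below is a minimal Python sketch using a made-up 2x2 joint distribution p(x, y) stored as nested lists (all numbers are illustrative):
```python
# Joint distribution p(x, y) over x in {0, 1} and y in {0, 1} (illustrative values).
p_xy = [[0.30, 0.20],
        [0.10, 0.40]]

# Marginals via the Sum Rule.
p_x = [sum(row) for row in p_xy]        # [0.5, 0.5]
p_y = [sum(col) for col in zip(*p_xy)]  # [0.4, 0.6]

# Conditional p(x | y) via the Product Rule: p(x, y) / p(y).
p_x_given_y = [[p_xy[x][y] / p_y[y] for y in range(2)] for x in range(2)]
print(p_x, p_y, p_x_given_y)
```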
21. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Conditional probability (cont'd)
We can simplify this notation to:
p(x | y) = p(x, y) / p(y)
and similarly:
p(y | x) = p(x, y) / p(x)
We now define various entropy measures for joint ensembles:
22. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Joint entropy of XY
H(X, Y) = Σx,y p(x, y) log2(1/p(x, y)) (5)
Comparing (5) to (4), note that the '−' sign in front is replaced by taking the reciprocal of p inside the logarithm.
From this definition, it follows that joint entropy is additive if X and Y are independent R.V.s:
H(X, Y) = H(X) + H(Y) iff p(x, y) = p(x)p(y)
ASSIGNMENT: Prove this.
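A numerical illustration of (5) and the additivity property (a minimal sketch, not the assigned proof; the independent joint distribution is built as an outer product of two made-up marginals):
```python
import math

def entropy(probs):
    """Entropy in bits, per (4)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def joint_entropy(p_xy):
    """Joint entropy in bits of a joint distribution given as nested lists, per (5)."""
    return sum(p * math.log2(1 / p) for row in p_xy for p in row if p > 0)

p_x = [0.5, 0.5]
p_y = [0.9, 0.1]
independent = [[px * py for py in p_y] for px in p_x]  # p(x, y) = p(x) p(y)

print(joint_entropy(independent))   # 1.4690...
print(entropy(p_x) + entropy(p_y))  # 1.4690... -> additive when independent
```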
23. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Conditional entropy of an ensemble X, given y = bj
This measures the uncertainty remaining about random variable X after specifying that R.V. Y has taken on the particular value y = bj.
It is defined naturally as the entropy of the probability distribution p(x | y = bj):
H(X | y = bj) = Σx p(x | y = bj) log2(1/p(x | y = bj)) (6)
If we now consider the above quantity averaged over all possible outcomes of Y, each weighted by its probability p(y), then we arrive at the...
24. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Conditional entropy of an ensemble X, given an ensemble Y:
H(X | Y) = Σy p(y) Σx p(x | y) log2(1/p(x | y)) (7)
and, from the Product Rule, if we move p(y) from the outer summation over y to inside the inner summation over x, the two probability terms combine and become just p(x, y), summed over all x, y:
H(X | Y) = Σx,y p(x, y) log2(1/p(x | y)) (8)
This measures the average uncertainty that remains about X when Y is known.
25. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Chain Rule for Entropy
The joint entropy, conditional entropy, and marginal entropy for two ensembles X and Y are related by:
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y) (9)
The joint entropy of a pair of R.V.s is the entropy of one plus the conditional entropy of the other.
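The sketch below checks (8) and the chain rule (9) numerically on the same illustrative 2x2 joint distribution used earlier (all values made up):
```python
import math

p_xy = [[0.30, 0.20],
        [0.10, 0.40]]
p_y = [sum(col) for col in zip(*p_xy)]

def entropy(probs):
    """Entropy in bits, per (4)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# H(X | Y) per (8): note that 1 / p(x | y) = p(y) / p(x, y).
H_x_given_y = sum(p_xy[x][y] * math.log2(p_y[y] / p_xy[x][y])
                  for x in range(2) for y in range(2) if p_xy[x][y] > 0)

H_joint = entropy([p for row in p_xy for p in row])  # H(X, Y) per (5)
print(H_joint, entropy(p_y) + H_x_given_y)           # equal, per the chain rule (9)
```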
26. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Corollary to the Chain Rule
If X, Y, and Z are discrete R.V.s, then conditionalising the joint distribution of any two upon the third is also expressed by a Chain Rule:
H(X, Y | Z) = H(X | Z) + H(Y | X, Z) (10)
27. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Independence Bound on Entropy
A consequence of the Chain Rule for Entropy is that if there are many different R.V.s X1, X2, …, Xn, then the sum of all their individual entropies is an upper bound on their joint entropy:
H(X1, X2, …, Xn) ≤ Σi H(Xi) (11)
Their joint entropy only reaches this upper bound if all of the R.V.s are independent.
28. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Mutual Information between X and Y
The mutual information between two R.V.s measures the amount of information that one conveys about the other.
Equivalently, it measures the average reduction in uncertainty about X that results from learning about Y.
It is defined as:
I(X; Y) = Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ] (12)
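A direct implementation of (12) (a minimal sketch, reusing the illustrative 2x2 joint distribution from earlier slides):
```python
import math

p_xy = [[0.30, 0.20],
        [0.10, 0.40]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

# Mutual information per (12); it would be exactly 0 if p(x, y) = p(x) p(y).
I_xy = sum(p_xy[x][y] * math.log2(p_xy[x][y] / (p_x[x] * p_y[y]))
           for x in range(2) for y in range(2) if p_xy[x][y] > 0)
print(I_xy)  # about 0.125 bits
```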
29. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Mutual Information between X and Y
X says as much about Y as Y says about X.
NB: If X and Y are independent R.V.s, then the numerator inside the logarithm equals the denominator, and the mutual information is zero.
Non-negativity: mutual information is always ≥ 0.
When the two R.V.s are perfectly correlated, their mutual information is the entropy of either one.
30. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Mutual Information between X and Y
Thus, I(X; X) = H(X): the mutual information of a R.V. with itself is just its entropy.
Hence the entropy H(X) of a random variable X is sometimes referred to as its self-information.
These properties are reflected in three equivalent definitions for the mutual information between X and Y:
I(X; Y) = H(X) − H(X | Y) (13)
I(X; Y) = H(Y) − H(Y | X) = I(Y; X) (14)
I(X; Y) = H(X) + H(Y) − H(X, Y) (15)
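These identities can be verified numerically on the same illustrative joint distribution (a minimal sketch; identity (15) is checked directly, and (13) and (14) follow via the chain rule (9)):
```python
import math

p_xy = [[0.30, 0.20],
        [0.10, 0.40]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

I = sum(p_xy[x][y] * math.log2(p_xy[x][y] / (p_x[x] * p_y[y]))
        for x in range(2) for y in range(2) if p_xy[x][y] > 0)  # per (12)

H_joint = entropy([p for row in p_xy for p in row])
print(round(I, 9) == round(entropy(p_x) + entropy(p_y) - H_joint, 9))  # True, per (15)
```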
31. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Mutual Information between X and Y
Effectively, the mutual information I(X; Y) is the intersection between H(X) and H(Y), since it represents their statistical dependence.
In the Venn diagram, the portion of H(X) that does not lie within I(X; Y) is just H(X | Y), and the portion of H(Y) that does not lie within I(X; Y) is just H(Y | X).
32. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
[Figure: Venn diagram illustrating the relationship between entropy and mutual information.]
33. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Distance D(X, Y) between X and Y
The amount by which the joint entropy of two R.V.s exceeds their mutual information is a measure of the "distance" between them:
D(X, Y) = H(X, Y) − I(X; Y) (16)
NB: This quantity satisfies the standard axioms for a distance:
D(X, Y) ≥ 0, D(X, X) = 0, D(X, Y) = D(Y, X), and
D(X, Z) ≤ D(X, Y) + D(Y, Z)
34. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Relative entropy, or Kullback-Leibler distance
Another important measure of the "distance" between two R.V.s is the relative entropy, or Kullback-Leibler distance.
It is also called the information for discrimination.
If p(x) and q(x) are two probability distributions defined over the same set of outcomes x, then their relative entropy is:
DKL(p‖q) = Σx p(x) log2 [ p(x) / q(x) ] (17)
35. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Relative entropy, or Kullback-Leibler distance
NB: DKL(p‖q) ≥ 0, and if p(x) = q(x) then DKL(p‖q) = 0.
This metric is not strictly a "distance", since in general it lacks symmetry: DKL(p‖q) ≠ DKL(q‖p).
The relative entropy DKL(p‖q) is a measure of the "inefficiency" of assuming that a distribution is q(x) when in fact it is p(x).
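A direct implementation of (17), with made-up distributions showing the lack of symmetry (a minimal sketch):
```python
import math

def kl_divergence(p, q):
    """Relative entropy DKL(p||q) in bits, per (17).
    Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # 0.7370 bits
print(kl_divergence(q, p))  # 0.5310 bits -> not symmetric
print(kl_divergence(p, p))  # 0.0 when the distributions agree
```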
36. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Relative entropy, or Kullback-Leibler distance
Example
Suppose we have an optimal code for the distribution p(x), i.e. on average we use H(p(x)) bits, its entropy, to describe it. Then the number of additional bits that we would need if we instead described p(x) using an optimal code for q(x) is their relative entropy DKL(p‖q).
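This coding interpretation can be illustrated numerically: the average code length under the mismatched code exceeds H(p) by exactly DKL(p‖q) (a minimal sketch; both distributions are made up):
```python
import math

p = [0.5, 0.25, 0.25]   # true source distribution
q = [0.25, 0.25, 0.5]   # distribution the mismatched code was designed for

H_p   = sum(pi * math.log2(1 / pi) for pi in p)              # optimal: H(p) bits/symbol
cross = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))  # average with q's code
kl    = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(H_p, cross, cross - H_p, kl)  # 1.5, 1.75, 0.25, 0.25 -> penalty equals DKL(p||q)
```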
37. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Fano’s Inequality
We note that conditioning reduces entropy: H(X | Y) ≤ H(X).
If X and Y are perfectly correlated, then their conditional entropy is 0.
Indeed, if X is any deterministic function of Y, then no uncertainty remains about X once Y is known, and so their conditional entropy H(X | Y) = 0.
38. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
Fano’s Inequality
Fano's Inequality relates the probability of error Pe in guessing X from knowledge of Y to their conditional entropy H(X | Y), when the number of possible outcomes is |A| (the size of the symbol alphabet):
Pe ≥ (H(X | Y) − 1) / log2 |A| (18)
The lower bound on Pe is a linearly increasing function of H(X | Y).
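Evaluating the bound in (18) directly (a minimal sketch; the residual uncertainty and alphabet size are made-up numbers):
```python
import math

def fano_lower_bound(h_x_given_y, alphabet_size):
    """Lower bound on the error probability Pe, per (18)."""
    return (h_x_given_y - 1) / math.log2(alphabet_size)

# With |A| = 26 outcomes and 3 bits of residual uncertainty H(X|Y),
# any guessing strategy must err at least ~42.5% of the time.
print(fano_lower_bound(3.0, 26))  # 0.4254...
```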
39. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
The “Data Processing Inequality”
If R.V.s X, Y, and Z form a Markov chain, i.e. the conditional distribution of Z depends only on Y and is conditionally independent of X, denoted X → Y → Z, then the mutual information must be monotonically decreasing over steps along the chain:
I(X; Y) ≥ I(X; Z) (19)
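The inequality can be illustrated with a simple chain X → Y → Z in which each arrow is a binary symmetric channel flipping its input with some probability (a minimal sketch; the flip probabilities are made up, and cascading two such channels is equivalent to one with a larger effective flip probability):
```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

e1, e2 = 0.1, 0.2                    # flip probabilities of the two channels
e12 = e1 * (1 - e2) + (1 - e1) * e2  # effective flip probability of the cascade

# For a uniform binary input, I(input; output) of a flip-p channel is 1 - H(p).
I_xy = 1 - binary_entropy(e1)
I_xz = 1 - binary_entropy(e12)
print(I_xy, I_xz, I_xy >= I_xz)  # 0.531, 0.173, True, as (19) requires
```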
40. ENTROPIES DEFINED AND WHY THEY ARE MEASURES OF
INFORMATION (CONT’D)
We now turn to applying these measures and relationships to the study of communication channels.