Block 2: Introduction to Information Theory
Francisco J. Escribano
October 25, 2016
Table of contents
1 Motivation
2 Entropy
3 Source coding
4 Mutual information
5 Discrete memoryless channels
6 Entropy and mutual information for continuous RRVV
7 AWGN channel capacity theorem
8 Bridging towards channel coding
9 Conclusions
10 References
Motivation
Information Theory is a discipline established during the second half of the 20th century.
It relies on solid mathematical foundations [1, 2, 3, 4].
It tries to address two basic questions:
◮ To what extent can we compress data for a more efficient usage of the limited communication resources? → Entropy
◮ What is the largest possible data transfer rate for given resources and conditions? → Channel capacity
Key concepts for Information Theory are thus entropy (H (X)) and
mutual information (I (X; Y)).
◮ X, Y are random variables (RRVV) of some kind.
◮ Information will naturally be characterized by means of RRVV.
Motivation
Up to the 1940s, it was common wisdom in telecommunications that the error rate increased with increasing data rate.
◮ Claude Shannon demonstrated that error-free transmission may be possible under certain conditions.
Information Theory provides strict bounds for any communication system.
◮ Maximum compression → H(X); or minimum I(X; X̂), with controlled distortion.
◮ Maximum transfer rate → maximum I(X; Y).
◮ Any given communication system works between these limits.
The mathematics behind these results is not always constructive, but it provides guidelines to design algorithms that improve communications given a set of available resources.
◮ The resources in this context are parameters such as available transmission power, available bandwidth, signal-to-noise ratio and the like.
[Block diagram: source X → compression → X̂ → channel → Y]
Entropy
Consider a discrete memoryless source that emits, at a given symbol rate, symbols from a given set, each chosen randomly and independently of the previous and subsequent ones:
ζ = {s_0, · · · , s_{K−1}},  P(S = s_k) = p_k,  k = 0, 1, · · · , K − 1;  K is the source radix.
The information quantity of a symbol is a random variable defined as
I(s_k) = log2(1/p_k)
with properties:
◮ I(s_k) = 0 if p_k = 1
◮ I(s_k) > I(s_i) if p_k < p_i
◮ I(s_k) ≥ 0, for 0 ≤ p_k ≤ 1
◮ I(s_l, s_k) = I(s_l) + I(s_k) (for independently emitted symbols)
The definition is conventional but also convenient: it relates uncertainty and information.
Entropy
The source entropy is a measure of its “information content”, and it is defined as
H(ζ) = E_{p_k}[I(s_k)] = Σ_{k=0}^{K−1} p_k · I(s_k) = Σ_{k=0}^{K−1} p_k · log2(1/p_k)
Its bounds are 0 ≤ H(ζ) ≤ log2(K):
◮ If p_j = 1 and p_k = 0 for all k ≠ j, then H(ζ) = 0.
◮ If p_k = 1/K for k = 0, · · · , K − 1, then H(ζ) = log2(K).
There is no potential information (uncertainty) in deterministic (“degenerate”) sources.
Entropy is conventionally measured in “bits”.
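A minimal Python sketch of this computation (the pmf values below are illustrative, not taken from the slides):

```python
import math

def entropy(pmf):
    """H(ζ) = Σ p_k · log2(1/p_k), in bits, for a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))            # 1.0 bit: uniform binary source
print(entropy([0.25] * 4))            # 2.0 bits = log2(K) for K = 4 equiprobable symbols
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 bits: deterministic ("degenerate") source
```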
Entropy
E.g., a memoryless binary source: H(p) = −p · log2(p) − (1 − p) · log2(1 − p).
Figure 1: Entropy of a memoryless binary source.
Source coding
The field of source coding addresses the issues related to handling the output data from a given source, from the point of view of Information Theory.
One of the main issues is data compression, its theoretical limits and the related practical algorithms.
◮ The most important prerequisite in communications at the PHY is to keep data integrity → any transformation has to be fully invertible (lossless compression).
◮ Lossy compression is a key field at other communication levels: for example, compressing a video stream to the MPEG-2 standard to create a transport frame.
◮ In other related fields, like cryptography, some non-invertible algorithms for data compression are of utmost interest (e.g. hash functions, to create almost-unique tokens).
Source coding
We may choose to represent the data from the source by assigning to each symbol s_k a corresponding codeword (binary, in our case).
The aim of this source coding process is to try to represent the source symbols more efficiently.
◮ We focus on variable-length, invertible binary codes → the codewords are unique blocks of binary symbols of length l_k, one for each symbol s_k.
◮ The correspondence codeword ↔ original symbol constitutes a code.
Average codeword length:
L = E[l_k] = Σ_{k=0}^{K−1} p_k · l_k
Coding efficiency:
η = L_min / L ≤ 1
Shannon’s source coding theorem
This theorem establishes the limits for lossless data compression [5].
◮ N iid (independent and identically distributed) random variables with entropy H(ζ) each can be compressed into N · H(ζ) bits with negligible risk of information loss as N → ∞. If they are compressed into fewer than N · H(ζ) bits, it is certain that some information will be lost.
In practical terms, for a single random variable, this means
L_min ≥ H(ζ)
And, therefore, the coding efficiency can be defined as
η = H(ζ) / L
Like other results in Information Theory, this theorem provides the limits, but does not tell us how to actually reach them.
Example of source coding: Huffman coding
Huffman coding provides a practical algorithm to perform source coding within the limits shown.
It is an instance of a class of codes called prefix codes:
◮ No binary word within the codeset is a prefix of any other one.
Properties:
◮ Unique coding.
◮ Instantaneous decoding.
◮ The lengths l_k meet the Kraft-McMillan inequality [2]:
Σ_{k=0}^{K−1} 2^{−l_k} ≤ 1.
◮ The average length L of the code is bounded by
H(ζ) ≤ L < H(ζ) + 1, with H(ζ) = L ↔ p_k = 2^{−l_k}.
Meeting the Kraft-McMillan inequality guarantees that a prefix code
with the given lk can be constructed.
Huffman coding
Algorithm to perform Huffman coding (a code sketch follows the list):
1 List the symbols s_k in a column, in order of decreasing probabilities.
2 Combine the last 2 symbols: their probabilities are added into a new dummy compound value/symbol.
3 Reorder the probability set in an adjacent column, placing the new dummy symbol probability as high as possible and retiring the values involved.
4 In the process of moving probabilities to the right, keep the values of the unaffected symbols (making room for the compound value if needed), but assign a 0 to one of the symbols affected and a 1 to the other (top or bottom, but always keep the same criterion along the process).
5 Start the process afresh in the new column.
6 When only two probabilities remain, assign the last 0 and 1 and stop.
7 Write the binary codeword corresponding to each original symbol by tracing back the trajectory of the symbol and the dummy symbols it takes part in, from the last column towards the initial one.
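A minimal Python sketch of this procedure, assuming the source is given as a dict of symbol probabilities (the symbols and values are illustrative); it replaces the explicit columns with a heap of partial groups, assigning 0 to one group and 1 to the other at each combination step:

```python
import heapq
import math

def huffman_code(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # Heap entries: (probability, tie-breaker, [(symbol, partial codeword), ...])
    heap = [(p, i, [(s, "")]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # the two least probable entries
        p1, _, group1 = heapq.heappop(heap)
        group0 = [(s, "0" + c) for s, c in group0]   # prepend 0 to one group...
        group1 = [(s, "1" + c) for s, c in group1]   # ...and 1 to the other
        heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
        counter += 1
    return dict(heap[0][2])

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
code = huffman_code(probs)
H = sum(p * math.log2(1 / p) for p in probs.values())
L = sum(probs[s] * len(code[s]) for s in probs)                 # average length
eta = H / L                                                     # efficiency
var = sum(probs[s] * (len(code[s]) - L) ** 2 for s in probs)    # length variance
print(code, H, L, eta, var)
```

The last four quantities are exactly the ones used on the next slide to characterize the resulting code.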
Huffman coding example
Figure 2: Example of Huffman coding: assigning final codeword patterns proceeds from right to left.
To characterize the resulting code, it is important to calculate:
H(ζ);  L = Σ_{k=0}^{K−1} p_k · l_k;  η = H(ζ)/L;  σ² = Σ_{k=0}^{K−1} p_k · (l_k − L)²
Managing efficiency in source coding
From the point of view of data communication (sequences of data symbols), it is useful to define the n-th extension of the source ζ:
◮ A symbol of the n-th extension of the source is built by taking n successive symbols from it (symbols in ζ^n).
◮ If the source is iid, for s_i ∈ ζ, the new extended source ζ^n is characterized by the probabilities:
P(σ = (s_1 · · · s_n) ∈ ζ^n) = Π_{i=1}^{n} P(s_i ∈ ζ)
◮ Given that the sequence s_i in σ ∈ ζ^n is independent and identically distributed (iid) for any n, the entropy of this new source is:
H(ζ^n) = n · H(ζ)
By applying Huffman coding to ζ^n, we derive a code with average length L characterized by:
H(ζ^n) ≤ L < H(ζ^n) + 1
Managing efficiency in source coding
Considering the equivalent codeword length per original symbol,
L_s = L/n  →  H(ζ) ≤ L_s < H(ζ) + 1/n
The final efficiency turns out to be
η = H(ζ^n)/L = n·H(ζ)/L = H(ζ)/L_s
Efficiency thus tends to 1 as n → ∞, as shown by the previous inequality.
The method of the n-th extension of the source, nevertheless, is rendered impractical for large n, due to:
◮ Increasing delay.
◮ Increasing coding and decoding complexity.
Joint entropy
We extend the concept of entropy to 2 RRVV.
◮ 2 or more RRVV are needed when analyzing communication channels from the point of view of Information Theory.
◮ These RRVV can also be seen as a random vector.
◮ The underlying concept is the characterization of channel input vs channel output, and what we can learn about the former by observing the latter.
Joint entropy of 2 RRVV:
H(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(1/p(x, y)) = E_{p(x,y)}[log2(1/P(X, Y))]
Properties:
H(X, Y) ≥ max[H(X), H(Y)];  H(X, Y) ≤ H(X) + H(Y)
Conditional entropy
Conditional entropy of 2 RRVV:
H(Y|X) = Σ_{x∈X} p(x) · H(Y|X = x)
       = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) · log2(1/p(y|x))
       = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(1/p(y|x)) = E_{p(x,y)}[log2(1/P(Y|X))]
H(Y|X) is a measure of the uncertainty in Y once X is known.
Properties:
H(Y|Y) = 0;  H(Y|X) = H(Y) if X and Y are independent
Chain rule
Relation between joint and conditional entropy of 2 RRVV:
H (X, Y) = H (X) + H (Y|X) .
The expression points towards the following common wisdom result:
“joint knowledge about X and Y is the knowledge about X plus the
information in Y not related to X”.
Proof:
p(x, y) = p(x) · p(y|x);
log2(p(x, y)) = log2(p(x)) + log2(p(y|x));
E_{p(x,y)}[log2(P(X, Y))] = E_{p(x,y)}[log2(P(X))] + E_{p(x,y)}[log2(P(Y|X))];
changing signs and identifying each term with the corresponding entropy yields the result.
Corollary: H (X, Y|Z) = H (X|Z) + H (Y|X, Z).
As another consequence: H (Y|X) ≤ H (Y).
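A small numerical check of the chain rule, using a hypothetical 2×2 joint pmf (the probability values are made up for illustration):

```python
import numpy as np

p_xy = np.array([[0.30, 0.20],     # hypothetical joint pmf p(x, y); rows: x, columns: y
                 [0.10, 0.40]])

def H(p):
    """Entropy in bits of a pmf given as an array of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_x = p_xy.sum(axis=1)                                   # marginal of X
H_joint = H(p_xy.ravel())                                # H(X, Y)
H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i])           # H(Y|X) = Σ p(x)·H(Y|X = x)
                  for i in range(len(p_x)))
print(H_joint, H(p_x) + H_y_given_x)                     # both ≈ 1.85 bits
```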
Relative entropy
H(X) measures the quantity of information needed to describe the RV X on average.
The relative entropy D(p‖q) measures the increase in information needed to describe X when it is characterized by means of a distribution q(x) instead of p(x):
X described with p(x) → H(X)
X described with q(x) → H(X) + D(p‖q)
Definition of relative entropy (aka Kullback-Leibler divergence, or improperly “distance”):
D(p‖q) = Σ_{x∈X} p(x) · log2(p(x)/q(x)) = E_{p(x)}[log2(P(X)/Q(X))]
Note the conventions: lim_{x→0} x·log(x) = 0;  0·log(0/0) = 0;  p·log(p/0) = ∞.
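A minimal sketch of this computation for two pmfs defined over the same alphabet (the distributions p and q below are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p‖q) in bits; applies the conventions 0·log(0/q) = 0 and p·log(p/0) = ∞."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # ≈ 0.74 and 0.53: D is not symmetric
```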
Relative entropy and mutual information
Properties of relative entropy:
1 D(p‖q) ≥ 0.
2 D(p‖q) = 0 ↔ p(x) = q(x).
3 It is not symmetric. Therefore, it is not a true distance in the metric sense.
Mutual information of 2 RRVV:
I(X; Y) = H(X) − H(X|Y)
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(p(x, y)/(p(x)·p(y)))
        = D(p(x, y) ‖ p(x)·p(y)) = E_{p(x,y)}[log2(P(X, Y)/(P(X)·P(Y)))]
The mutual information between X and Y is the information in X, minus the information in X not related to Y.
Mutual information
Properties of mutual information:
1 Symmetry: I (X; Y) = I (Y; X).
2 Non-negative: I (X; Y) ≥ 0.
3 I (X; Y) = 0 ↔ X and Y are independent.
[Venn-style diagram relating H(X), H(Y), H(X, Y), H(X|Y), H(Y|X) and I(X; Y).]
Figure 3: Relationship between entropy and mutual information.
A glimpse of lossy compression
Now that we have defined mutual information, we can examine lossy compression under the so-called Rate-Distortion Theory.
A source X is compressed into a new source X̂, a process characterized by the probabilities
p(x);  p(x̂, x) = p(x̂|x)·p(x)
The quality of the compression is measured according to a distortion parameter d(x, x̂) and its expected value D:
D = E_{p(x̂,x)}[d(x, x̂)] = Σ_x Σ_x̂ d(x, x̂)·p(x̂|x)·p(x)
d(x, x̂) is any convenient distance measurement:
◮ The squared Euclidean distance, d(x, x̂) = |x − x̂|².
◮ The Hamming distance, d(x, x̂) = 1 if x ≠ x̂, and 0 otherwise.
A glimpse of lossy compression
In lossy compression, we want X̂ to be represented by the lowest number of bits, and this means finding a distribution p(x̂|x) such that
R = min_{p(x̂|x)} I(X; X̂) = min_{p(x̂|x)} [H(X) − H(X|X̂)]
R is the rate in bits of the compressed source.
The source is compressed to an entropy lower than H(X): since the process is lossy, H(X|X̂) ≠ 0.
The minimum is trivially 0 when X and X̂ are fully unrelated, so the minimization is performed subject to D ≤ D*, for a maximum admissible distortion D*.
Conversely, the problem can be addressed as the minimization of the distortion D subject to I(X; X̂) ≤ R*, for a maximum admissible rate R* < H(X).
Discrete memoryless channels
Discrete memoryless channels
Communication channel:
◮ Input/output system where the output is a probabilistic function of the input.
A DMC (Discrete Memoryless Channel) consists of:
◮ An input alphabet X = {x_0, x_1, · · · , x_{J−1}}, corresponding to RV X.
◮ An output alphabet Y = {y_0, y_1, · · · , y_{K−1}}, corresponding to RV Y, a noisy version of RV X.
◮ A set of transition probabilities linking input and output:
{p(y_k|x_j)}, k = 0, 1, · · · , K − 1; j = 0, 1, · · · , J − 1, with p(y_k|x_j) = P(Y = y_k|X = x_j)
◮ X and Y are finite and discrete.
◮ The channel is memoryless, since the output symbol depends only on the current input symbol.
Discrete memoryless channels
Channel matrix P, of size J × K:
P = [ p(y_0|x_0)      p(y_1|x_0)      · · ·  p(y_{K−1}|x_0)
      p(y_0|x_1)      p(y_1|x_1)      · · ·  p(y_{K−1}|x_1)
      ⋮
      p(y_0|x_{J−1})  p(y_1|x_{J−1})  · · ·  p(y_{K−1}|x_{J−1}) ]
The channel is said to be symmetric when each column is a permutation of any other, and each row is a permutation of any other.
Important property (each row of P sums to one):
Σ_{k=0}^{K−1} p(y_k|x_j) = 1, ∀ j = 0, 1, · · · , J − 1.
Discrete memoryless channels
Output Y is probabilistically determined by the input (a priori) probabilities and the channel matrix, following
p_X = (p(x_0), p(x_1), · · · , p(x_{J−1})), with p(x_j) = P(X = x_j)
p_Y = (p(y_0), p(y_1), · · · , p(y_{K−1})), with p(y_k) = P(Y = y_k)
p(y_k) = Σ_{j=0}^{J−1} p(y_k|x_j) · p(x_j), ∀ k = 0, 1, · · · , K − 1
Therefore, p_Y = p_X · P.
When J = K and y_j is the correct choice when sending x_j, we can calculate the average symbol error probability as
P_e = Σ_{j=0}^{J−1} p(x_j) · Σ_{k=0, k≠j}^{K−1} p(y_k|x_j)
The probability of correct transmission is 1 − P_e.
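A minimal sketch of these computations on a hypothetical channel matrix (here a binary symmetric channel with crossover probability 0.1 and equiprobable inputs, both chosen for illustration):

```python
import numpy as np

P = np.array([[0.9, 0.1],      # channel matrix: rows are p(y_k | x_j)
              [0.1, 0.9]])
p_x = np.array([0.5, 0.5])     # a priori input distribution p_X

p_y = p_x @ P                  # output distribution p_Y = p_X · P
# Average symbol error probability (J = K, y_j is correct when x_j was sent)
Pe = sum(p_x[j] * (P[j].sum() - P[j, j]) for j in range(len(p_x)))
print(p_y, Pe)                 # [0.5 0.5] and 0.1
```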
Discrete memoryless channels
Example of a discrete memoryless channel: modulation with hard decision.
Figure 4: 16-QAM transmitted constellation.
Figure 5: 16-QAM received constellation with noise.
Discrete memoryless channels
Example of a non-symmetric channel → the binary erasure channel, typical of storage systems.
◮ Reading data in storage systems can also be modeled as sending information through a channel with given probabilistic properties.
◮ An erasure marks complete uncertainty about the symbol read.
◮ There are methods to recover from erasures, based on the principles of Information Theory.
[Diagram: inputs x0 = 0, x1 = 1; outputs y0 = 0, y1 = ε (erasure), y2 = 1; each input is received correctly with probability 1 − p and erased with probability p.]
Figure 6: Diagram showing the binary erasure channel.
P = [ 1−p   p   0
      0     p   1−p ]
Channel capacity
Mutual information depends on P and p_X. Characterizing the possibilities of the channel requires removing the dependency on p_X.
Channel capacity is defined as
C = max_{p_X} I(X; Y)
◮ It is the maximum average mutual information enabled by the channel, in bits per channel use, and the best we can get out of it in terms of reliable information transfer.
◮ It only depends on the channel transition probabilities P.
◮ If the channel is symmetric, the distribution p_X that maximizes I(X; Y) is the uniform one (equiprobable symbols).
Channel coding is a process where controlled redundancy is added to protect data integrity.
◮ A channel-encoded information sequence is represented by a block of n bits, encoded from an information sequence represented by k ≤ n bits.
◮ The so-called code rate is calculated as R = k/n ≤ 1.
Noisy-channel coding theorem
Consider a discrete source ζ, emitting symbols with period T_s.
◮ The binary information rate of such a source is H(ζ)/T_s (bits/s).
Consider a discrete memoryless channel, used to send coded data every T_c seconds.
◮ The maximum possible data transfer rate would be C/T_c (bits/s).
The noisy-channel coding theorem states the following [5]:
◮ If H(ζ)/T_s ≤ C/T_c, there exists a coding scheme that guarantees error-free transmission (i.e. P_e arbitrarily small).
◮ Conversely, if H(ζ)/T_s > C/T_c, the communication cannot be made reliable (i.e. we cannot have a bounded P_e, as small as desired).
Please note that again the theorem is asymptotic, and not constructive:
it does not say how to actually reach the limit.
Noisy-channel coding theorem
Example: a binary symmetric channel.
Consider a binary source ζ = {0, 1}, with equiprobable symbols.
◮ H(ζ) = 1 info bit per source symbol.
◮ The source works at a rate of 1/T_s symbols/s, i.e. H(ζ)/T_s info bits/s.
Consider an encoder with rate k/n info/coded bits, and 1/T_c channel uses/s.
◮ Note that k/n = T_c/T_s.
The maximum achievable rate is C/T_c coded bits/s.
If H(ζ)/T_s = 1/T_s ≤ C/T_c, we could find a coding scheme so that P_e is made arbitrarily small.
◮ This means that an appropriate coding scheme has to meet k/n ≤ C in order to exploit the possibilities of the noisy-channel coding theorem.
Noisy-channel coding theorem
The theorem also states that, if a bit error probability of P_b is acceptable, coding rates up to
R(P_b) = C / (1 − H(P_b))
are achievable. Rates greater than that cannot be achieved with the given bit error probability.
In a binary symmetric channel without noise (error probability p = 0), it can be demonstrated that
C = max_{p_X} I(X; Y) = 1 bit/channel use.
Figure 7: Diagram showing the error-free binary symmetric channel.
P = [ 1   0
      0   1 ]
Noisy-channel coding theorem
In a binary symmetric channel with error probability p ≠ 0,
C = 1 − H(p) = 1 − [p · log2(1/p) + (1 − p) · log2(1/(1 − p))] bits/channel use.
Figure 8: Diagram showing the binary symmetric channel BSC(p).
P = [ 1−p   p
      p     1−p ]
In the binary erasure channel with erasure probability p,
C = max_{p_X} I(X; Y) = 1 − p bits/channel use.
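A minimal sketch of both capacity formulas (the value p = 0.11 is just an illustrative operating point):

```python
import math

def bsc_capacity(p):
    """C = 1 − H(p) for the binary symmetric channel BSC(p), in bits/channel use."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def bec_capacity(p):
    """C = 1 − p for the binary erasure channel with erasure probability p."""
    return 1.0 - p

print(bsc_capacity(0.11), bec_capacity(0.11))   # ≈ 0.50 and 0.89 bits/channel use
```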
Entropy and mutual information for continuous RRVV
Differential entropy
Differential entropy or continuous entropy of a continuous RV X with pdf f_X(x) is defined as
h(X) = ∫_{−∞}^{∞} f_X(x) · log2(1/f_X(x)) dx
It does not measure an absolute quantity of information, hence the term differential (it can take values lower than 0).
The differential entropy of a continuous random vector X = (X_0, · · · , X_{N−1}) with joint pdf f_X(x) = f_X(x_0, · · · , x_{N−1}) is defined analogously, with the integral taken over all the components:
h(X) = ∫ f_X(x) · log2(1/f_X(x)) dx
Differential entropy
For a given variance value σ², the Gaussian RV exhibits the largest achievable differential entropy.
◮ This means the Gaussian RV has a special place in the domain of continuous RRVV within Information Theory.
Properties of differential entropy:
◮ Differential entropy is invariant under translations:
h(X + c) = h(X)
◮ Scaling:
h(a · X) = h(X) + log2(|a|);  h(A · X) = h(X) + log2(|det(A)|) for a random vector X and matrix A
◮ For a given variance σ², a Gaussian RV X with variance σ²_X = σ² and any other RV Y with variance σ²_Y = σ² satisfy h(X) ≥ h(Y).
◮ For a Gaussian RV X with variance σ²_X:
h(X) = (1/2) · log2(2πe·σ²_X)
Mutual information for continuous RRVV
Mutual information of two continuous RRVV X and Y:
I(X; Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) · log2(f_X(x|y)/f_X(x)) dx dy
f_{X,Y}(x, y) is the joint pdf of X and Y, and f_X(x|y) is the conditional pdf of X given Y = y.
Properties:
◮ Symmetry, I(X; Y) = I(Y; X).
◮ Non-negative, I(X; Y) ≥ 0.
◮ I(X; Y) = h(X) − h(X|Y).
◮ I(X; Y) = h(Y) − h(Y|X).
h(X|Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) · log2(1/f_X(x|y)) dx dy
This is the conditional differential entropy of X given Y.
AWGN channel capacity theorem
Continuous channel capacity
Consider a Gaussian discrete memoryless channel, described as follows:
◮ x(t) is a stationary stochastic process, with m_x = 0 and bandwidth W_x = B Hz.
◮ The process is sampled with sampling period T_s = 1/(2B), and the X_k = x(k · T_s) are thus continuous RRVV ∀ k, with E[X_k] = 0.
◮ A RV X_k is transmitted every T_s seconds over a noisy channel with bandwidth B, during a total of T seconds (n = 2BT total samples).
◮ The channel is AWGN, adding noise samples described by RRVV N_k with m_n = 0 and S_n(f) = N_0/2, so that σ_n² = N_0·B.
◮ The received samples are statistically independent RRVV, described as Y_k = X_k + N_k.
◮ The cost function for any maximization of the mutual information is the signal power E[X_k²] = S.
Continuous channel capacity
The channel capacity is defined as
C = max_{f_Xk(x)} { I(X_k; Y_k) : E[X_k²] = S }
◮ I(X_k; Y_k) = h(Y_k) − h(Y_k|X_k) = h(Y_k) − h(N_k)
◮ The maximum is only reached if h(Y_k) is maximized.
◮ This only happens if f_Xk(x) is Gaussian!
◮ Therefore, C = I(X_k; Y_k) with X_k Gaussian and E[X_k²] = S.
E[Y_k²] = S + σ_n², so h(Y_k) = (1/2) · log2(2πe·(S + σ_n²)).
h(N_k) = (1/2) · log2(2πe·σ_n²).
C = I(X_k; Y_k) = h(Y_k) − h(N_k) = (1/2) · log2(1 + S/σ_n²) bits/channel use.
Finally, C (bits/s) = (n/T) · C (bits/channel use):
C = B · log2(1 + S/(N_0·B)) bits/s; note that n/T = 1/T_s channel uses/s.
Shannon-Hartley theorem
The Shannon-Hartley theorem states that the capacity of a bandlimited AWGN channel with bandwidth B and noise power spectral density N_0/2 is
C = B · log2(1 + S/(N_0·B)) bits/s
This is the highest possible information transmission rate over this analog communication channel, accomplished with arbitrarily small error probability.
Capacity increases (almost) linearly with B, whereas S determines only a logarithmic increase.
◮ Increasing the available bandwidth has a far larger impact on capacity than increasing the transmission power.
The bandlimited, power-constrained AWGN channel is a very convenient model for real-world communications.
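A minimal sketch of the Shannon-Hartley formula (the bandwidth, noise density and SNR values are illustrative assumptions):

```python
import math

def awgn_capacity(B, S, N0):
    """C = B · log2(1 + S/(N0·B)) in bits/s for the bandlimited AWGN channel."""
    return B * math.log2(1.0 + S / (N0 * B))

B = 1e6                 # 1 MHz bandwidth (assumed)
N0 = 1e-12              # noise power spectral density in W/Hz (assumed)
S = 100 * N0 * B        # signal power chosen so that S/(N0·B) = 100 (20 dB SNR)
print(awgn_capacity(B, S, N0))   # ≈ 6.66e6 bits/s, i.e. B·log2(101)
```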
Implications of the channel capacity theorem
Consider an ideal system where R_b = C.
S = E_b · C, where E_b is the average bit energy.
C/B = log2(1 + (E_b/N_0) · (C/B))  →  E_b/N_0 = (2^{C/B} − 1) / (C/B)
If we represent the spectral efficiency η = R_b/B as a function of E_b/N_0, the previous expression is an asymptotic curve on such a plane that marks the border between the reliable zone and the unreliable zone.
◮ This curve helps us identify the parameter set for a communication system so that it may achieve reliable transmission with a given quality (measured in terms of a limited, maximum error rate).
When B → ∞, E_b/N_0|_∞ = ln(2) = −1.6 dB.
◮ This limit is known as the “Shannon limit” for the AWGN channel.
◮ The capacity in this limit is C_∞ = (S/N_0) · log2(e).
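A minimal sketch of the boundary curve and its limit (the spectral-efficiency values are illustrative):

```python
import math

def min_ebn0_db(eta):
    """Minimum Eb/N0 (dB) for reliable transmission at spectral efficiency eta = Rb/B."""
    return 10.0 * math.log10((2.0 ** eta - 1.0) / eta)

for eta in (2.0, 1.0, 0.5, 0.1, 1e-6):
    print(eta, round(min_ebn0_db(eta), 2))
# As eta -> 0 the bound tends to 10·log10(ln 2) ≈ -1.59 dB: the Shannon limit
```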
Channel capacity tradeoffs
Figure 9: Working regions as determined by the Shannon-Hartley theorem (source: www.gaussianwaves.com). No channel coding applied.
The diagram illustrates possible tradeoffs, involving E_b/N_0, R_b/B and P_b.
Note that the spectral efficiency of the modulations is given for optimal pulse shaping.
Channel capacity tradeoffs
The lower, left-hand side of the plot is the so-called power-limited region.
◮ There, the E_b/N_0 is very poor and we have to sacrifice spectral efficiency to get a given transmission quality (P_b).
◮ An example of this is deep-space communications, where the received SNR is extremely low due to the huge free-space losses in the link. The only way to get a reliable transmission is to drop the data rate to very low values.
The upper, right-hand side of the plot is the so-called bandwidth-limited region.
◮ There, the desired spectral efficiency R_b/B for fixed B (desired data rate) is traded off against unconstrained transmission power (unconstrained E_b/N_0), under a given P_b.
◮ An example of this would be a terrestrial DVB transmitting station, where R_b/B is fixed (standardized), and where the transmitting power is only limited by regulatory or technological constraints.
Bridging towards channel coding
Discrete-input Continuous-output memoryless channels
We have considered, on the one hand, discrete input/output memoryless channels, and, on the other hand, purely Gaussian channels without paying attention to the input source.
The normal mode of operation in digital communications is characterized by a discrete source (digital modulation, with symbols s_i) that goes through an AWGN channel (which adds noise samples n).
Therefore, it makes sense to consider real digital communication processes as instances of Discrete-input Continuous-output Memoryless Channels (DCMC).
It may be demonstrated that, for an M-ary digital modulation going through an AWGN channel, with outputs r = s_i + n, the channel capacity is
C = log2(M) − T(E_b/N_0) bits per channel use
where T(·) is a function that has to be evaluated numerically.
Continuous-output vs Discrete-output capacity
[Plot: spectral efficiency (bps/Hz) vs E_b/N_0 (dB), comparing DCMC BPSK capacity, hard-decided (DMC) BPSK capacity, and AWGN capacity.]
Figure 10: Capacity of BPSK DCMC against hard decided BPSK.
For low E_b/N_0, there is a higher capacity available for the DCMC.
As E_b/N_0 grows, the DCMC and DMC capacities converge.
There is around 1–2 dB to be gained with the DCMC in the E_b/N_0 range of interest.
Channel capacity tradeoffs and channel coding
Guaranteeing a given target P_b further limits the attainable zone in the spectral efficiency/signal-to-noise ratio plane, depending on the framework chosen.
For fixed spectral efficiency (fixed R_b/B), we move along a horizontal line where we manage the P_b versus E_b/N_0 tradeoff.
For fixed signal-to-noise ratio (fixed E_b/N_0), we move along a vertical line where we manage the P_b versus R_b/B tradeoff.
Figure 11: Working regions for given transmission schemes. Capacities are for DCMC (source: www.comtechefdata.com).
Hard demodulation
Hard decision maximum likelihood (ML) demodulation:
◮ Apply the decision region criterion.
◮ Decided symbols are converted back into mapped bits.
◮ The hard demodulator thus outputs a sequence of decided bits.
Hard decision ML demodulation fails to capture the extra information contained in the received constellation:
◮ The distance from the received symbol to the hard-decided symbol may provide a measure of the uncertainty and reliability of the decision.
Figure 12: 16-QAM received constellation with noise.
How could we take advantage of the available extra information?
Hard demodulation
We may use basic information about the channel and about the transmitted modulation to build new, significant demodulation values.
Consider a standard modulation and an AWGN channel,
r = s_i + n
where s_i is the symbol transmitted in the signal space and n is the AWGN sample corresponding to the vector channel.
Test hypothesis for demodulation:
P(b_k = b|r) = f_r(r|b_k = b) · P(b_k = b) / f_r(r)
P(b_k = b|r) = (P(b_k = b)/f_r(r)) · Σ_{s_j : b_k = b} f_r(r|s_j)
             = (P(b_k = b)/f_r(r)) · Σ_{s_j : b_k = b} (1/√(2πσ_n²)) · e^{−|s_j − r|²/(2σ_n²)}
Note that fr (r) does not depend on the hypothesis bk = b, b = 0, 1.
Log-Likelihood Ratios
When input bits are equiprobable, we may resort to the quotient
λ_k = log [ P(b_k = 1|r) / P(b_k = 0|r) ] = log [ f_r(r|b_k = 1) / f_r(r|b_k = 0) ]
    = log [ Σ_{s_j : b_k = 1} e^{−|s_j − r|²/(2σ_n²)} / Σ_{s_j : b_k = 0} e^{−|s_j − r|²/(2σ_n²)} ]
Note that f_r(r|b_k = b) is a likelihood function.
λ_k is called the Log-Likelihood Ratio (LLR) on bit b_k.
◮ Its sign could be used in hard decision to decide whether b_k is more likely to be a 1 (λ_k > 0) or a 0 (λ_k < 0).
◮ The modulus of λ_k could be used as a measure of the reliability of the hypothesis on b_k.
◮ A larger |λ_k| means greater reliability in a possible decision.
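A minimal sketch of this computation for a Gray-mapped 4-PAM constellation over AWGN (the symbol levels, bit mapping, received sample and noise variance are all assumptions made for the example):

```python
import numpy as np

symbols = np.array([-3.0, -1.0, 1.0, 3.0])           # 4-PAM levels (assumed)
bits = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])    # Gray mapping (b1, b0) per symbol

def bit_llrs(r, sigma2_n):
    """Exact LLRs λ_k for each mapped bit, given received sample r and noise variance."""
    metrics = np.exp(-np.abs(symbols - r) ** 2 / (2 * sigma2_n))  # per-symbol likelihoods
    llrs = []
    for k in range(bits.shape[1]):
        num = metrics[bits[:, k] == 1].sum()   # symbols whose k-th bit is 1
        den = metrics[bits[:, k] == 0].sum()   # symbols whose k-th bit is 0
        llrs.append(float(np.log(num / den)))
    return llrs

print(bit_llrs(r=0.4, sigma2_n=0.5))
# The sign of each LLR gives the hard decision; its magnitude gives the reliability.
```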
Soft demodulation
We have seen that basic knowledge about the transmission system and the channel may lead to the calculation of probabilistic (soft) values related to a possible decision, under suitable hypotheses.
The calculated LLR values for the received bit stream may provide valuable insight and information for the next subsystem in the receiver chain: the channel decoder.
If the input bits are not equiprobable, we may still use a soft-value characterization, in the form of a quotient of a posteriori demodulation values:
log [ P(b_k = 1|r) / P(b_k = 0|r) ] = log [ P(b_k = 1) / P(b_k = 0) ] + log [ f_r(r|b_k = 1) / f_r(r|b_k = 0) ]
where the LLR is updated with the a priori probabilities P(b_k = b).
We are going to see in the next lesson how soft demodulated values play
a key role in the most powerful channel encoding/decoding methods.
Conclusions
Information Theory represents a cutting-edge research field with applications in communications, artificial intelligence, data mining, machine learning, robotics...
We have seen three fundamental results from Shannon's 1948 seminal work, which constitute the foundations of all modern communications.
◮ The source coding theorem, which states the limits and possibilities of lossless data compression.
◮ The noisy-channel coding theorem, which states the need for channel coding techniques to achieve a given performance using constrained resources. It establishes the asymptotic possibility of error-free transmission over discrete-input discrete-output noisy channels.
◮ The Shannon-Hartley theorem, which establishes the absolute (asymptotic) limits for error-free transmission over AWGN channels, and describes the different tradeoffs involved among the given resources.
Conclusions
All these results build the attainable working zone for practical and feasible communication systems, managing and trading off constrained resources under a given target performance (BER).
◮ The η = R_b/B against E_b/N_0 plane (under a target BER) constitutes the playground for designing and bringing into practice any communication standard.
◮ Any movement over the plane has a direct impact on business and revenue in the telco domain.
When addressing practical designs in communications, these results and limits are not always heeded explicitly, but they underlie all of them.
◮ There are lots of common-practice and common-wisdom rules of thumb in the domain, stating what to use when (regarding modulations, channel encoders and so on).
◮ Nevertheless, optimizing the designs so as to profit as much as possible from all the resources at hand requires making these limits explicit.
Bibliography I
[1] J. M. Cioffi, Digital Communications - coding (course). Stanford University, 2010. [Online]. Available:
http://web.stanford.edu/group/cioffi/book
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New Jersey: John Wiley & Sons, Inc., 2006.
[3] D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. [Online]. Available:
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
[4] S. Haykin, Communications Systems. New York: John Wiley & Sons, Inc., 2001.
[5] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, 1948. [Online]. Available: http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf