Block 2: Introduction to Information Theory
Francisco J. Escribano
October 25, 2016
Table of contents
1 Motivation
2 Entropy
3 Source coding
4 Mutual information
5 Discrete memoryless channels
6 Entropy and mutual information for continuous RRVV
7 AWGN channel capacity theorem
8 Bridging towards channel coding
9 Conclusions
10 References
Motivation
Information Theory is a discipline established during the second half of the 20th century.
It relies on solid mathematical foundations [1, 2, 3, 4].
It tries to address two basic questions:
◮ To what extent can we compress data for a more efficient usage of the limited communication resources? → Entropy
◮ What is the largest possible data transfer rate for given resources and conditions? → Channel capacity
Key concepts for Information Theory are thus entropy (H (X)) and
mutual information (I (X; Y)).
◮ X, Y are random variables (RRVV) of some kind.
◮ Information will naturally be characterized by means of RRVV.
Motivation
Up to the 1940s, it was common wisdom in telecommunications that the error rate increased with increasing data rate.
◮ Claude Shannon demonstrated that error-free transmission may be possible under certain conditions.
Information Theory provides strict bounds for any communication system.
◮ Maximum compression → H(X); or minimum I(X; X̂), with controlled distortion.
◮ Maximum transfer rate → maximum I(X; Y).
◮ Any given communication system works between these limits.
The mathematics behind these results is not always constructive, but it provides guidelines to design algorithms that improve communications given a set of available resources.
◮ The resources in this context are parameters such as available transmission power, available bandwidth, signal-to-noise ratio and the like.
[Block diagram: source X → compression → X̂ → channel → Y]
Entropy
Consider a discrete memoryless source that emits, at a given symbol rate, symbols from a given set, each chosen randomly and independently of the previous and subsequent ones:
ζ = {s_0, · · · , s_{K−1}},  P(S = s_k) = p_k,  k = 0, 1, · · · , K − 1;  K is the source radix.
The information quantity of a symbol is a random variable defined as
I(s_k) = log2(1/p_k)
with properties:
◮ I(s_k) = 0 if p_k = 1
◮ I(s_k) > I(s_i) if p_k < p_i
◮ I(s_k) ≥ 0, for 0 ≤ p_k ≤ 1
◮ I(s_l, s_k) = I(s_l) + I(s_k) (for independently emitted symbols)
The definition is conventional but also convenient: it relates uncertainty and information.
Entropy
The source entropy is a measure of its “information content”, and it is defined as
H(ζ) = E_{p_k}[I(s_k)] = Σ_{k=0}^{K−1} p_k · I(s_k) = Σ_{k=0}^{K−1} p_k · log2(1/p_k)
Its bounds are 0 ≤ H(ζ) ≤ log2(K):
◮ If p_j = 1 and p_k = 0 for all k ≠ j, then H(ζ) = 0.
◮ If p_k = 1/K for k = 0, · · · , K − 1, then H(ζ) = log2(K).
There is no potential information (uncertainty) in deterministic (“degenerate”) sources.
Entropy is conventionally measured in “bits”.
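A minimal Python sketch of this computation (the pmf values below are illustrative, not taken from the slides):

```python
import math

def entropy(pmf):
    """H(ζ) = Σ p_k · log2(1/p_k), in bits, for a pmf given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))            # 1.0 bit: uniform binary source
print(entropy([0.25] * 4))            # 2.0 bits = log2(K) for K = 4 equiprobable symbols
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 bits: deterministic ("degenerate") source
```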
Entropy
E.g., a memoryless binary source: H(p) = −p · log2(p) − (1 − p) · log2(1 − p).
Figure 1: Entropy of a memoryless binary source.
Source coding
The field of source coding addresses the issues related to handling the output data from a given source, from the point of view of Information Theory.
One of the main issues is data compression, its theoretical limits and the related practical algorithms.
◮ The most important prerequisite in communications at the PHY is to keep data integrity → any transformation has to be fully invertible (lossless compression).
◮ Lossy compression is a key field at other communication levels: for example, compressing a video stream to the MPEG-2 standard to create a transport frame.
◮ In other related fields, like cryptography, some non-invertible algorithms for data compression are of utmost interest (e.g. hash functions, to create almost-unique tokens).
Source coding
We may choose to represent the data from the source by assigning to each symbol s_k a corresponding codeword (binary, in our case).
The aim of this source coding process is to try to represent the source symbols more efficiently.
◮ We focus on variable-length, invertible binary codes → the codewords are unique blocks of binary symbols of length l_k, one for each symbol s_k.
◮ The correspondence codeword ↔ original symbol constitutes a code.
Average codeword length:
L = E[l_k] = Σ_{k=0}^{K−1} p_k · l_k
Coding efficiency:
η = L_min / L ≤ 1
Shannon’s source coding theorem
This theorem establishes the limits for lossless data compression [5].
◮ N iid (independent and identically distributed) random variables with entropy H(ζ) each can be compressed into N · H(ζ) bits with negligible risk of information loss as N → ∞. If they are compressed into fewer than N · H(ζ) bits, it is certain that some information will be lost.
In practical terms, for a single random variable, this means
L_min ≥ H(ζ)
And, therefore, the coding efficiency can be defined as
η = H(ζ) / L
Like other results in Information Theory, this theorem provides the limits, but does not tell us how to actually reach them.
Example of source coding: Huffman coding
Huffman coding provides a practical algorithm to perform source coding within the limits shown.
It is an instance of a class of codes called prefix codes:
◮ No binary word within the codeset is a prefix of any other one.
Properties:
◮ Unique coding.
◮ Instantaneous decoding.
◮ The lengths l_k meet the Kraft-McMillan inequality [2]:
Σ_{k=0}^{K−1} 2^{−l_k} ≤ 1.
◮ The average length L of the code is bounded by
H(ζ) ≤ L < H(ζ) + 1, with H(ζ) = L ↔ p_k = 2^{−l_k}.
Meeting the Kraft-McMillan inequality guarantees that a prefix code
with the given lk can be constructed.
Huffman coding
Algorithm to perform Huffman coding (a code sketch follows the list):
1 List the symbols s_k in a column, in order of decreasing probabilities.
2 Combine the last 2 symbols: their probabilities are added into a new dummy compound value/symbol.
3 Reorder the probability set in an adjacent column, placing the new dummy symbol probability as high as possible and retiring the values involved.
4 In the process of moving probabilities to the right, keep the values of the unaffected symbols (making room for the compound value if needed), but assign a 0 to one of the symbols affected and a 1 to the other (top or bottom, but always keep the same criterion along the process).
5 Start the process afresh in the new column.
6 When only two probabilities remain, assign the last 0 and 1 and stop.
7 Write the binary codeword corresponding to each original symbol by tracing back the trajectory of the symbol and the dummy symbols it takes part in, from the last column towards the initial one.
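A minimal Python sketch of this procedure, assuming the source is given as a dict of symbol probabilities (the symbols and values are illustrative); it replaces the explicit columns with a heap of partial groups, assigning 0 to one group and 1 to the other at each combination step:

```python
import heapq
import math

def huffman_code(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # Heap entries: (probability, tie-breaker, [(symbol, partial codeword), ...])
    heap = [(p, i, [(s, "")]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # the two least probable entries
        p1, _, group1 = heapq.heappop(heap)
        group0 = [(s, "0" + c) for s, c in group0]   # prepend 0 to one group...
        group1 = [(s, "1" + c) for s, c in group1]   # ...and 1 to the other
        heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
        counter += 1
    return dict(heap[0][2])

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
code = huffman_code(probs)
H = sum(p * math.log2(1 / p) for p in probs.values())
L = sum(probs[s] * len(code[s]) for s in probs)                 # average length
eta = H / L                                                     # efficiency
var = sum(probs[s] * (len(code[s]) - L) ** 2 for s in probs)    # length variance
print(code, H, L, eta, var)
```

The last four quantities are exactly the ones used on the next slide to characterize the resulting code.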
Huffman coding example
Figure 2: Example of Huffman coding: assigning final codeword patterns proceeds from right to left.
To characterize the resulting code, it is important to calculate:
H(ζ);  L = Σ_{k=0}^{K−1} p_k · l_k;  η = H(ζ)/L;  σ² = Σ_{k=0}^{K−1} p_k · (l_k − L)²
Managing efficiency in source coding
From the point of view of data communication (sequences of data symbols), it is useful to define the n-th extension of the source ζ:
◮ A symbol of the n-th extension of the source is built by taking n successive symbols from it (symbols in ζ^n).
◮ If the source is iid, for s_i ∈ ζ, the new extended source ζ^n is characterized by the probabilities:
P(σ = (s_1 · · · s_n) ∈ ζ^n) = Π_{i=1}^{n} P(s_i ∈ ζ)
◮ Given that the sequence s_i in σ ∈ ζ^n is independent and identically distributed (iid) for any n, the entropy of this new source is:
H(ζ^n) = n · H(ζ)
By applying Huffman coding to ζ^n, we derive a code with average length L characterized by:
H(ζ^n) ≤ L < H(ζ^n) + 1
Managing efficiency in source coding
Considering the equivalent codeword length per original symbol,
L_s = L/n  →  H(ζ) ≤ L_s < H(ζ) + 1/n
The final efficiency turns out to be
η = H(ζ^n)/L = n·H(ζ)/L = H(ζ)/L_s
Efficiency thus tends to 1 as n → ∞, as shown by the previous inequality.
The method of the n-th extension of the source, nevertheless, is rendered impractical for large n, due to:
◮ Increasing delay.
◮ Increasing coding and decoding complexity.
Joint entropy
We extend the concept of entropy to 2 RRVV.
◮ 2 or more RRVV are needed when analyzing communication channels from the point of view of Information Theory.
◮ These RRVV can also be seen as a random vector.
◮ The underlying concept is the characterization of channel input vs channel output, and what we can learn about the former by observing the latter.
Joint entropy of 2 RRVV:
H(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(1/p(x, y)) = E_{p(x,y)}[log2(1/P(X, Y))]
Properties:
H(X, Y) ≥ max[H(X), H(Y)];  H(X, Y) ≤ H(X) + H(Y)
Conditional entropy
Conditional entropy of 2 RRVV:
H(Y|X) = Σ_{x∈X} p(x) · H(Y|X = x)
       = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) · log2(1/p(y|x))
       = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(1/p(y|x)) = E_{p(x,y)}[log2(1/P(Y|X))]
H(Y|X) is a measure of the uncertainty in Y once X is known.
Properties:
H(Y|Y) = 0;  H(Y|X) = H(Y) if X and Y are independent
Chain rule
Relation between joint and conditional entropy of 2 RRVV:
H (X, Y) = H (X) + H (Y|X) .
The expression points towards the following common wisdom result:
“joint knowledge about X and Y is the knowledge about X plus the
information in Y not related to X”.
Proof:
p(x, y) = p(x) · p(y|x);
log2(p(x, y)) = log2(p(x)) + log2(p(y|x));
E_{p(x,y)}[log2(P(X, Y))] = E_{p(x,y)}[log2(P(X))] + E_{p(x,y)}[log2(P(Y|X))];
changing signs and identifying each term with the corresponding entropy yields the result.
Corollary: H (X, Y|Z) = H (X|Z) + H (Y|X, Z).
As another consequence: H (Y|X) ≤ H (Y).
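A small numerical check of the chain rule, using a hypothetical 2×2 joint pmf (the probability values are made up for illustration):

```python
import numpy as np

p_xy = np.array([[0.30, 0.20],     # hypothetical joint pmf p(x, y); rows: x, columns: y
                 [0.10, 0.40]])

def H(p):
    """Entropy in bits of a pmf given as an array of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_x = p_xy.sum(axis=1)                                   # marginal of X
H_joint = H(p_xy.ravel())                                # H(X, Y)
H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i])           # H(Y|X) = Σ p(x)·H(Y|X = x)
                  for i in range(len(p_x)))
print(H_joint, H(p_x) + H_y_given_x)                     # both ≈ 1.85 bits
```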
Relative entropy
H(X) measures the quantity of information needed to describe the RV X on average.
The relative entropy D(p‖q) measures the increase in information needed to describe X when it is characterized by means of a distribution q(x) instead of p(x):
X described with p(x) → H(X)
X described with q(x) → H(X) + D(p‖q)
Definition of relative entropy (aka Kullback-Leibler divergence, or improperly “distance”):
D(p‖q) = Σ_{x∈X} p(x) · log2(p(x)/q(x)) = E_{p(x)}[log2(P(X)/Q(X))]
Note the conventions: lim_{x→0} x·log(x) = 0;  0·log(0/0) = 0;  p·log(p/0) = ∞.
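A minimal sketch of this computation for two pmfs defined over the same alphabet (the distributions p and q below are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p‖q) in bits; applies the conventions 0·log(0/q) = 0 and p·log(p/0) = ∞."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # ≈ 0.74 and 0.53: D is not symmetric
```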
Relative entropy and mutual information
Properties of relative entropy:
1 D(p‖q) ≥ 0.
2 D(p‖q) = 0 ↔ p(x) = q(x).
3 It is not symmetric. Therefore, it is not a true distance in the metric sense.
Mutual information of 2 RRVV:
I(X; Y) = H(X) − H(X|Y)
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2(p(x, y)/(p(x)·p(y)))
        = D(p(x, y) ‖ p(x)·p(y)) = E_{p(x,y)}[log2(P(X, Y)/(P(X)·P(Y)))]
The mutual information between X and Y is the information in X, minus the information in X not related to Y.
Mutual information
Properties of mutual information:
1 Symmetry: I (X; Y) = I (Y; X).
2 Non-negative: I (X; Y) ≥ 0.
3 I (X; Y) = 0 ↔ X and Y are independent.
[Venn-style diagram relating H(X), H(Y), H(X, Y), H(X|Y), H(Y|X) and I(X; Y).]
Figure 3: Relationship between entropy and mutual information.
A glimpse of lossy compression
Now that we have defined mutual information, we can examine lossy compression under the so-called Rate-Distortion Theory.
A source X is compressed into a new source X̂, a process characterized by the probabilities
p(x);  p(x̂, x) = p(x̂|x)·p(x)
The quality of the compression is measured according to a distortion parameter d(x, x̂) and its expected value D:
D = E_{p(x̂,x)}[d(x, x̂)] = Σ_x Σ_x̂ d(x, x̂)·p(x̂|x)·p(x)
d(x, x̂) is any convenient distance measurement:
◮ The squared Euclidean distance, d(x, x̂) = |x − x̂|².
◮ The Hamming distance, d(x, x̂) = 1 if x ≠ x̂, and 0 otherwise.
A glimpse of lossy compression
In lossy compression, we want X̂ to be represented by the lowest number of bits, and this means finding a distribution p(x̂|x) such that
R = min_{p(x̂|x)} I(X; X̂) = min_{p(x̂|x)} [H(X) − H(X|X̂)]
R is the rate in bits of the compressed source.
The source is compressed to an entropy lower than H(X): since the process is lossy, H(X|X̂) ≠ 0.
The minimum is trivially 0 when X and X̂ are fully unrelated, so the minimization is performed subject to D ≤ D*, for a maximum admissible distortion D*.
Conversely, the problem can be addressed as the minimization of the distortion D subject to I(X; X̂) ≤ R*, for a maximum admissible rate R* < H(X).
Discrete memoryless channels
Discrete memoryless channels
Communication channel:
◮ Input/output system where the output is a probabilistic function of the input.
A DMC (Discrete Memoryless Channel) consists of:
◮ An input alphabet X = {x_0, x_1, · · · , x_{J−1}}, corresponding to RV X.
◮ An output alphabet Y = {y_0, y_1, · · · , y_{K−1}}, corresponding to RV Y, a noisy version of RV X.
◮ A set of transition probabilities linking input and output:
{p(y_k|x_j)}, k = 0, 1, · · · , K − 1; j = 0, 1, · · · , J − 1, with p(y_k|x_j) = P(Y = y_k|X = x_j)
◮ X and Y are finite and discrete.
◮ The channel is memoryless, since the output symbol depends only on the current input symbol.
Discrete memoryless channels
Channel matrix P, of size J × K:
P = [ p(y_0|x_0)      p(y_1|x_0)      · · ·  p(y_{K−1}|x_0)
      p(y_0|x_1)      p(y_1|x_1)      · · ·  p(y_{K−1}|x_1)
      ⋮
      p(y_0|x_{J−1})  p(y_1|x_{J−1})  · · ·  p(y_{K−1}|x_{J−1}) ]
The channel is said to be symmetric when each column is a permutation of any other, and each row is a permutation of any other.
Important property (each row of P sums to one):
Σ_{k=0}^{K−1} p(y_k|x_j) = 1, ∀ j = 0, 1, · · · , J − 1.
Discrete memoryless channels
Output Y is probabilistically determined by the input (a priori) probabilities and the channel matrix, following
p_X = (p(x_0), p(x_1), · · · , p(x_{J−1})), with p(x_j) = P(X = x_j)
p_Y = (p(y_0), p(y_1), · · · , p(y_{K−1})), with p(y_k) = P(Y = y_k)
p(y_k) = Σ_{j=0}^{J−1} p(y_k|x_j) · p(x_j), ∀ k = 0, 1, · · · , K − 1
Therefore, p_Y = p_X · P.
When J = K and y_j is the correct choice when sending x_j, we can calculate the average symbol error probability as
P_e = Σ_{j=0}^{J−1} p(x_j) · Σ_{k=0, k≠j}^{K−1} p(y_k|x_j)
The probability of correct transmission is 1 − P_e.
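A minimal sketch of these computations on a hypothetical channel matrix (here a binary symmetric channel with crossover probability 0.1 and equiprobable inputs, both chosen for illustration):

```python
import numpy as np

P = np.array([[0.9, 0.1],      # channel matrix: rows are p(y_k | x_j)
              [0.1, 0.9]])
p_x = np.array([0.5, 0.5])     # a priori input distribution p_X

p_y = p_x @ P                  # output distribution p_Y = p_X · P
# Average symbol error probability (J = K, y_j is correct when x_j was sent)
Pe = sum(p_x[j] * (P[j].sum() - P[j, j]) for j in range(len(p_x)))
print(p_y, Pe)                 # [0.5 0.5] and 0.1
```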
Discrete memoryless channels
Example of a discrete memoryless channel: modulation with hard decision.
Figure 4: 16-QAM transmitted constellation.
Figure 5: 16-QAM received constellation with noise.
Discrete memoryless channels
Example of a non-symmetric channel → the binary erasure channel, typical of storage systems.
◮ Reading data in storage systems can also be modeled as sending information through a channel with given probabilistic properties.
◮ An erasure marks complete uncertainty about the symbol read.
◮ There are methods to recover from erasures, based on the principles of Information Theory.
[Diagram: inputs x0 = 0, x1 = 1; outputs y0 = 0, y1 = ε (erasure), y2 = 1; each input is received correctly with probability 1 − p and erased with probability p.]
Figure 6: Diagram showing the binary erasure channel.
P = [ 1−p   p   0
      0     p   1−p ]
Channel capacity
Mutual information depends on P and p_X. Characterizing the possibilities of the channel requires removing the dependency on p_X.
Channel capacity is defined as
C = max_{p_X} I(X; Y)
◮ It is the maximum average mutual information enabled by the channel, in bits per channel use, and the best we can get out of it in terms of reliable information transfer.
◮ It only depends on the channel transition probabilities P.
◮ If the channel is symmetric, the distribution p_X that maximizes I(X; Y) is the uniform one (equiprobable symbols).
Channel coding is a process where controlled redundancy is added to protect data integrity.
◮ A channel-encoded information sequence is represented by a block of n bits, encoded from an information sequence represented by k ≤ n bits.
◮ The so-called code rate is calculated as R = k/n ≤ 1.
Noisy-channel coding theorem
Consider a discrete source ζ, emitting symbols with period T_s.
◮ The binary information rate of such a source is H(ζ)/T_s (bits/s).
Consider a discrete memoryless channel, used to send coded data every T_c seconds.
◮ The maximum possible data transfer rate would be C/T_c (bits/s).
The noisy-channel coding theorem states the following [5]:
◮ If H(ζ)/T_s ≤ C/T_c, there exists a coding scheme that guarantees error-free transmission (i.e. P_e arbitrarily small).
◮ Conversely, if H(ζ)/T_s > C/T_c, the communication cannot be made reliable (i.e. we cannot have a bounded P_e, as small as desired).
Please note that again the theorem is asymptotic, and not constructive:
it does not say how to actually reach the limit.
Noisy-channel coding theorem
Example: a binary symmetric channel.
Consider a binary source ζ = {0, 1}, with equiprobable symbols.
◮ H(ζ) = 1 info bit per source symbol.
◮ The source works at a rate of 1/T_s symbols/s, i.e. H(ζ)/T_s info bits/s.
Consider an encoder with rate k/n info/coded bits, and 1/T_c channel uses/s.
◮ Note that k/n = T_c/T_s.
The maximum achievable rate is C/T_c coded bits/s.
If H(ζ)/T_s = 1/T_s ≤ C/T_c, we could find a coding scheme so that P_e is made arbitrarily small.
◮ This means that an appropriate coding scheme has to meet k/n ≤ C in order to exploit the possibilities of the noisy-channel coding theorem.
Noisy-channel coding theorem
The theorem also states that, if a bit error probability of P_b is acceptable, coding rates up to
R(P_b) = C / (1 − H(P_b))
are achievable. Rates greater than that cannot be achieved with the given bit error probability.
In a binary symmetric channel without noise (error probability p = 0), it can be demonstrated that
C = max_{p_X} I(X; Y) = 1 bit/channel use.
Figure 7: Diagram showing the error-free binary symmetric channel.
P = [ 1   0
      0   1 ]
Noisy-channel coding theorem
In a binary symmetric channel with error probability p ≠ 0,
C = 1 − H(p) = 1 − [p · log2(1/p) + (1 − p) · log2(1/(1 − p))] bits/channel use.
Figure 8: Diagram showing the binary symmetric channel BSC(p).
P = [ 1−p   p
      p     1−p ]
In the binary erasure channel with erasure probability p,
C = max_{p_X} I(X; Y) = 1 − p bits/channel use.
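A minimal sketch of both capacity formulas (the value p = 0.11 is just an illustrative operating point):

```python
import math

def bsc_capacity(p):
    """C = 1 − H(p) for the binary symmetric channel BSC(p), in bits/channel use."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def bec_capacity(p):
    """C = 1 − p for the binary erasure channel with erasure probability p."""
    return 1.0 - p

print(bsc_capacity(0.11), bec_capacity(0.11))   # ≈ 0.50 and 0.89 bits/channel use
```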
Entropy and mutual information for continuous RRVV
Differential entropy
Differential entropy or continuous entropy of a continuous RV X with pdf f_X(x) is defined as
h(X) = ∫_{−∞}^{∞} f_X(x) · log2(1/f_X(x)) dx
It does not measure an absolute quantity of information, hence the term differential (it can take values lower than 0).
The differential entropy of a continuous random vector X = (X_0, · · · , X_{N−1}) with joint pdf f_X(x) = f_X(x_0, · · · , x_{N−1}) is defined analogously, with the integral taken over all the components:
h(X) = ∫ f_X(x) · log2(1/f_X(x)) dx
Differential entropy
For a given variance value σ², the Gaussian RV exhibits the largest achievable differential entropy.
◮ This means the Gaussian RV has a special place in the domain of continuous RRVV within Information Theory.
Properties of differential entropy:
◮ Differential entropy is invariant under translations:
h(X + c) = h(X)
◮ Scaling:
h(a · X) = h(X) + log2(|a|);  h(A · X) = h(X) + log2(|det(A)|) for a random vector X and matrix A
◮ For a given variance σ², a Gaussian RV X with variance σ²_X = σ² and any other RV Y with variance σ²_Y = σ² satisfy h(X) ≥ h(Y).
◮ For a Gaussian RV X with variance σ²_X:
h(X) = (1/2) · log2(2πe·σ²_X)
Mutual information for continuous RRVV
Mutual information of two continuous RRVV X and Y:
I(X; Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) · log2(f_X(x|y)/f_X(x)) dx dy
f_{X,Y}(x, y) is the joint pdf of X and Y, and f_X(x|y) is the conditional pdf of X given Y = y.
Properties:
◮ Symmetry, I(X; Y) = I(Y; X).
◮ Non-negative, I(X; Y) ≥ 0.
◮ I(X; Y) = h(X) − h(X|Y).
◮ I(X; Y) = h(Y) − h(Y|X).
h(X|Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) · log2(1/f_X(x|y)) dx dy
This is the conditional differential entropy of X given Y.
AWGN channel capacity theorem
Continuous channel capacity
Consider a Gaussian discrete memoryless channel, described as follows:
◮ x(t) is a stationary stochastic process, with m_x = 0 and bandwidth W_x = B Hz.
◮ The process is sampled with sampling period T_s = 1/(2B), and the X_k = x(k · T_s) are thus continuous RRVV ∀ k, with E[X_k] = 0.
◮ A RV X_k is transmitted every T_s seconds over a noisy channel with bandwidth B, during a total of T seconds (n = 2BT total samples).
◮ The channel is AWGN, adding noise samples described by RRVV N_k with m_n = 0 and S_n(f) = N_0/2, so that σ_n² = N_0·B.
◮ The received samples are statistically independent RRVV, described as Y_k = X_k + N_k.
◮ The cost function for any maximization of the mutual information is the signal power E[X_k²] = S.
Continuous channel capacity
The channel capacity is defined as
C = max_{f_Xk(x)} { I(X_k; Y_k) : E[X_k²] = S }
◮ I(X_k; Y_k) = h(Y_k) − h(Y_k|X_k) = h(Y_k) − h(N_k)
◮ The maximum is only reached if h(Y_k) is maximized.
◮ This only happens if f_Xk(x) is Gaussian!
◮ Therefore, C = I(X_k; Y_k) with X_k Gaussian and E[X_k²] = S.
E[Y_k²] = S + σ_n², so h(Y_k) = (1/2) · log2(2πe·(S + σ_n²)).
h(N_k) = (1/2) · log2(2πe·σ_n²).
C = I(X_k; Y_k) = h(Y_k) − h(N_k) = (1/2) · log2(1 + S/σ_n²) bits/channel use.
Finally, C (bits/s) = (n/T) · C (bits/channel use):
C = B · log2(1 + S/(N_0·B)) bits/s; note that n/T = 1/T_s channel uses/s.
Shannon-Hartley theorem
The Shannon-Hartley theorem states that the capacity of a bandlimited AWGN channel with bandwidth B and noise power spectral density N_0/2 is
C = B · log2(1 + S/(N_0·B)) bits/s
This is the highest possible information transmission rate over this analog communication channel, accomplished with arbitrarily small error probability.
Capacity increases (almost) linearly with B, whereas S determines only a logarithmic increase.
◮ Increasing the available bandwidth has a far larger impact on capacity than increasing the transmission power.
The bandlimited, power-constrained AWGN channel is a very convenient model for real-world communications.
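A minimal sketch of the Shannon-Hartley formula (the bandwidth, noise density and SNR values are illustrative assumptions):

```python
import math

def awgn_capacity(B, S, N0):
    """C = B · log2(1 + S/(N0·B)) in bits/s for the bandlimited AWGN channel."""
    return B * math.log2(1.0 + S / (N0 * B))

B = 1e6                 # 1 MHz bandwidth (assumed)
N0 = 1e-12              # noise power spectral density in W/Hz (assumed)
S = 100 * N0 * B        # signal power chosen so that S/(N0·B) = 100 (20 dB SNR)
print(awgn_capacity(B, S, N0))   # ≈ 6.66e6 bits/s, i.e. B·log2(101)
```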
Implications of the channel capacity theorem
Consider an ideal system where R_b = C.
S = E_b · C, where E_b is the average bit energy.
C/B = log2(1 + (E_b/N_0) · (C/B))  →  E_b/N_0 = (2^{C/B} − 1) / (C/B)
If we represent the spectral efficiency η = R_b/B as a function of E_b/N_0, the previous expression is an asymptotic curve on such a plane that marks the border between the reliable zone and the unreliable zone.
◮ This curve helps us identify the parameter set for a communication system so that it may achieve reliable transmission with a given quality (measured in terms of a limited, maximum error rate).
When B → ∞, E_b/N_0|_∞ = ln(2) = −1.6 dB.
◮ This limit is known as the “Shannon limit” for the AWGN channel.
◮ The capacity in this limit is C_∞ = (S/N_0) · log2(e).
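A minimal sketch of the boundary curve and its limit (the spectral-efficiency values are illustrative):

```python
import math

def min_ebn0_db(eta):
    """Minimum Eb/N0 (dB) for reliable transmission at spectral efficiency eta = Rb/B."""
    return 10.0 * math.log10((2.0 ** eta - 1.0) / eta)

for eta in (2.0, 1.0, 0.5, 0.1, 1e-6):
    print(eta, round(min_ebn0_db(eta), 2))
# As eta -> 0 the bound tends to 10·log10(ln 2) ≈ -1.59 dB: the Shannon limit
```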
Channel capacity tradeoffs
Figure 9: Working regions as determined by the Shannon-Hartley theorem (source: www.gaussianwaves.com). No channel coding applied.
The diagram illustrates possible tradeoffs, involving E_b/N_0, R_b/B and P_b.
Note that the spectral efficiency of the modulations is given for optimal pulse shaping.
Channel capacity tradeoffs
The lower, left-hand side of the plot is the so-called power-limited region.
◮ There, the E_b/N_0 is very poor and we have to sacrifice spectral efficiency to get a given transmission quality (P_b).
◮ An example of this is deep-space communications, where the received SNR is extremely low due to the huge free-space losses in the link. The only way to get a reliable transmission is to drop the data rate to very low values.
The upper, right-hand side of the plot is the so-called bandwidth-limited region.
◮ There, the desired spectral efficiency R_b/B for fixed B (desired data rate) is traded off against unconstrained transmission power (unconstrained E_b/N_0), under a given P_b.
◮ An example of this would be a terrestrial DVB transmitting station, where R_b/B is fixed (standardized), and where the transmitting power is only limited by regulatory or technological constraints.
Bridging towards channel coding
Discrete-input Continuous-output memoryless channels
We have considered, on the one hand, discrete input/output memoryless channels, and, on the other hand, purely Gaussian channels without paying attention to the input source.
The normal mode of operation in digital communications is characterized by a discrete source (digital modulation, with symbols s_i) that goes through an AWGN channel (which adds noise samples n).
Therefore, it makes sense to consider real digital communication processes as instances of Discrete-input Continuous-output Memoryless Channels (DCMC).
It may be demonstrated that, for an M-ary digital modulation going through an AWGN channel, with outputs r = s_i + n, the channel capacity is
C = log2(M) − T(E_b/N_0) bits per channel use
where T(·) is a function that has to be evaluated numerically.
Continuous-output vs Discrete-output capacity
[Plot: spectral efficiency (bps/Hz) vs E_b/N_0 (dB), comparing DCMC BPSK capacity, hard-decided (DMC) BPSK capacity, and AWGN capacity.]
Figure 10: Capacity of BPSK DCMC against hard decided BPSK.
For low E_b/N_0, there is a higher capacity available for the DCMC.
As E_b/N_0 grows, the DCMC and DMC capacities converge.
There is around 1–2 dB to be gained with the DCMC in the E_b/N_0 range of interest.
Channel capacity tradeoffs and channel coding
Guaranteeing a given target P_b further limits the attainable zone in the spectral efficiency/signal-to-noise ratio plane, depending on the framework chosen.
For fixed spectral efficiency (fixed R_b/B), we move along a horizontal line where we manage the P_b versus E_b/N_0 tradeoff.
For fixed signal-to-noise ratio (fixed E_b/N_0), we move along a vertical line where we manage the P_b versus R_b/B tradeoff.
Figure 11: Working regions for given transmission schemes. Capacities are for DCMC (source: www.comtechefdata.com).
Hard demodulation
Hard decision maximum likelihood (ML) demodulation:
◮ Apply the decision region criterion.
◮ Decided symbols are converted back into mapped bits.
◮ The hard demodulator thus outputs a sequence of decided bits.
Hard decision ML demodulation fails to capture the extra information contained in the received constellation:
◮ The distance from the received symbol to the hard-decided symbol may provide a measure of the uncertainty and reliability of the decision.
Figure 12: 16-QAM received constellation with noise.
How could we take advantage of the available extra information?
Hard demodulation
We may use basic information about the channel and about the transmitted modulation to build new, significant demodulation values.
Consider a standard modulation and an AWGN channel,
r = s_i + n
where s_i is the symbol transmitted in the signal space and n is the AWGN sample corresponding to the vector channel.
Test hypothesis for demodulation:
P(b_k = b|r) = f_r(r|b_k = b) · P(b_k = b) / f_r(r)
P(b_k = b|r) = (P(b_k = b)/f_r(r)) · Σ_{s_j : b_k = b} f_r(r|s_j)
             = (P(b_k = b)/f_r(r)) · Σ_{s_j : b_k = b} (1/√(2πσ_n²)) · e^{−|s_j − r|²/(2σ_n²)}
Note that fr (r) does not depend on the hypothesis bk = b, b = 0, 1.
Log-Likelihood Ratios
When input bits are equiprobable, we may resort to the quotient
λ_k = log [ P(b_k = 1|r) / P(b_k = 0|r) ] = log [ f_r(r|b_k = 1) / f_r(r|b_k = 0) ]
    = log [ Σ_{s_j : b_k = 1} e^{−|s_j − r|²/(2σ_n²)} / Σ_{s_j : b_k = 0} e^{−|s_j − r|²/(2σ_n²)} ]
Note that f_r(r|b_k = b) is a likelihood function.
λ_k is called the Log-Likelihood Ratio (LLR) on bit b_k.
◮ Its sign could be used in hard decision to decide whether b_k is more likely to be a 1 (λ_k > 0) or a 0 (λ_k < 0).
◮ The modulus of λ_k could be used as a measure of the reliability of the hypothesis on b_k.
◮ A larger |λ_k| means greater reliability in a possible decision.
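A minimal sketch of this computation for a Gray-mapped 4-PAM constellation over AWGN (the symbol levels, bit mapping, received sample and noise variance are all assumptions made for the example):

```python
import numpy as np

symbols = np.array([-3.0, -1.0, 1.0, 3.0])           # 4-PAM levels (assumed)
bits = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])    # Gray mapping (b1, b0) per symbol

def bit_llrs(r, sigma2_n):
    """Exact LLRs λ_k for each mapped bit, given received sample r and noise variance."""
    metrics = np.exp(-np.abs(symbols - r) ** 2 / (2 * sigma2_n))  # per-symbol likelihoods
    llrs = []
    for k in range(bits.shape[1]):
        num = metrics[bits[:, k] == 1].sum()   # symbols whose k-th bit is 1
        den = metrics[bits[:, k] == 0].sum()   # symbols whose k-th bit is 0
        llrs.append(float(np.log(num / den)))
    return llrs

print(bit_llrs(r=0.4, sigma2_n=0.5))
# The sign of each LLR gives the hard decision; its magnitude gives the reliability.
```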
Soft demodulation
We have seen that basic knowledge about the transmission system and the channel may lead to the calculation of probabilistic (soft) values related to a possible decision, under suitable hypotheses.
The calculated LLR values for the received bit stream may provide valuable insight and information for the next subsystem in the receiver chain: the channel decoder.
If the input bits are not equiprobable, we may still use a soft-value characterization, in the form of a quotient of a posteriori demodulation values:
log [ P(b_k = 1|r) / P(b_k = 0|r) ] = log [ P(b_k = 1) / P(b_k = 0) ] + log [ f_r(r|b_k = 1) / f_r(r|b_k = 0) ]
where the LLR is updated with the a priori probabilities P(b_k = b).
We are going to see in the next lesson how soft demodulated values play
a key role in the most powerful channel encoding/decoding methods.
Conclusions
Information Theory represents a cutting-edge research field with applications in communications, artificial intelligence, data mining, machine learning, robotics...
We have seen three fundamental results from Shannon's 1948 seminal work, which constitute the foundations of all modern communications.
◮ The source coding theorem, which states the limits and possibilities of lossless data compression.
◮ The noisy-channel coding theorem, which states the need for channel coding techniques to achieve a given performance using constrained resources. It establishes the asymptotic possibility of error-free transmission over discrete-input discrete-output noisy channels.
◮ The Shannon-Hartley theorem, which establishes the absolute (asymptotic) limits for error-free transmission over AWGN channels, and describes the different tradeoffs involved among the given resources.
Conclusions
All these results build the attainable working zone for practical and feasible communication systems, managing and trading off constrained resources under a given target performance (BER).
◮ The η = R_b/B against E_b/N_0 plane (under a target BER) constitutes the playground for designing and bringing into practice any communication standard.
◮ Any movement over the plane has a direct impact on business and revenue in the telco domain.
When addressing practical designs in communications, these results and limits are not always heeded explicitly, but they underlie all of them.
◮ There are lots of common-practice and common-wisdom rules of thumb in the domain, stating what to use when (regarding modulations, channel encoders and so on).
◮ Nevertheless, optimizing the designs so as to profit as much as possible from all the resources at hand requires making these limits explicit.
Bibliography I
[1] J. M. Cioffi, Digital Communications - coding (course). Stanford University, 2010. [Online]. Available:
http://web.stanford.edu/group/cioffi/book
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New Jersey: John Wiley & Sons, Inc., 2006.
[3] D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. [Online]. Available:
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
[4] S. Haykin, Communications Systems. New York: John Wiley & Sons, Inc., 2001.
[5] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, 1948. [Online]. Available: http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf