The fourth session of the machine learning book reading series.
Video: https://youtu.be/Ab5RvD7ieFg
It gives a brief introduction to information theory (entropy, K-L divergence, mutual information, ...) and its application to loss functions, in particular the cross-entropy.
A reading of three books, as part of "Monday reading books on machine learning".
The first book, which serves as the guiding thread for the whole series:
Christopher Bishop; Pattern Recognition and Machine Learning, Springer-Verlag New York Inc, 2006
Parts of two other books will also be used, mainly:
Ian Goodfellow, Yoshua Bengio, Aaron Courville; Deep Learning, The MIT Press, 2016
and:
Ovidiu Calin; Deep Learning Architectures: A Mathematical Approach, Springer, 2020
Mrbml004: Introduction to Information Theory for Machine Learning
1. Monday reading books on Machine Learning
JAOUAD DABOUNOU
FST of Settat
Hassan 1st University
February 21, 2022
004 – Introduction
Probability Theory
2. Introduction
Starting Monday, January 31, a reading of three books, as part of "Monday reading books on machine learning".
The first book, which serves as the guiding thread for the whole series:
Christopher Bishop; Pattern Recognition and Machine Learning, Springer-Verlag New York Inc, 2006
Parts of two other books will also be used, mainly:
Ian Goodfellow, Yoshua Bengio, Aaron Courville; Deep Learning, The MIT Press, 2016
and:
Ovidiu Calin; Deep Learning Architectures: A Mathematical Approach, Springer, 2020
5. Consider two random variables X for Fruit and Y for Box.
X can take the values x1 = 'o' and x2 = 'a'.
Y can take the values y1 = 'r', y2 = 'b', y3 = 'br', y4 = 'v' and y5 = 'y' corresponding to the box color.
Probability Theory
[Figure: two fruits (orange, apple) and five boxes colored red, blue, brown, violet and yellow. X: Fruit, Y: Box.]
6. We will introduce some basic concepts of probability theory and information theory by considering the simple
example of fruits and boxes.
The probability distribution for a random variable describes how the probabilities are distributed over the values
of the random variable. It is the mathematical function that gives the probabilities of occurrence of different
possible outcomes.
Probability distribution
p(X='o') = 1
p(X='a') = 0
7. Probability distribution
p(X='o') = 0.5
p(X='a') = 0.5
8. Probability distribution
p(X='o') = 0.75
p(X='a') = 0.25
A probability distribution can be used to quantify the relative frequency of occurrence of uncertain events, and it is part of measurement uncertainty analysis.
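As a quick illustration of how such a distribution can be estimated from observed frequencies, here is a minimal Python sketch. The fruit counts below are made up for illustration (the slides do not give them); only the resulting probabilities match slide 8.

```python
from collections import Counter

# Hypothetical draws of fruits (illustrative only): 15 oranges ('o') and 5 apples ('a')
draws = ['o'] * 15 + ['a'] * 5

counts = Counter(draws)
total = sum(counts.values())

# Empirical probability distribution over the values of X
p = {value: count / total for value, count in counts.items()}
print(p)  # {'o': 0.75, 'a': 0.25}
```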
9. Information theory
Information theory is the mathematical study of the quantification, storage and communication of information.
Claude Shannon (1916 – 2001)
10. Information theory
Associated with information theory are the concepts of probability, uncertainty, communication and noise in data.
Low uncertainty: high knowledge, low information content, low entropy, no surprise.
High uncertainty: low knowledge, high information content, high entropy, great surprise.
15. The amount of information can be viewed as the ‘degree of surprise’ on learning the value of x. If we are told that a
highly improbable event has just occurred, we will have received more information than if we were told that some very
likely event has just occurred, and if we knew that the event was certain to happen we would receive no information.
Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look
for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content.
Information theory
Probability and information content:
p(X='o') = 1        h(X='o') = -log2 p(X='o') = 0
p(X='a') = 0.5      h(X='a') = -log2 p(X='a') = 1
p(X='a') = 0.125    h(X='a') = -log2 p(X='a') = 3
h(x) quantifies the amount of uncertainty (surprise) associated with observing x.
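These information-content values can be checked directly; a minimal Python sketch:

```python
import math

def information_content(p: float) -> float:
    """Self-information h(x) = -log2 p(x) = log2(1 / p(x)), measured in bits."""
    return math.log2(1 / p)

for p in (1.0, 0.5, 0.125):
    print(p, "->", information_content(p), "bits")
# 1.0 -> 0.0 bits, 0.5 -> 1.0 bits, 0.125 -> 3.0 bits, as on the slide
```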
16. Entropy is a probabilistic measure of uncertainty or ignorance. Information is a measure of a reduction in that
uncertainty.
Entropy
p(X='o') = 0.875    h(X='o') = -log2 p(X='o') = 0.193
p(X='a') = 0.125    h(X='a') = -log2 p(X='a') = 3
Given a probability distribution p(X) over K outcomes x1, ..., xK, the entropy H of the system can then be expressed as:
$H(X) = -\sum_{k=1}^{K} p(x_k) \log p(x_k)$
17. Entropy
Entropy H(X) reaches its maximum value when all outcomes of the random variable X have the same probability. H(X) expresses the uncertainty or ignorance about the system outcomes. H(X) = 0 if and only if one outcome has probability 1 and all the others have probability 0.
[Figure: entropy increasing from H(X) = 0 (no uncertainty) through H(X) = 0.54, 0.81 and 0.91, up to H(X) = 1 (maximum uncertainty).]
Entropy can be considered as a measure of the variability in a system.
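A minimal Python sketch of the entropy formula from slide 16. The intermediate values 0.54 and 0.81 on the slide are consistent with the binary distributions (0.875, 0.125) and (0.75, 0.25) seen earlier, although the slide does not state which distributions produced them.

```python
import math

def entropy(p):
    """Shannon entropy H(X) = sum_k p(x_k) log2(1 / p(x_k)), in bits (0 log 0 := 0)."""
    return sum(pk * math.log2(1 / pk) for pk in p if pk > 0)

print(entropy([1.0, 0.0]))                 # 0.0  -> no uncertainty
print(round(entropy([0.875, 0.125]), 2))   # 0.54
print(round(entropy([0.75, 0.25]), 2))     # 0.81
print(entropy([0.5, 0.5]))                 # 1.0  -> maximum uncertainty for two outcomes
```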
18. Probability Theory
Consider here a random variable X for Animal. X can take the values x1 = 'cat', x2 = 'elephant', x3 = 'horse' and x4 = 'dog'. We make the assumption of independent and identically distributed outcomes.
p(cat) = 5/20 = 0.25
p(elephant) = 4/20 = 0.2
p(horse) = 4/20 = 0.2
p(dog) = 7/20 = 0.35
$H(X) = -\sum_{k=1}^{K} p(x_k) \log_2 p(x_k) = 1.96$
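The entropy of this animal distribution can be verified numerically; a minimal Python sketch:

```python
import math

# Probabilities from the animal example: cat, elephant, horse, dog
p = [0.25, 0.2, 0.2, 0.35]

H = sum(pk * math.log2(1 / pk) for pk in p)   # same as -sum(pk * log2(pk))
print(round(H, 2))  # 1.96 bits
```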
23. Probability Theory
An observed sequence of outcomes, where 'c' = 'cat', 'e' = 'elephant', 'h' = 'horse' and 'd' = 'dog':
c d h e c d c c h e d d h e e
24. Probability Theory
c d h e c d c c h e d d h e e
[Figure: each outcome in the sequence is encoded as a one-hot vector over the four classes (x1 = 'cat', x2 = 'elephant', x3 = 'horse', x4 = 'dog'), e.g. 'c' ↦ (1, 0, 0, 0) and 'd' ↦ (0, 0, 0, 1).]
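A minimal Python sketch of this one-hot encoding of the observed sequence:

```python
# Class order follows slide 18: x1 = 'cat', x2 = 'elephant', x3 = 'horse', x4 = 'dog'
classes = ['c', 'e', 'h', 'd']
sequence = ['c', 'd', 'h', 'e', 'c', 'd', 'c', 'c', 'h', 'e', 'd', 'd', 'h', 'e', 'e']

def one_hot(label, classes):
    """Return the one-hot vector of `label` with respect to `classes`."""
    return [1 if c == label else 0 for c in classes]

encoded = [one_hot(x, classes) for x in sequence]
print(encoded[0])  # 'c' -> [1, 0, 0, 0]
print(encoded[1])  # 'd' -> [0, 0, 0, 1]
```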
32. Probability Theory
Six samples s1, ..., s6 are observed, with outcomes h, e, c, d, c, d. For each sample, the true (one-hot) distribution and a predicted distribution over the classes (cat, elephant, horse, dog) are:

Sample   Outcome   True (one-hot)   Predicted
s1       h         (0, 0, 1, 0)     (0.07, 0.01, 0.6, 0.3)
s2       e         (0, 1, 0, 0)     (0.03, 0.8, 0.1, 0.07)
s3       c         (1, 0, 0, 0)     (0.4, 0.05, 0.05, 0.5)
s4       d         (0, 0, 0, 1)     (0.4, 0.01, 0.09, 0.5)
s5       c         (1, 0, 0, 0)     (0.6, 0.02, 0.03, 0.35)
s6       d         (0, 0, 0, 1)     (0.28, 0.02, 0.1, 0.6)
33. K-L Divergence
We want a measure that allows us to estimate the deviation of the probability distribution q from the probability distribution p, where p is the true probability distribution and q is the predicted probability distribution.
For sample 1 (s1, outcome 'h'):
p(x1|s1), p(x2|s1), p(x3|s1), p(x4|s1) = 0, 0, 1, 0
q(x1|s1), q(x2|s1), q(x3|s1), q(x4|s1) = 0.07, 0.01, 0.6, 0.3
34. K-L Divergence
For simplicity of notation, we write p(xk) = p(xk|s1) and q(xk) = q(xk|s1). For sample s1:
p(x1), p(x2), p(x3), p(x4) = 0, 0, 1, 0
q(x1), q(x2), q(x3), q(x4) = 0.07, 0.01, 0.6, 0.3
35. K-L Divergence
The Kullback-Leibler divergence estimates the deviation of the probability distribution q from the probability distribution p; it plays the role of a "distance" between the two probability distributions (although it is not symmetric). For K classes x1, ..., xK:
$D_{KL}(p \| q) = \sum_{k=1}^{K} p(x_k) \log \frac{p(x_k)}{q(x_k)}$
where p is the true probability distribution and q is the predicted probability distribution. For sample s1: p = (0, 0, 1, 0) and q = (0.07, 0.01, 0.6, 0.3).
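A minimal Python sketch of this formula, applied to the sample s1 above (natural logarithm, so the result is in nats):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k); terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Sample s1: true one-hot distribution p and predicted distribution q (classes c, e, h, d)
p_s1 = [0, 0, 1, 0]
q_s1 = [0.07, 0.01, 0.6, 0.3]

print(round(kl_divergence(p_s1, q_s1), 3))  # -log(0.6) ≈ 0.511
```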
36. K-L Divergence
We can also estimate the deviation of the probability distribution q from the probability distribution p using N samples, by averaging the per-sample divergences between the conditional distributions p(xk|si) and q(xk|si):
$D_{KL}(p \| q) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} p(x_k|s_i) \log \frac{p(x_k|s_i)}{q(x_k|s_i)}$
For the six samples s1, ..., s6, p(·|si) is the one-hot distribution of the observed outcome and q(·|si) is the predicted distribution, as in the table of slide 32.
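A minimal Python sketch of this averaged divergence over the six samples of slide 32:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# True one-hot distributions p(.|s_i) and predicted distributions q(.|s_i), classes (c, e, h, d)
p_samples = [
    [0, 0, 1, 0],   # s1 = 'h'
    [0, 1, 0, 0],   # s2 = 'e'
    [1, 0, 0, 0],   # s3 = 'c'
    [0, 0, 0, 1],   # s4 = 'd'
    [1, 0, 0, 0],   # s5 = 'c'
    [0, 0, 0, 1],   # s6 = 'd'
]
q_samples = [
    [0.07, 0.01, 0.6, 0.3],
    [0.03, 0.8, 0.1, 0.07],
    [0.4, 0.05, 0.05, 0.5],
    [0.4, 0.01, 0.09, 0.5],
    [0.6, 0.02, 0.03, 0.35],
    [0.28, 0.02, 0.1, 0.6],
]

# Average K-L divergence over the N = 6 samples
avg_kl = sum(kl_divergence(p, q) for p, q in zip(p_samples, q_samples)) / len(p_samples)
print(round(avg_kl, 3))  # ≈ 0.561 nats
```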
37. K-L Divergence for Neural Networks
For a sample taken from the dataset whose true class is 'horse', the network predicts the distribution q = (0.07, 0.01, 0.6, 0.3), which is compared with the one-hot target p = (0, 0, 1, 0).

38. K-L Divergence for Neural Networks
For a sample from the dataset whose true class is 'cat', the network predicts q = (0.6, 0.02, 0.03, 0.35), which is compared with the one-hot target p = (1, 0, 0, 0).
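The session description mentions the link with the cross-entropy loss. Since the target p is one-hot, H(p) = 0 and therefore D_KL(p || q) equals the cross-entropy H(p, q) = -log q(true class), which is the usual classification loss minimized during training. A minimal Python sketch (no deep learning framework assumed):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k); terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

# Slide 37: one-hot target p and network prediction q (classes c, e, h, d)
p = [0, 0, 1, 0]
q = [0.07, 0.01, 0.6, 0.3]

# For a one-hot target, H(p) = 0, so D_KL(p || q) = H(p, q) = -log q(true class)
print(round(cross_entropy(p, q), 3))  # ≈ 0.511, identical to the K-L divergence above
```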