The fourth session of the machine learning book reading series.
Video: https://youtu.be/Ab5RvD7ieFg
It gives a brief introduction to information theory (entropy, K-L divergence, mutual information, ...) and its application to loss functions, in particular the cross-entropy.
A reading of three books, as part of "Monday reading books on machine learning".
The first book, which serves as the guiding thread for the whole series:
Christopher Bishop; Pattern Recognition and Machine Learning, Springer-Verlag New York Inc, 2006
Parts of two other books will also be used, mainly:
Ian Goodfellow, Yoshua Bengio, Aaron Courville; Deep Learning, The MIT Press, 2016
and:
Ovidiu Calin; Deep Learning Architectures: A Mathematical Approach, Springer, 2020
Mrbml004: Introduction to Information Theory for Machine Learning
1. Monday reading books on Machine Learning
JAOUAD DABOUNOU
FST of Settat
Hassan 1st University
February 21, 2022
004 – Introduction
Probability Theory
2. Introduction
Starting Monday, January 31, a reading of three books, as part of "Monday reading books on machine learning".
The first book, which serves as the guiding thread for the whole series:
Christopher Bishop; Pattern Recognition and Machine Learning, Springer-Verlag New York Inc, 2006
Parts of two other books will also be used, mainly:
Ian Goodfellow, Yoshua Bengio, Aaron Courville; Deep Learning, The MIT Press, 2016
and:
Ovidiu Calin; Deep Learning Architectures: A Mathematical Approach, Springer, 2020
5. Consider two random variables X for Fruit and Y for Box.
X can take the values x1 = 'o' and x2 = 'a'.
Y can take the values y1 = 'r', y2 = 'b', y3 = 'br', y4 = 'v' and y5 = 'y' corresponding to the box color.
Probability Theory
[Figure: two fruits (orange, apple) and five boxes colored red, blue, brown, violet and yellow. X: Fruit, Y: Box.]
6. We will introduce some basic concepts of probability theory and information theory by considering the simple
example of fruits and boxes.
The probability distribution for a random variable describes how the probabilities are distributed over the values
of the random variable. It is the mathematical function that gives the probabilities of occurrence of different
possible outcomes.
Probability distribution
p(X='o') = 1
p(X='a') = 0
7. Probability distribution
p(X='o') = 0.5
p(X='a') = 0.5
8. Probability distribution
p(X='o') = 0.75
p(X='a') = 0.25
A probability distribution can be used to quantify the relative frequency of occurrence of uncertain events, and it is part of measurement uncertainty analysis.
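As a quick illustration of how such a distribution can be estimated from observed frequencies, here is a minimal Python sketch. The fruit counts below are made up for illustration (the slides do not give them); only the resulting probabilities match slide 8.

```python
from collections import Counter

# Hypothetical draws of fruits (illustrative only): 15 oranges ('o') and 5 apples ('a')
draws = ['o'] * 15 + ['a'] * 5

counts = Counter(draws)
total = sum(counts.values())

# Empirical probability distribution over the values of X
p = {value: count / total for value, count in counts.items()}
print(p)  # {'o': 0.75, 'a': 0.25}
```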
9. Information theory
Information theory is the mathematical study of the quantification, storage and communication of information.
Claude Shannon (1916 – 2001)
10. Information theory
Associated with information theory are the concepts of probability, uncertainty, communication and noise in data.
Low uncertainty: high knowledge, low information content, low entropy, no surprise.
High uncertainty: low knowledge, high information content, high entropy, great surprise.
15. The amount of information can be viewed as the ‘degree of surprise’ on learning the value of x. If we are told that a
highly improbable event has just occurred, we will have received more information than if we were told that some very
likely event has just occurred, and if we knew that the event was certain to happen we would receive no information.
Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look
for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content.
Information theory
Probability and information content:
p(X='o') = 1        h(X='o') = -log2 p(X='o') = 0
p(X='a') = 0.5      h(X='a') = -log2 p(X='a') = 1
p(X='a') = 0.125    h(X='a') = -log2 p(X='a') = 3
h(x) quantifies the amount of uncertainty (surprise) associated with observing x.
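These information-content values can be checked directly; a minimal Python sketch:

```python
import math

def information_content(p: float) -> float:
    """Self-information h(x) = -log2 p(x) = log2(1 / p(x)), measured in bits."""
    return math.log2(1 / p)

for p in (1.0, 0.5, 0.125):
    print(p, "->", information_content(p), "bits")
# 1.0 -> 0.0 bits, 0.5 -> 1.0 bits, 0.125 -> 3.0 bits, as on the slide
```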
16. Entropy is a probabilistic measure of uncertainty or ignorance. Information is a measure of a reduction in that
uncertainty.
Entropy
p(X='o') = 0.875    h(X='o') = -log2 p(X='o') = 0.193
p(X='a') = 0.125    h(X='a') = -log2 p(X='a') = 3
Given a probability distribution p(X) over K outcomes x1, ..., xK, the entropy H of the system can then be expressed as:
$H(X) = -\sum_{k=1}^{K} p(x_k) \log p(x_k)$
17. Entropy
Entropy H(X) reaches its maximum value when all outcomes of the random variable X have the same probability. H(X) expresses the uncertainty or ignorance about the system outcomes. H(X) = 0 if and only if one outcome has probability 1 and all the others have probability 0.
[Figure: entropy increasing from H(X) = 0 (no uncertainty) through H(X) = 0.54, 0.81 and 0.91, up to H(X) = 1 (maximum uncertainty).]
Entropy can be considered as a measure of the variability in a system.
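A minimal Python sketch of the entropy formula from slide 16. The intermediate values 0.54 and 0.81 on the slide are consistent with the binary distributions (0.875, 0.125) and (0.75, 0.25) seen earlier, although the slide does not state which distributions produced them.

```python
import math

def entropy(p):
    """Shannon entropy H(X) = sum_k p(x_k) log2(1 / p(x_k)), in bits (0 log 0 := 0)."""
    return sum(pk * math.log2(1 / pk) for pk in p if pk > 0)

print(entropy([1.0, 0.0]))                 # 0.0  -> no uncertainty
print(round(entropy([0.875, 0.125]), 2))   # 0.54
print(round(entropy([0.75, 0.25]), 2))     # 0.81
print(entropy([0.5, 0.5]))                 # 1.0  -> maximum uncertainty for two outcomes
```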
18. Probability Theory
Consider here a random variable X for Animal. X can take the values x1 = 'cat', x2 = 'elephant', x3 = 'horse' and x4 = 'dog'. We make the assumption of independent and identically distributed outcomes.
p(cat) = 5/20 = 0.25
p(elephant) = 4/20 = 0.2
p(horse) = 4/20 = 0.2
p(dog) = 7/20 = 0.35
$H(X) = -\sum_{k=1}^{K} p(x_k) \log_2 p(x_k) = 1.96$
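The entropy of this animal distribution can be verified numerically; a minimal Python sketch:

```python
import math

# Probabilities from the animal example: cat, elephant, horse, dog
p = [0.25, 0.2, 0.2, 0.35]

H = sum(pk * math.log2(1 / pk) for pk in p)   # same as -sum(pk * log2(pk))
print(round(H, 2))  # 1.96 bits
```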
23. Probability Theory
An observed sequence of outcomes, where 'c' = 'cat', 'e' = 'elephant', 'h' = 'horse' and 'd' = 'dog':
c d h e c d c c h e d d h e e
24. Probability Theory
c d h e c d c c h e d d h e e
[Figure: each outcome in the sequence is encoded as a one-hot vector over the four classes (x1 = 'cat', x2 = 'elephant', x3 = 'horse', x4 = 'dog'), e.g. 'c' ↦ (1, 0, 0, 0) and 'd' ↦ (0, 0, 0, 1).]
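A minimal Python sketch of this one-hot encoding of the observed sequence:

```python
# Class order follows slide 18: x1 = 'cat', x2 = 'elephant', x3 = 'horse', x4 = 'dog'
classes = ['c', 'e', 'h', 'd']
sequence = ['c', 'd', 'h', 'e', 'c', 'd', 'c', 'c', 'h', 'e', 'd', 'd', 'h', 'e', 'e']

def one_hot(label, classes):
    """Return the one-hot vector of `label` with respect to `classes`."""
    return [1 if c == label else 0 for c in classes]

encoded = [one_hot(x, classes) for x in sequence]
print(encoded[0])  # 'c' -> [1, 0, 0, 0]
print(encoded[1])  # 'd' -> [0, 0, 0, 1]
```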
32. Probability Theory
Six samples s1, ..., s6 are observed, with outcomes h, e, c, d, c, d. For each sample, the true (one-hot) distribution and a predicted distribution over the classes (cat, elephant, horse, dog) are:

Sample   Outcome   True (one-hot)   Predicted
s1       h         (0, 0, 1, 0)     (0.07, 0.01, 0.6, 0.3)
s2       e         (0, 1, 0, 0)     (0.03, 0.8, 0.1, 0.07)
s3       c         (1, 0, 0, 0)     (0.4, 0.05, 0.05, 0.5)
s4       d         (0, 0, 0, 1)     (0.4, 0.01, 0.09, 0.5)
s5       c         (1, 0, 0, 0)     (0.6, 0.02, 0.03, 0.35)
s6       d         (0, 0, 0, 1)     (0.28, 0.02, 0.1, 0.6)
33. K-L Divergence
We want a measure that allows us to estimate the deviation of the probability distribution q from the probability distribution p, where p is the true probability distribution and q is the predicted probability distribution.
For sample 1 (s1, outcome 'h'):
p(x1|s1), p(x2|s1), p(x3|s1), p(x4|s1) = 0, 0, 1, 0
q(x1|s1), q(x2|s1), q(x3|s1), q(x4|s1) = 0.07, 0.01, 0.6, 0.3
34. K-L Divergence
For simplicity of notation, we write p(xk) = p(xk|s1) and q(xk) = q(xk|s1). For sample s1:
p(x1), p(x2), p(x3), p(x4) = 0, 0, 1, 0
q(x1), q(x2), q(x3), q(x4) = 0.07, 0.01, 0.6, 0.3
35. K-L Divergence
The Kullback-Leibler divergence estimates the deviation of the probability distribution q from the probability distribution p; it plays the role of a "distance" between the two probability distributions (although it is not symmetric). For K classes x1, ..., xK:
$D_{KL}(p \| q) = \sum_{k=1}^{K} p(x_k) \log \frac{p(x_k)}{q(x_k)}$
where p is the true probability distribution and q is the predicted probability distribution. For sample s1: p = (0, 0, 1, 0) and q = (0.07, 0.01, 0.6, 0.3).
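A minimal Python sketch of this formula, applied to the sample s1 above (natural logarithm, so the result is in nats):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k); terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Sample s1: true one-hot distribution p and predicted distribution q (classes c, e, h, d)
p_s1 = [0, 0, 1, 0]
q_s1 = [0.07, 0.01, 0.6, 0.3]

print(round(kl_divergence(p_s1, q_s1), 3))  # -log(0.6) ≈ 0.511
```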
36. K-L Divergence
We can also estimate the deviation of the probability distribution q from the probability distribution p using N samples, by averaging the per-sample divergences between the conditional distributions p(xk|si) and q(xk|si):
$D_{KL}(p \| q) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} p(x_k|s_i) \log \frac{p(x_k|s_i)}{q(x_k|s_i)}$
For the six samples s1, ..., s6, p(·|si) is the one-hot distribution of the observed outcome and q(·|si) is the predicted distribution, as in the table of slide 32.
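A minimal Python sketch of this averaged divergence over the six samples of slide 32:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) with the convention 0 * log(0 / q) = 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# True one-hot distributions p(.|s_i) and predicted distributions q(.|s_i), classes (c, e, h, d)
p_samples = [
    [0, 0, 1, 0],   # s1 = 'h'
    [0, 1, 0, 0],   # s2 = 'e'
    [1, 0, 0, 0],   # s3 = 'c'
    [0, 0, 0, 1],   # s4 = 'd'
    [1, 0, 0, 0],   # s5 = 'c'
    [0, 0, 0, 1],   # s6 = 'd'
]
q_samples = [
    [0.07, 0.01, 0.6, 0.3],
    [0.03, 0.8, 0.1, 0.07],
    [0.4, 0.05, 0.05, 0.5],
    [0.4, 0.01, 0.09, 0.5],
    [0.6, 0.02, 0.03, 0.35],
    [0.28, 0.02, 0.1, 0.6],
]

# Average K-L divergence over the N = 6 samples
avg_kl = sum(kl_divergence(p, q) for p, q in zip(p_samples, q_samples)) / len(p_samples)
print(round(avg_kl, 3))  # ≈ 0.561 nats
```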
37. K-L Divergence for Neural Networks
For a sample taken from the dataset whose true class is 'horse', the network predicts the distribution q = (0.07, 0.01, 0.6, 0.3), which is compared with the one-hot target p = (0, 0, 1, 0).

38. K-L Divergence for Neural Networks
For a sample from the dataset whose true class is 'cat', the network predicts q = (0.6, 0.02, 0.03, 0.35), which is compared with the one-hot target p = (1, 0, 0, 0).
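The session description mentions the link with the cross-entropy loss. Since the target p is one-hot, H(p) = 0 and therefore D_KL(p || q) equals the cross-entropy H(p, q) = -log q(true class), which is the usual classification loss minimized during training. A minimal Python sketch (no deep learning framework assumed):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k); terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

# Slide 37: one-hot target p and network prediction q (classes c, e, h, d)
p = [0, 0, 1, 0]
q = [0.07, 0.01, 0.6, 0.3]

# For a one-hot target, H(p) = 0, so D_KL(p || q) = H(p, q) = -log q(true class)
print(round(cross_entropy(p, q), 3))  # ≈ 0.511, identical to the K-L divergence above
```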