INTRO TO EDL
Federico Cerutti
Educational material for students at Cardiff University, UK, and the University of Brescia, Italy.
a primer in bayesian analysis
Bayesian Probabilities

Bayes theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\,p(Y)}{p(X)} \qquad (1)$$

where

$$p(X) = \sum_{Y} p(X \mid Y)\,p(Y) \qquad (2)$$
Suppose we randomly pick one of the boxes and from that
box we randomly select an item of fruit, and having observed
which sort of fruit it is we replace it in the box from which it
came.
We could imagine repeating this process many times. Let us
suppose that in so doing we pick the red box 40% of the time
and we pick the blue box 60% of the time, and that when we
remove an item of fruit from a box we are equally likely to
select any of the pieces of fruit in the box.
We are told that a piece of fruit has been selected and it is an orange. Which box did it come from?
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
$$p(B = r \mid F = o) = \frac{p(F = o \mid B = r)\,p(B = r)}{p(F = o)} = \frac{\frac{6}{8}\cdot\frac{4}{10}}{\frac{6}{8}\cdot\frac{4}{10} + \frac{1}{4}\cdot\frac{6}{10}} = \frac{3}{4}\cdot\frac{2}{5}\cdot\frac{20}{9} = \frac{2}{3} \qquad (3)$$
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
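A minimal Python sketch of this calculation (the box and fruit probabilities are those stated above):

```python
# Fruit-box example: posterior probability that the box is red, given an orange was drawn.
p_box = {"red": 0.4, "blue": 0.6}                   # p(B)
p_orange_given_box = {"red": 6 / 8, "blue": 1 / 4}  # p(F = o | B)

# Evidence: p(F = o) = sum_B p(F = o | B) p(B)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes theorem: p(B = r | F = o)
print(p_orange_given_box["red"] * p_box["red"] / p_orange)  # 0.666... = 2/3
```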
Given the parameters of our model w, we can capture our assumptions about w, before
observing the data, in the form of a prior probability distribution p(w). The effect of the
observed data D = {t1, . . . , tN } is expressed through the conditional p(D |w), hence Bayes
theorem takes the form:
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{\overbrace{p(\mathcal{D} \mid \mathbf{w})}^{\text{likelihood}}\;\overbrace{p(\mathbf{w})}^{\text{prior}}}{p(\mathcal{D})} \qquad (4)$$

$$\text{posterior} \propto \text{likelihood} \cdot \text{prior} \qquad (5)$$

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})\,\mathrm{d}\mathbf{w} \qquad (6)$$

The denominator p(D) ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
Frequentist paradigm
• w is considered to be a fixed parameter, whose value is determined by some form of estimator, e.g. maximum likelihood, in which w is set to the value that maximises p(D |w).
• Error bars on this estimate are obtained
by considering the distribution of
possible data sets D .
• The negative log of the likelihood function is called an error function: since the negative log is a monotonically decreasing function, maximising the likelihood is equivalent to minimising the error.
Bayesian paradigm
• There is a single data set D (the one observed), and the uncertainty in the parameters is expressed through a probability distribution over w.
• The inclusion of prior knowledge arises
naturally: suppose that a fair-looking
coin is tossed three times and lands
heads each time. A classical maximum
likelihood estimate of the probability of
landing heads would give 1.
There are cases where you want to reduce the dependence on the prior, hence the use of noninformative priors.
Binary variable: Bernoulli
Let us consider a single binary random variable x ∈ {0, 1}, e.g. flipping a coin, not necessarily fair, hence the probability is conditioned on a parameter 0 ≤ µ ≤ 1:

$$p(x = 1 \mid \mu) = \mu \qquad (7)$$

The probability distribution over x is known as the Bernoulli distribution:

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x} \qquad (8)$$

$$\mathbb{E}[x] = \mu \qquad (9)$$
Binomial distribution
The distribution of the number m of observations of x = 1, given the dataset size N, is the Binomial distribution:

$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m} \qquad (10)$$

with

$$\mathbb{E}[m] \equiv \sum_{m=0}^{N} m\,\mathrm{Bin}(m \mid N, \mu) = N\mu \qquad (11)$$

and

$$\mathrm{var}[m] \equiv \sum_{m=0}^{N} (m - \mathbb{E}[m])^{2}\,\mathrm{Bin}(m \mid N, \mu) = N\mu(1-\mu) \qquad (12)$$
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
How many times, over N = 10 runs, would you see x = 1 if µ = 0.25?
[Figure: histogram of Bin(m | N = 10, µ = 0.25), with m = 0, …, 10 on the horizontal axis and probability (0–0.3) on the vertical axis.]
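As a quick check, a minimal sketch of this distribution using scipy.stats (assuming scipy is available):

```python
from scipy.stats import binom

N, mu = 10, 0.25
for m in range(N + 1):
    print(m, round(binom.pmf(m, N, mu), 3))   # probability of m successes out of N

print(binom.mean(N, mu), binom.var(N, mu))    # N*mu = 2.5 and N*mu*(1-mu) = 1.875
```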
Let’s go back to the Bernoulli distribution
Now suppose that we have a data set of observations x = (x1, . . . , xN )T
drawn independently
from a Bernoulli distribution (iid) whose mean µ is unknown, and we would like to
determine this parameter from the data set.
$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n} \qquad (13)$$

Let's maximise the (log-)likelihood to identify the parameter (the log simplifies the algebra and reduces the risk of underflow):

$$\ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N}\ln p(x_n \mid \mu) = \sum_{n=1}^{N}\left\{x_n \ln\mu + (1 - x_n)\ln(1-\mu)\right\} \qquad (14)$$
The log likelihood depends on the N observations xₙ only through their sum $\sum_n x_n$; hence the sum provides an example of a sufficient statistic for the data under this distribution,

“hence no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter” (Fisher, 1922)
$$\frac{\mathrm{d}}{\mathrm{d}\mu}\ln p(\mathcal{D} \mid \mu) = 0$$

$$\sum_{n=1}^{N}\left(\frac{x_n}{\mu} - \frac{1 - x_n}{1 - \mu}\right) = 0$$

$$\sum_{n=1}^{N}\frac{x_n - \mu}{\mu(1-\mu)} = 0$$

$$\sum_{n=1}^{N} x_n = N\mu$$

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

a.k.a. the sample mean. Risk of overfitting: consider tossing the coin three times and obtaining heads each time.
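A minimal sketch of the maximum likelihood estimate; the flips below are hypothetical:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical coin flips (1 = heads)
mu_ml = x.mean()                         # mu_ML = (1/N) * sum_n x_n, the sample mean
print(mu_ml)                             # 0.625

# Overfitting risk: three tosses, all heads, gives mu_ML = 1.0
print(np.array([1, 1, 1]).mean())
```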
In order to develop a Bayesian treatment of the overfitting problem of the maximum likelihood estimator for the Bernoulli, note that the likelihood takes the form of a product of factors of the form $\mu^{x}(1-\mu)^{1-x}$. If we choose a prior proportional to powers of µ and (1 − µ), then the posterior distribution, being proportional to the product of the prior and the likelihood, will have the same functional form as the prior. This property is called conjugacy.
Binary variables: Beta distribution
$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}$$

with

$$\Gamma(x) \equiv \int_{0}^{\infty} u^{x-1}e^{-u}\,\mathrm{d}u$$

$$\mathbb{E}[\mu] = \frac{a}{a+b} \qquad \mathrm{var}[\mu] = \frac{ab}{(a+b)^{2}(a+b+1)}$$

a and b are hyperparameters controlling the distribution of parameter µ.
[Figure: Beta(µ | a, b) densities for (a, b) = (0.1, 0.1), (1, 1), (2, 3), and (8, 4), each plotted for µ ∈ [0, 1].]
Images from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
Considering a beta distribution prior and the binomial likelihood function, and given
l = N − m
$$p(\mu \mid m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}$$

Hence p(µ|m, l, a, b) is another beta distribution and we can rearrange the normalisation coefficient as follows:

$$p(\mu \mid m, l, a, b) = \frac{\Gamma(m + a + l + b)}{\Gamma(m + a)\,\Gamma(l + b)}\,\mu^{m+a-1}(1-\mu)^{l+b-1}$$
[Figure: prior, likelihood function, and posterior over µ, each plotted for µ ∈ [0, 1].]
Images from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
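A minimal sketch of the conjugate update (the prior hyperparameters and observation counts below are hypothetical):

```python
from scipy.stats import beta

a, b = 2, 2          # hypothetical Beta prior hyperparameters
m, N = 3, 4          # m observations of x = 1 out of N
l = N - m

posterior = beta(a + m, b + l)            # Beta(mu | a + m, b + l)
print(posterior.mean())                   # (a + m) / (a + m + b + l)
print(posterior.interval(0.95))           # 95% credible interval for mu
```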
Epistemic vs Aleatoric uncertainty
Aleatoric uncertainty
Variability in the outcome of an experiment that is due to inherently random effects (e.g. flipping a fair coin): no additional source of information, short of Laplace's daemon, can reduce such variability.
Epistemic uncertainty
Epistemic state of the agent using the model,
hence its lack of knowledge that—in
principle—can be reduced on the basis of
additional data samples.
It is a general property of Bayesian learning
that, as we observe more and more data, the
epistemic uncertainty represented by the
posterior distribution will steadily decrease
(the variance decreases).
See notebook at https://nbviewer.jupyter.org/federicocerutti/UncertaintyAwarenessResources/blob/master/notebooks/Beta.ipynb
Multinomial variables: categorical distribution
Let us suppose we roll a die with K = 6 faces. An observation of this variable x, e.g. x₃ = 1 (the number 3 face up), can be represented as:

$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{T}$$

Note that such vectors must satisfy $\sum_{k=1}^{K} x_k = 1$.

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k}$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{T}$, and the parameters $\mu_k$ are such that $\mu_k \geq 0$ and $\sum_k \mu_k = 1$.

This is a generalisation of the Bernoulli distribution.
$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}}$$

The likelihood depends on the N data points only through the K quantities

$$m_k = \sum_{n} x_{nk}$$

which represent the number of observations of $x_k = 1$ (e.g. with k = 3, the third face of the die). These are called the sufficient statistics for this distribution.
Finding the maximum likelihood solution requires a Lagrange multiplier λ, maximising

$$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$$

Hence

$$\mu_k^{ML} = \frac{m_k}{N}$$

which is the fraction of the N observations for which $x_k = 1$.
Multinomial variables: the Dirichlet distribution
The Dirichlet distribution is the generalisation of the beta distribution to K dimensions.
$$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k - 1}$$

such that $\sum_k \mu_k = 1$, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)^{T}$, $\alpha_k \geq 0$ and

$$\alpha_0 = \sum_{k=1}^{K}\alpha_k$$
Considering a Dirichlet distribution prior and the categorical likelihood function, the
posterior is then:
$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1)\cdots\Gamma(\alpha_K + m_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$$

The uniform prior is given by Dir(µ|1) and the Jeffreys' non-informative prior is given by Dir(µ|(0.5, …, 0.5)ᵀ).

The marginals of a Dirichlet distribution are beta distributions.
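A minimal sketch of this update for a six-faced die (the counts below are hypothetical):

```python
import numpy as np

alpha_prior = np.ones(6)                  # uniform prior Dir(mu | 1)
counts = np.array([3, 1, 0, 2, 4, 0])     # hypothetical counts m_k for each face
alpha_post = alpha_prior + counts         # posterior Dir(mu | alpha + m)

print(alpha_post)
print(alpha_post / alpha_post.sum())      # posterior mean E[mu_k] = alpha_k / alpha_0
```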
neural networks and uncertainty awareness
Change the loss function so as to output pieces of evidence in favour of the different classes, which are then combined through a Bayesian update, resulting in a Dirichlet distribution.
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
From Evidence to Dirichlet
Let us now assume a Dirichlet distribution over K classes that is the result of a Bayesian update with N observations, starting from a uniform prior:

$$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu} \mid e_1 + 1, e_2 + 1, \ldots, e_K + 1)$$

where $e_k$ is the number of observations (evidence) for class $k$, and $\sum_k e_k = N$.
Dirichlet and Epistemic Uncertainty
The epistemic uncertainty associated with a Dirichlet distribution Dir(µ | α) is given by

$$u = \frac{K}{S}$$

with K the number of classes and $S = \alpha_0 = \sum_{k=1}^{K}\alpha_k$ the Dirichlet strength.

Note that if the Dirichlet has been computed as the result of a Bayesian update from a uniform prior, then 0 ≤ u ≤ 1, and u = 1 implies that we are considering the uniform distribution (an extreme case of Dirichlet distribution).

Let us denote with $\mu_k = \frac{\alpha_k}{S}$ the expected probability of class k.
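A minimal sketch of this computation (the evidence vectors below are hypothetical):

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Epistemic uncertainty u = K / S for a Dirichlet built from evidence via a uniform prior."""
    alpha = np.asarray(evidence, dtype=float) + 1.0  # Bayesian update from Dir(1, ..., 1)
    K, S = alpha.size, alpha.sum()                   # number of classes, Dirichlet strength
    return K / S, alpha / S                          # uncertainty u and expected probabilities mu_k

print(dirichlet_uncertainty([0, 0, 0]))    # no evidence: u = 1 (uniform distribution)
print(dirichlet_uncertainty([30, 1, 1]))   # strong evidence for class 1: u is small
```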
Loss function
If we then consider Dir(µi | αi) as the prior for a multinomial p(yi | µi), we can then compute the expected squared error (a.k.a. Brier score):

$$\mathbb{E}\left[\|\mathbf{y}_i - \boldsymbol{\mu}_i\|_2^2\right] = \sum_{k=1}^{K}\mathbb{E}\left[y_{i,k}^2 - 2 y_{i,k}\mu_{i,k} + \mu_{i,k}^2\right] = \sum_{k=1}^{K}\left(y_{i,k}^2 - 2 y_{i,k}\mathbb{E}[\mu_{i,k}] + \mathbb{E}[\mu_{i,k}^2]\right)$$

$$= \sum_{k=1}^{K}\left(y_{i,k}^2 - 2 y_{i,k}\mathbb{E}[\mu_{i,k}] + \mathbb{E}[\mu_{i,k}]^2 + \mathrm{var}[\mu_{i,k}]\right) = \sum_{k=1}^{K}\left(\left(y_{i,k} - \mathbb{E}[\mu_{i,k}]\right)^2 + \mathrm{var}[\mu_{i,k}]\right)$$

$$= \sum_{k=1}^{K}\left(\left(y_{i,k} - \frac{\alpha_{i,k}}{S_i}\right)^2 + \frac{\alpha_{i,k}(S_i - \alpha_{i,k})}{S_i^2(S_i + 1)}\right) = \sum_{k=1}^{K}\left((y_{i,k} - \mu_{i,k})^2 + \frac{\mu_{i,k}(1 - \mu_{i,k})}{S_i + 1}\right)$$

The loss over a batch of training samples is the sum of the losses of the individual samples in the batch.

Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification uncertainty.” Advances in Neural Information Processing Systems. 2018.
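A minimal numpy sketch of this expected squared error, assuming the network outputs a non-negative evidence vector per sample (the numbers below are hypothetical):

```python
import numpy as np

def edl_mse_loss(evidence, y_onehot):
    """Expected Brier score for Dir(mu_i | alpha_i) with alpha_i = evidence_i + 1 (Sensoy et al., 2018)."""
    alpha = evidence + 1.0
    S = alpha.sum(axis=-1, keepdims=True)    # Dirichlet strength S_i per sample
    mu = alpha / S                           # expected class probabilities mu_{i,k}
    err = (y_onehot - mu) ** 2               # squared-error term
    var = mu * (1.0 - mu) / (S + 1.0)        # variance term
    return (err + var).sum(axis=-1).sum()    # sum over classes, then over the batch

evidence = np.array([[10.0, 0.5, 0.2], [0.1, 0.2, 0.1]])  # hypothetical outputs, 2 samples, 3 classes
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(edl_mse_loss(evidence, y))
```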
Learning to say “I don’t know”
To avoid generating evidence for all the classes when the network cannot classify a given
sample (epistemic uncertainty), we introduce a term in the loss function that penalises the
divergence from the uniform distribution:
$$\mathcal{L} = \sum_{i=1}^{N}\mathbb{E}\left[\|\mathbf{y}_i - \boldsymbol{\mu}_i\|_2^2\right] + \lambda_t \sum_{i=1}^{N}\mathrm{KL}\left(\mathrm{Dir}(\boldsymbol{\mu}_i \mid \tilde{\boldsymbol{\alpha}}_i)\,\big\|\,\mathrm{Dir}(\boldsymbol{\mu}_i \mid \mathbf{1})\right)$$

where:
• λt is another hyperparameter; the suggestion is to make it depend on the number of training epochs, e.g. λt = min(1, t / CONST) with t the current training epoch, so that the effect of the KL divergence is gradually increased, avoiding premature convergence to the uniform distribution in the early epochs, when the learning algorithm still needs to explore the parameter space;
• α̃i = yi + (1 − yi) · αi (element-wise) are the Dirichlet parameters with the evidence for the correct class removed, i.e. the evidence the neural network has put on the wrong classes in a forward pass; the idea is to minimise this as much as possible.

Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification uncertainty.” Advances in Neural Information Processing Systems. 2018.
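A minimal sketch of the annealing coefficient; the constant (here called anneal_epochs) is a hypothetical choice:

```python
def kl_annealing_coefficient(epoch, anneal_epochs=10):
    """lambda_t = min(1, t / CONST): gradually switch on the KL regulariser."""
    return min(1.0, epoch / anneal_epochs)

print([round(kl_annealing_coefficient(t), 2) for t in range(0, 15, 2)])
```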
KL recap
Consider some unknown distribution p(x) and suppose that we have modelled this using
q(x). If we use q(x) instead of p(x) to represent the true values of x, the average additional
amount of information required is:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,\mathrm{d}x - \left(-\int p(x)\ln p(x)\,\mathrm{d}x\right) = -\int p(x)\ln\frac{q(x)}{p(x)}\,\mathrm{d}x = -\mathbb{E}_{p}\left[\ln\frac{q(x)}{p(x)}\right] \qquad (15)$$

This is known as the relative entropy or Kullback–Leibler divergence, or KL divergence, between the distributions p(x) and q(x).

Properties:
• KL(p‖q) ≢ KL(q‖p): the KL divergence is not symmetric;
• KL(p‖q) ≥ 0 and KL(p‖q) = 0 if and only if p = q.
$$\mathrm{KL}\left(\mathrm{Dir}(\boldsymbol{\mu}_i \mid \boldsymbol{\alpha}_i)\,\big\|\,\mathrm{Dir}(\boldsymbol{\mu}_i \mid \mathbf{1})\right) = \ln\frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_{i,k}\right)}{\Gamma(K)\prod_{k=1}^{K}\Gamma(\alpha_{i,k})} + \sum_{k=1}^{K}(\alpha_{i,k} - 1)\left[\psi(\alpha_{i,k}) - \psi\!\left(\sum_{j=1}^{K}\alpha_{i,j}\right)\right]$$

where $\psi(x) = \frac{\mathrm{d}}{\mathrm{d}x}\ln\Gamma(x)$ is the digamma function.

Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification uncertainty.” Advances in Neural Information Processing Systems. 2018.
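A minimal sketch of this divergence using scipy.special (the parameter vectors below are hypothetical):

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet_vs_uniform(alpha):
    """KL( Dir(mu | alpha) || Dir(mu | 1) ) for a single parameter vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    K, S = alpha.size, alpha.sum()
    log_norm = gammaln(S) - gammaln(K) - gammaln(alpha).sum()   # ln Gamma(S) / (Gamma(K) prod_k Gamma(alpha_k))
    return log_norm + np.sum((alpha - 1.0) * (digamma(alpha) - digamma(S)))

print(kl_dirichlet_vs_uniform([1.0, 1.0, 1.0]))    # 0: identical to the uniform Dirichlet
print(kl_dirichlet_vs_uniform([10.0, 1.0, 1.0]))   # > 0: concentrated away from uniform
```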
EDL and robustness to the Fast Gradient Sign (FGS) attack
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
EDL + GAN for adversarial training
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
VAE + GAN

For each data point in latent space, we generate a new noisy sample, which is similar to it to some extent. Hence, we avoid the mode-collapse problem. If the noise distribution is too close to the data distribution, the density ratio would be trivially one and the learning will be deprived.

G: generator in the latent space of the VAE; D′: discriminator in the latent space; D: discriminator in the input space.

[Figure: original training samples (top), samples reconstructed by the VAE (middle), and samples generated by the proposed method (bottom) over a number of epochs.]

Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
Robustness against the Fast Gradient Sign (FGS) attack
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
Anomaly detection
[Figure: anomaly detection results on MNIST and CIFAR-10.]
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020