Understanding Neural Network Priors at the Units Level
&
Sub-Weibull Property
Julyan Arbel
julyan.arbel@inria.fr, www.julyanarbel.com
Joint work with Mariia Vladimirova, Stéphane Girard and others
Inria Grenoble Rhône-Alpes
UCD Stat Seminar – November 23, 2020
Outline
• Introduction to Bayesian Deep Learning
• Wide limit behavior of Bayesian Neural Networks
• Sub-Weibull distributions
• Understanding Neural Network Priors at the Units Level
Bayesian approach to DL
The distinguishing feature of the Bayesian approach is marginalization instead of optimization.
Bayesian model averaging (BMA). We want to obtain a predictive distribution for x given data D:

p(x | D) = ∫_Θ p(x | θ) p(θ | D) dθ,

where p(x | θ) is the model and p(θ | D) the posterior. This can also be a conditional predictive if we are in a regression or classification setting:

p(y | x, D) = ∫_W p(y | x, w) p(w | D) dw.

This is especially hard when the dimension of W is of the order of 10^6.
Cf. Wilson [2020] for a discussion and review.
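The BMA integral above is typically approximated by Monte Carlo over posterior samples. A minimal sketch, assuming a hypothetical toy model y | x, w ~ N(w·x, 1) and pre-computed posterior draws of a scalar weight w (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of a scalar weight w (e.g. from MCMC),
# for the toy model y | x, w ~ N(w * x, 1).
w_samples = rng.normal(loc=1.0, scale=0.2, size=5_000)

def bma_predictive_density(y, x, w_samples):
    """p(y | x, D) ~= (1/S) * sum_s p(y | x, w_s): average the model's
    density over posterior draws instead of plugging in a point estimate."""
    means = w_samples * x
    densities = np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi)
    return densities.mean()

p = bma_predictive_density(y=2.0, x=2.0, w_samples=w_samples)
```

For this conjugate toy case the exact predictive is Gaussian with variance 1 + x²·0.2², so the Monte Carlo average at y = 2, x = 2 should be close to 0.37.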
Uncertainty
Epistemic uncertainty: also known as model uncertainty, it represents uncertainty over which base hypothesis (or parameter) is correct given the amount of available data.
Aleatoric uncertainty: essentially, noise from the data measurements.
Thus, a Bayesian approach accounts for epistemic uncertainty in a principled way: this uncertainty is carried over to the posterior distribution on parameter space.
Link between Bayesian learning and regularized MLE
The Maximum a Posteriori (MAP) estimator is a penalized Maximum Likelihood Estimator (MLE).
It is akin to a Bayesian estimator under the 0-1 loss, but it is not Bayesian: it is obtained by optimization, not marginalization.

max_W π(W | D) ∝ L(D | W) π(W)
⟺ min_W − log L(D | W) − log π(W)
⟺ min_W L(W) + λ R(W)

Here L(W) is a loss function and R(W), typically a norm on R^p, is a regularizer.

[Figure: prior draws of pairs of units (U1, U2) at layers 1, 2, 3 and 10.]
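The MAP–penalization correspondence can be made concrete for linear regression: a Gaussian likelihood combined with a Gaussian prior N(0, σ_p² I) yields a MAP estimator that is exactly ridge regression with λ = σ_n²/σ_p². A minimal sketch on synthetic data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear-regression data.
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

sigma2_noise, sigma2_prior = 0.1**2, 1.0

# MAP under a N(y | X w, sigma2_noise) likelihood and a N(w | 0, sigma2_prior I)
# prior = ridge regression with penalty lam = sigma2_noise / sigma2_prior.
lam = sigma2_noise / sigma2_prior
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

With this mild penalty and low noise, w_map recovers w_true closely, while larger λ (a tighter prior) would shrink it toward zero.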
Challenges of Bayesian Deep Learning
• Striving to build more interpretable parameter priors
Vague priors such as Gaussians [Neal, 1995] over parameters are usually the default choice for deep neural networks, and they represent an acceptable description of a priori beliefs.
Recent works have considered more elaborate priors such as spike-and-slab [Polson and Ročková, 2018] and horseshoe priors [Ghosh et al., 2019], as well as more informative parameter priors at the level of function spaces [Vladimirova et al., 2019; Sun et al., 2019; Yang et al., 2019; Louizos et al., 2019; Hafner et al., 2018].
• Scaling-up algorithms for Bayesian deep learning
• Gaining theoretical insight and principled uncertainty quantification for deep learning
Early works
Works by Radford Neal [Neal, 1995] and David MacKay [MacKay, 1992].
They showed in particular the infinite-width Gaussian process property of one-hidden-layer neural networks.
Scaling-up algorithms for Bayesian deep learning
How to deal with the huge dimensionality of Bayesian model averaging? There is a variety of (scalable) approximate inference techniques available:
• Hamiltonian Monte Carlo (not scalable) [Neal, 1995]
• mean-field variational inference [Hinton and Van Camp, 1993; Blundell et al., 2015]
• Monte Carlo dropout [Gal and Ghahramani, 2016]
• exploiting the link between deep networks and Gaussian processes [Lee et al., 2018; Matthews et al., 2018; Khan et al., 2019]
• iterative learning from small mini-batches [Welling and Teh, 2011]
• weight-perturbation approaches [Khan et al., 2018]
• investigating the information contained in the stochastic gradient descent trajectory [Maddox et al., 2019]
• exploiting properties of the loss landscape [Garipov et al., 2018], by focusing on low-dimensional subspaces that capture a large amount of the variability of the posterior distribution [Izmailov et al., 2019]
• applying non-linear transformations for dimensionality reduction [Pradier et al., 2018]
Wide limit behavior of Bayesian Neural Networks
Wide regime: infinite number of hidden units in the layer
Theorem (Neal [1995])
Consider a Bayesian neural network with
(A1) iid Gaussian priors on the weights,
(A2) properly scaled variances, and
(A3) ReLU activation function.
Then, conditional on input x, the marginal prior distribution of a unit u^(2) of the 2nd hidden layer converges to a Gaussian process in the wide regime.

Proof sketch
• u^(1) ∼ N.
• Components of u^(1) are iid ⇒ CLT.
• u^(2) ∼ GP (from the CLT).
• But components of u^(2) are dependent.

[Figure: network diagram x → u^(1) ∼ N → u^(2) ∼ GP, with weight matrices W^(1), W^(2).]
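The CLT argument can be checked numerically: with a wide hidden layer, prior draws of a second-layer unit look nearly Gaussian. A small sketch under assumptions (A1)–(A3), with hypothetical width, input, and 1/√(fan-in) variance scaling:

```python
import numpy as np

rng = np.random.default_rng(2)

H, n_draws = 1_000, 4_000     # hidden width and number of prior draws
x = np.array([1.0, -0.5])     # fixed input

# iid Gaussian weights with 1/sqrt(fan-in) variance scaling.
W1 = rng.normal(scale=1 / np.sqrt(x.size), size=(n_draws, H, x.size))
h1 = np.maximum(W1 @ x, 0.0)                       # first-layer ReLU units
w2 = rng.normal(scale=1 / np.sqrt(H), size=(n_draws, H))
u2 = (w2 * h1).sum(axis=1)                         # second-layer pre-activation, one per draw

# For a nearly Gaussian u2, the excess kurtosis should be close to 0.
z = u2 - u2.mean()
excess_kurtosis = (z**4).mean() / (z**2).mean() ** 2 - 3.0
```

The excess kurtosis shrinks like 1/H, so widening the layer drives the prior of u2 toward Gaussianity.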
Wide regime: extension to deep networks
Lee et al. [2018]; Matthews et al. [2018]
Wide regime: useful for developing new theory
Schoenholz et al. [2017]; Hayou et al. [2019]
Gaussian process approximation
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.
• Prior on weights: w ∼ N(0, σ²) iid
• Initialization is a crucial step in deep NNs
• "Edge of Chaos" initialization can lead to good performance

Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning.
• Prior on weights: w ∼ N(0, σ²) iid
• Gaussian process approximation u ≈ GP(0, K) marginally
• "Edge of Chaos" initialization

Results:
• Smooth activation functions (e.g. ELU) are better than the ReLU activation, especially when the depth is large
• "Edge of Chaos" initialization accelerates training and improves performance
Sub-Weibull distributions
Distribution families with respect to tail behavior
For all k ∈ ℕ, define the k-th moment norm ‖X‖_k = (E|X|^k)^(1/k).

Distribution      Tail                      Moments
Sub-Gaussian      F̄(x) ≤ e^(−λx²)          ‖X‖_k ≤ C √k
Sub-Exponential   F̄(x) ≤ e^(−λx)           ‖X‖_k ≤ C k
Sub-Weibull       F̄(x) ≤ e^(−λx^(1/θ))     ‖X‖_k ≤ C k^θ

Denoted by subW(θ); θ > 0 is called the tail parameter.

[Figure: survival functions P(X ≥ x) of subW(θ) distributions for θ = 1, …, 5.]

See Kuchibhotla and Chakrabortty [2018]; Vladimirova et al. [2020] for sub-Weibull.
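The moment characterization in the table can be verified with closed-form moments: for X ∼ N(0,1), E|X|^k = 2^(k/2) Γ((k+1)/2)/√π, so ‖X‖_k/√k stays bounded; for a Weibull variable with shape 1/θ, E X^k = Γ(1 + kθ), so ‖X‖_k/k^θ stays bounded. A quick numerical check:

```python
from math import gamma, pi, sqrt

def norm_k_gaussian(k):
    """||X||_k = (E|X|^k)^(1/k) for X ~ N(0, 1)."""
    return (2 ** (k / 2) * gamma((k + 1) / 2) / sqrt(pi)) ** (1 / k)

def norm_k_weibull(k, theta):
    """||X||_k for X Weibull with shape 1/theta: E X^k = Gamma(1 + k * theta)."""
    return gamma(1 + k * theta) ** (1 / k)

ks = range(1, 60)
# Sub-Gaussian: ||X||_k <= C sqrt(k); sub-Weibull(2): ||X||_k <= C k^2.
gauss_ratios = [norm_k_gaussian(k) / sqrt(k) for k in ks]
weibull_ratios = [norm_k_weibull(k, theta=2.0) / k**2 for k in ks]
```

Both ratio sequences are bounded above and bounded away from zero, matching the "optimal tail parameter" notion introduced on the next slide.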
Properties of sub-Weibull distributions
• Inclusion: θ₁ ≤ θ₂ implies subW(θ₁) ⊂ subW(θ₂).
In particular: subW(1/2) = sub-Gaussian, subW(1) = sub-Exponential.
• Optimality: the tail parameter θ is optimal if ‖X‖_k ≍ k^θ.
• Closure under multiplication and addition: the product and sum of sub-Weibull r.v.s with tail parameters θ₁ and θ₂ are respectively subW(θ₁ + θ₂) and subW(max(θ₁, θ₂)); these parameters are optimal under independence.
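The closure-under-multiplication property can be illustrated exactly: for independent X, Y ∼ N(0,1), each subW(1/2), independence gives ‖XY‖_k = ‖X‖_k ‖Y‖_k = ‖X‖_k², which grows like k, so the product is subW(1), i.e. sub-Exponential. A short check:

```python
from math import gamma, pi, sqrt

def norm_k_gaussian(k):
    """||X||_k = (E|X|^k)^(1/k) for X ~ N(0, 1)."""
    return (2 ** (k / 2) * gamma((k + 1) / 2) / sqrt(pi)) ** (1 / k)

# By independence, ||XY||_k = ||X||_k * ||Y||_k = ||X||_k ** 2,
# so ||XY||_k / k should stay bounded: the product is subW(1).
product_ratios = [norm_k_gaussian(k) ** 2 / k for k in range(1, 60)]
```

The bounded, non-vanishing ratio confirms that θ₁ + θ₂ = 1 is the optimal tail parameter for this product.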
Weibull-tail versus sub-Weibull
The cumulative distribution function F (or the density f) of a random variable X is a Weibull-tail distribution if for x > 0 it satisfies

F̄(x) = 1 − F(x) = e^(−x^(1/θ) L(x)),  or  f(x) = e^(−x^(1/θ) L̃(x)),

where L(·), L̃(·) are normalised slowly-varying functions and the parameter θ > 0 is called the Weibull-tail coefficient.
• The Weibull-tail property provides a precise measure of the tails, but is useless for moments.
• The sub-Weibull property allows one to handle the moments, but is useless for tail assessment.
Understanding Neural Network Priors at the Units Level
Assumptions on the neural network
To prove that Bayesian neural networks become heavier-tailed with depth, the following assumptions are required.
(A1) Parameters. The weights w have an iid Gaussian prior: w ∼ N(0, σ²).
(A2) Nonlinearity. ReLU-like with the envelope property: there exist c₁, c₂, d₂ ≥ 0 and d₁ > 0 such that

|φ(u)| ≥ c₁ + d₁|u| for all u ∈ ℝ₊ or all u ∈ ℝ₋,
|φ(u)| ≤ c₂ + d₂|u| for all u ∈ ℝ.

Examples: ReLU, ELU, PReLU, etc., but not bounded nonlinearities such as the sigmoid and tanh.
The nonlinearity does not harm the distributional tail:

‖φ(X)‖_k ≍ ‖X‖_k,  k ∈ ℕ.
Main theorem
Theorem (Vladimirova et al., 2019)
Consider a Bayesian neural network with
(A1) iid Gaussian priors on the weights, and
(A2) a nonlinearity satisfying the envelope property.
Then, conditional on input x, the marginal prior distribution of a unit u^(ℓ) of the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2:

π^(ℓ)(u) ∼ subW(ℓ/2).

[Figure: network diagram — input x, hidden layers u^(1), u^(2), u^(3), …, u^(ℓ), with marginal priors subW(1/2), subW(1), subW(3/2), …, subW(ℓ/2).]
[Figure: survival functions P(X ≥ x) of a unit at layers 1, 2, 3, 10 and 100.]
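The heavier-tails-with-depth phenomenon shows up already in small simulations: prior draws of a unit at layer 3 are far more leptokurtic than at layer 1. A sketch under (A1)–(A2), with hypothetical widths, input, and function names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

def prior_unit_draws(depth, H=3, n_draws=100_000):
    """Draw one hidden unit of layer `depth` from the prior of a ReLU network
    with iid N(0, 1) weights (assumptions A1-A2), for a fixed input x."""
    x = np.ones(2)
    h = np.tile(x, (n_draws, 1))
    for _ in range(depth):
        W = rng.normal(size=(n_draws, H, h.shape[1]))   # one weight matrix per draw
        h = np.maximum(np.einsum("nij,nj->ni", W, h), 0.0)
    return h[:, 0]

def excess_kurtosis(u):
    z = u - u.mean()
    return (z**4).mean() / (z**2).mean() ** 2 - 3.0

k1 = excess_kurtosis(prior_unit_draws(1))
k3 = excess_kurtosis(prior_unit_draws(3))
# Tails get heavier with depth: k3 should clearly exceed k1.
```

Already at layer 1 the ReLU unit is non-Gaussian (positive excess kurtosis), and the kurtosis grows rapidly with depth, in line with the subW(ℓ/2) result.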
Proof sketch I
Recall. X ∼ subW(θ) ⟺ ∃C > 0 such that ‖X‖_k = (E|X|^k)^(1/k) ≤ C k^θ for all k ∈ ℕ.
Notations. φ(·) is the nonlinearity, g^(ℓ) the pre-nonlinearity, h^(ℓ) the post-nonlinearity:

g^(1)(x) = W^(1) x,  h^(1)(x) = φ(g^(1)),
g^(ℓ)(x) = W^(ℓ) h^(ℓ−1)(x),  h^(ℓ)(x) = φ(g^(ℓ)),  ℓ ∈ {2, …, L}.

Goal. By induction on the hidden layer depth, we want to show that ‖h^(ℓ)‖_k ≍ k^(ℓ/2).
Proof sketch II
1. Base step: the weights w_i^(1) are iid Gaussian ⇒ ‖w‖_k ≍ k^(1/2); for the 1st layer

‖g^(1)‖_k = ‖Σ_{i=1}^{H₁} w_i^(1) x_i‖_k ≍ k^(1/2).

From the nonlinearity assumption on φ:

‖h^(1)‖_k = ‖φ(g^(1))‖_k ≍ ‖g^(1)‖_k ≍ k^(1/2).

2. Induction step: if g^(ℓ−1), h^(ℓ−1) ∼ subW((ℓ−1)/2), then for the ℓ-th layer

‖g^(ℓ)‖_k = ‖Σ_{i=1}^{H_ℓ} w_i^(ℓ) h_i^(ℓ−1)‖_k ≍^(ℓ) k^(1/2) · k^((ℓ−1)/2) = k^(ℓ/2).

2.1 Lower bound for ≍^(ℓ) by a positive covariance result: ∀s, t, Cov((h^(ℓ−1))^s, (h̃^(ℓ−1))^t) ≥ 0.
2.2 Upper bound for ≍^(ℓ) by Hölder's inequality.
From the nonlinearity assumption on φ:

‖h^(ℓ)‖_k = ‖φ(g^(ℓ))‖_k ≍ ‖g^(ℓ)‖_k ≍ k^(ℓ/2).
Estimation procedure
The CDF of a Weibull random variable with tail parameter θ > 0 and scale parameter λ > 0 is

F(y) = 1 − e^(−(y/λ)^(1/θ)) for all y ≥ 0,

with corresponding quantile function q(t) = λ (−log(1 − t))^θ.
Taking the logarithm,

log q(t) = θ log log(1/(1 − t)) + log λ,

which suggests a simple estimation procedure: regress the largest order statistics against their orders, in log scale.
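The regression just described is easy to implement: sort the sample, keep the largest order statistics, and regress their logs on log log(1/(1 − t)); the slope estimates θ. A minimal sketch (function name and tuning constants are ours), sanity-checked on an exact Weibull sample with θ = 2:

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate_theta(sample, n_top=200):
    """Slope of log(order statistic) vs log(log(1/(1 - t))) over the
    n_top largest order statistics; the slope estimates theta."""
    n = len(sample)
    top = np.sort(sample)[-n_top:]                   # largest order statistics
    t = (np.arange(n - n_top, n) + 0.5) / n          # plotting positions
    xs = np.log(np.log(1.0 / (1.0 - t)))
    slope, _ = np.polyfit(xs, np.log(top), 1)
    return slope

# Weibull with tail parameter theta = 2: survival exp(-x^(1/2)).
# NumPy's `a` argument is the shape parameter, i.e. a = 1/theta.
sample = rng.weibull(a=0.5, size=200_000)
theta_hat = estimate_theta(sample)
```

With λ = 1 the quantile relation is exact, so theta_hat lands near 2 up to sampling noise in the extreme order statistics.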
54. 26/32
Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
Illustration
Neural networks of 10 hidden layers, with H = H hidden units on each layers, and we made H vary in
{1, 3, 10}. With a fixed input x of size 104
, which can be thought of as an image of dimension
100 × 100.
55. 26/32
Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
Illustration
Neural networks of 10 hidden layers, with H = H hidden units on each layers, and we made H vary in
{1, 3, 10}. With a fixed input x of size 104
, which can be thought of as an image of dimension
100 × 100.
Quantile-quantile plot of the largest order statistics in log-log scale.
56. 26/32
Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
Illustration
Neural networks of 10 hidden layers, with H = H hidden units on each layers, and we made H vary in
{1, 3, 10}. With a fixed input x of size 104
, which can be thought of as an image of dimension
100 × 100.
Quantile-quantile plot of the largest order statistics in log-log scale.
qq
q
qqq
qqqqqq
qq
qqqqq
q
qqqq
qqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1.6 1.8 2.0 2.2 2.4
1.01.11.21.31.4
l = 1, H = 1
θ^ = 0.58
q
qqqq
q
qqq
qq
qq
qqqqqqqqq
qqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1.6 1.8 2.0 2.2 2.4
1.52.02.53.0
l = 3, H = 1
θ^ = 1.71
q
q
q
q
qq
q
qqq
qqq
qq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1.6 1.8 2.0 2.2 2.4
12345
l = 10, H = 1
θ^ = 4.37
57. 26/32
Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
Illustration
Neural networks of 10 hidden layers, with H = H hidden units on each layers, and we made H vary in
{1, 3, 10}. With a fixed input x of size 104
, which can be thought of as an image of dimension
100 × 100.
Quantile-quantile plot of the largest order statistics in log-log scale.
q
q
q
q
q
qq
qq
q
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
1.21.41.61.82.02.2
l = 1, H = 3
θ^ = 0.66
q
q
qqq
qq
q
qq
qqq
qqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
4.04.55.0
l = 3, H = 3
θ^ = 1.18
q
qq
qqq
qqqqqq
qqqqqqqqqqqqqq
qqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
12.013.014.0
l = 10, H = 3
θ^ = 2.16
58. 26/32
Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
Illustration
Neural networks of 10 hidden layers, with H = H hidden units on each layers, and we made H vary in
{1, 3, 10}. With a fixed input x of size 104
, which can be thought of as an image of dimension
100 × 100.
Quantile-quantile plot of the largest order statistics in log-log scale.
[QQ plot panel (log-log scale): l = 1, H = 10, estimated tail parameter θ̂ = 0.67]
[QQ plot panel (log-log scale): l = 3, H = 10, estimated tail parameter θ̂ = 0.89]
[QQ plot panel (log-log scale): l = 10, H = 10, estimated tail parameter θ̂ = 1.35]
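The tail behaviour illustrated by these QQ plots can be explored with a small simulation: draw units of a random ReLU network with i.i.d. Gaussian weights and fit the slope of the largest order statistics against log(−log(empirical survival)). Everything below — the 1/√fan-in scaling, the fixed all-ones input, and the `weibull_tail_index` estimator — is an illustrative sketch, not the exact procedure behind the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_samples(n_samples, depth, width, input_dim=100):
    """Pre-activation samples of one unit at layer `depth`, over
    independent draws of i.i.d. Gaussian weights; the input x is fixed."""
    x = np.ones(input_dim)
    out = np.empty(n_samples)
    for s in range(n_samples):
        h = x
        for _ in range(depth - 1):
            W = rng.standard_normal((width, h.shape[0])) / np.sqrt(h.shape[0])
            h = np.maximum(W @ h, 0.0)            # ReLU hidden layers
        w = rng.standard_normal(h.shape[0]) / np.sqrt(h.shape[0])
        out[s] = w @ h                            # pre-activation of one unit
    return out

def weibull_tail_index(samples, k=200):
    """Slope of log(order statistic) against log(-log(empirical survival))
    over the k largest |samples|; for a Weibull-like tail
    P(|X| > x) ~ exp(-x^(1/theta)) this slope approximates theta."""
    z = np.sort(np.abs(samples))[-k:]
    n = samples.size
    surv = np.arange(k, 0, -1) / (n + 1.0)   # survival at each upper order statistic
    slope, _ = np.polyfit(np.log(-np.log(surv)), np.log(z), 1)
    return slope
```

On synthetic draws, the estimator gives roughly θ ≈ 1/2 for Gaussian samples and θ ≈ 1 for exponential ones, consistent with the sub-Weibull parameter θ = ℓ/2 growing with depth when applied to unit samples.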
59. 27/32
Interpretation: shrinkage effect
Maximum a posteriori (MAP) estimation is a regularized optimization problem:
$$\max_{W}\ \pi(W\mid \mathcal{D}) \propto L(\mathcal{D}\mid W)\,\pi(W)
\;\iff\; \min_{W}\ -\log L(\mathcal{D}\mid W) - \log \pi(W)
\;\iff\; \min_{W}\ \mathcal{L}(W) + \lambda R(W),$$
where $\mathcal{L}(W)$ is a loss function and $R(W)$ is a regularizer, typically a norm on $\mathbb{R}^p$.
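A minimal numerical instance of this equivalence, for Gaussian-noise regression with a Gaussian prior (the data and function names below are made up for illustration): the negative log posterior is exactly a squared loss plus an L2 penalty, and its minimizer is the ridge-regression solution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy regression data D = (X, y)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

lam = 0.7  # regularization strength, playing the role of lambda above

def neg_log_posterior(w):
    """-log L(D|w) - log pi(w) for unit-variance Gaussian noise and a
    Gaussian prior, up to additive constants: L(w) + lam * R(w)."""
    return 0.5 * np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Setting the gradient to zero gives (X'X + 2*lam*I) w = X'y,
# i.e. the MAP estimate is exactly the ridge-regression solution:
w_map = np.linalg.solve(X.T @ X + 2 * lam * np.eye(3), X.T @ y)
```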
[Figure: four panels in the (U1, U2) plane, axes from −1 to 1, titled Layer 1, Layer 2, Layer 3, and Layer 10]
60. 28/32
MAP on weights W is weight decay
Gaussian prior on the weights:
$$\pi(W) = \prod_{\ell=1}^{L} \prod_{i,j} e^{-\frac{1}{2}\big(W^{(\ell)}_{i,j}\big)^2}$$
Equivalent to the weight decay penalty ($L_2$):
$$R(W) = \sum_{\ell=1}^{L} \sum_{i,j} \big(W^{(\ell)}_{i,j}\big)^2 = \|W\|_2^2$$
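This equivalence is easy to check numerically; the sketch below (function names are illustrative, not from the slides) compares the negative log of the Gaussian prior with the weight-decay penalty on a toy set of layer matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy network weights: a list of layer matrices W^(l).
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]

def neg_log_gaussian_prior(weights):
    """-log prod_{l,i,j} exp(-(W^(l)_ij)^2 / 2), dropping the
    normalizing constant of the standard normal density."""
    return sum(0.5 * np.sum(W ** 2) for W in weights)

def weight_decay(weights, lam=0.5):
    """Weight decay / L2 penalty: lam * ||W||_2^2 over all layers."""
    return lam * sum(np.sum(W ** 2) for W in weights)

# The two agree exactly when lam = 1/2:
assert np.isclose(neg_log_gaussian_prior(weights), weight_decay(weights))
```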
61. 29/32
MAP on units U: regularization scheme
Marginal distributions: a weight distribution
$$\pi(w) \approx e^{-w^2}$$
implies the $\ell$-th layer unit distribution
$$\pi^{(\ell)}(u) \approx e^{-u^{2/\ell}}.$$
Sklar’s representation theorem:
$$\pi(U) = \Bigg[\prod_{\ell=1}^{L}\prod_{m=1}^{H} \pi^{(\ell)}_m\big(U^{(\ell)}_m\big)\Bigg]\, C(F(U)),$$
where $C$ is the copula of $U$ (which characterizes all the dependence between the units). Hence
$$R(U) = -\sum_{\ell=1}^{L}\sum_{m=1}^{H} \log \pi^{(\ell)}_m\big(U^{(\ell)}_m\big) - \log C(F(U))
\approx \sum_{\ell=1}^{L}\sum_{m=1}^{H} \big|U^{(\ell)}_m\big|^{2/\ell} - \log C(F(U))
\approx \big\|U^{(1)}\big\|_2^2 + \big\|U^{(2)}\big\|_1 + \cdots + \big\|U^{(L)}\big\|_{2/L}^{2/L} - \log C(F(U)).$$
64. 30/32
MAP on units U: regularization scheme
Regularizer:
$$R(U) \approx \big\|U^{(1)}\big\|_2^2 + \big\|U^{(2)}\big\|_1 + \cdots + \big\|U^{(L)}\big\|_{2/L}^{2/L} - \log C(F(U)).$$
Layer | Penalty on W            | Penalty on U
------|-------------------------|----------------------------------------------
1     | ‖W^(1)‖₂², L2           | ‖U^(1)‖₂², L2 (weight decay)
2     | ‖W^(2)‖₂², L2           | ‖U^(2)‖₁, L1 (Lasso)
ℓ     | ‖W^(ℓ)‖₂², L2           | ‖U^(ℓ)‖_{2/ℓ}^{2/ℓ}, L_{2/ℓ}
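The marginal part of R(U) in the table can be written down directly; the sketch below (the helper name is an assumption for illustration) applies the $L_{2/\ell}$ penalty layer by layer.

```python
import numpy as np

def unit_regularizer(units):
    """Marginal part of R(U): sum over layers l of ||U^(l)||_{2/l}^{2/l}.
    `units` is a list [U^(1), ..., U^(L)] of per-layer unit vectors.
    Layer 1 contributes an L2 (weight-decay-like) term, layer 2 an L1
    (Lasso) term, and deeper layers increasingly sparsity-inducing
    L_{2/l} terms; the copula term -log C(F(U)) is omitted."""
    return sum(np.sum(np.abs(U) ** (2.0 / l))
               for l, U in enumerate(units, start=1))
```

For example, with the same vector at two layers, the first contributes a smooth squared term and the second an absolute-value (Lasso) term, so deeper units are pushed toward sparsity.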
[Figure: four panels in the (U1, U2) plane, axes from −1 to 1, titled Layer 1, Layer 2, Layer 3, and Layer 10]
65. 31/32
Conclusion
(i) We defined the notions of sub-Weibull and Weibull-tail distributions, which are characterized by
tails lighter than (or as light as) those of Weibull distributions.
(ii) We proved that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offered an interpretation from a regularization viewpoint.
Main references:
• Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian
Neural Networks at the Unit Level.
ICML
https://arxiv.org/abs/1810.05193
• Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions:
generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed distributions.
Stat
https://arxiv.org/abs/1905.04955
66. 32/32
References
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. International Conference on Machine Learning.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural
Information Processing Systems, pages 8789–8798.
Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20(182):1–46.
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. (2018). Reliable uncertainty estimates in deep neural networks using noise contrastive priors.
Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning.
Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on
Computational learning theory, pages 5–13.
Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019). Subspace inference for Bayesian deep learning. arXiv preprint arXiv:1907.07504.
Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and scalable Bayesian deep learning by weight-perturbation in Adam. arXiv preprint
arXiv:1806.04854.
Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. (2019). Approximate inference turns deep networks into Gaussian processes. In Advances in Neural Information Processing
Systems, pages 3088–3098.
Kuchibhotla, A. K. and Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv
preprint arXiv:1804.02605.
Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning
Representations.
Louizos, C., Shi, X., Schutte, K., and Welling, M. (2019). The functional neural process. In Advances in Neural Information Processing Systems, pages 8743–8754.
MacKay, D. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information
Processing Systems, pages 13132–13143.
Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning
Representations, volume 1804.11271.
Neal, R. (1995). Bayesian learning for neural networks. Springer.
Polson, N. G. and Ročková, V. (2018). Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941.
Pradier, M. F., Pan, W., Yao, J., Ghosh, S., and Doshi-Velez, F. (2018). Latent projection bnns: Avoiding weight-space pathologies by learning latent representations of neural
network weights. arXiv preprint arXiv:1811.07006.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779.
Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions: generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed
distributions. Stat.
Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian Neural Networks at the Unit Level. ICML.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning
(ICML-11), pages 681–688.