Understanding Neural Networks Priors at the Units Level
&
Sub-Weibull property
Julyan Arbel
julyan.arbel@inria.fr, www.julyanarbel.com
Joint work with Mariia Vladimirova, Stéphane Girard and others
Inria Grenoble Rhône-Alpes
UCD Stat Seminar – November 23, 2020
A motivating example
Outline
Introduction to Bayesian Deep Learning
Wide limit behavior of Bayesian Neural Networks
Sub-Weibull distributions
Understanding Neural Networks Priors at the Units Level
Bayesian approach to DL
The distinguishing feature of the Bayesian approach is marginalization instead of optimization.
Bayesian model averaging (BMA): we want to obtain a predictive distribution for x given data D:

p(x | D) = ∫_Θ p(x | θ) p(θ | D) dθ,

where p(x | θ) is the model and p(θ | D) the posterior.

This can also be a conditional predictive distribution if we are in a regression or classification problem:

p(y | x, D) = ∫_W p(y | x, w) p(w | D) dw.

This is especially hard when the dimension of W is of the order of 10^6.
Cf. Wilson [2020] for a discussion & review.
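In practice the BMA integral is approximated by Monte Carlo over (approximate) posterior samples. Below is a minimal sketch of this averaging step; the posterior samples and the log-likelihood function `log_lik` are placeholders for whatever inference scheme is used, not part of the talk.

```python
import numpy as np

def bma_predictive(x, y_grid, posterior_weight_samples, log_lik):
    """Monte Carlo approximation of p(y | x, D) = ∫ p(y | x, w) p(w | D) dw.

    posterior_weight_samples: draws w_s ~ p(w | D) (assumed given)
    log_lik(y, x, w): log p(y | x, w) for the network with weights w (assumed given)
    """
    dens = np.zeros(len(y_grid))
    for w in posterior_weight_samples:
        dens += np.exp([log_lik(y, x, w) for y in y_grid])  # per-sample predictive density
    return dens / len(posterior_weight_samples)             # average over posterior draws
```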
Uncertainty
Epistemic uncertainty, also known as model uncertainty, represents uncertainty over which hypothesis (or parameter) is correct given the amount of available data.
Aleatoric uncertainty: essentially, noise coming from the data measurements.
A Bayesian approach thus accounts for epistemic uncertainty in a principled way: this uncertainty is carried over to the posterior distribution on parameter space.
Link between Bayesian learning and regularized MLE
The Maximum a Posteriori (MAP) estimator is a penalized Maximum Likelihood Estimator (MLE).
It is akin to a Bayesian estimator under the 0-1 loss, but it is not Bayesian: it is obtained by optimization, not marginalization.

max_W π(W | D) ∝ L(D | W) π(W)
⟺ min_W −log L(D | W) − log π(W)
⟺ min_W L(W) + λ R(W)

where L(W) is a loss function and R(W), typically a norm on R^p, is the regularizer.
[Figure: balls of the unit-level penalties in the (U1, U2) plane for layers 1, 2, 3 and 10.]
Challenges of Bayesian Deep Learning
• Striving to build more interpretable parameter priors
Vague priors such as Gaussians [Neal, 1995] over parameters are usually the default choice for deep neural networks, and they represent an acceptable description of a priori beliefs.
Recent works have considered more elaborate priors such as spike-and-slab [Polson and Ročková, 2018] and horseshoe priors [Ghosh et al., 2019], as well as more informative parameter priors specified at the level of function spaces [Vladimirova et al., 2019; Sun et al., 2019; Yang et al., 2019; Louizos et al., 2019; Hafner et al., 2018].
• Scaling-up algorithms for Bayesian deep learning
• Gaining theoretical insight and principled uncertainty quantification for deep learning
Early works
Works by Radford Neal [Neal, 1995] and David MacKay [MacKay, 1992].
In particular, they showed the infinite-width Gaussian process property of one-hidden-layer neural networks.
Scaling-up algorithms for Bayesian deep learning
How can we deal with the huge dimensionality of Bayesian model averaging? A variety of (more or less scalable) approximate inference techniques are available:
• Hamiltonian Monte Carlo (not scalable) [Neal, 1995]
• mean-field variational inference [Hinton and Van Camp, 1993; Blundell et al., 2015]
• Monte Carlo dropout [Gal and Ghahramani, 2016]
• exploring the link between deep networks and Gaussian processes [Lee et al., 2018; Matthews
et al., 2018; Khan et al., 2019],
• iterative learning from small mini-batches [Welling and Teh, 2011],
• using weight-perturbation approaches [Khan et al., 2018],
• investigating the information contained in the stochastic gradient descent trajectory [Maddox
et al., 2019],
• exploiting properties of the loss landscape [Garipov et al., 2018], by focusing on subspaces of low
dimensionality that capture a large amount of the variability of the posterior distribution [Izmailov
et al., 2019],
• applying non-linear transformations for dimensionality reduction [Pradier et al., 2018]
Wide limit behavior of Bayesian Neural Networks
Wide regime: infinite number of hidden units in the layer
Theorem (Neal [1995])
Consider a Bayesian neural network with
(A1) iid Gaussian priors on the weights
(A2) with properly scaled variances and
(A3) ReLU activation function.
Then, conditional on an input x, the marginal prior distribution of a unit u^(2) of the 2nd hidden layer converges to a Gaussian process in the wide regime.
Proof sketch
• u^(1) ∼ N (Gaussian).
• Components of u^(1) are iid ⇒ CLT.
• u^(2) ∼ GP (from the CLT).
• But components of u^(2) are dependent.
[Diagram: input x → hidden layer u^(1) ∼ N → hidden layer u^(2) ∼ GP, with weight matrices W^(1) and W^(2).]
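A minimal numerical sketch of this wide-limit behavior (my own illustration, with an assumed 1/√H scaling of the second-layer weight variance): as the width H grows, the sample excess kurtosis of a second-layer pre-activation approaches 0, its Gaussian value.

```python
import numpy as np
from scipy import stats

def second_layer_unit(x, H, n_samples=10_000, seed=0):
    """Prior samples of one 2nd-layer pre-activation of a one-hidden-layer ReLU network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(n_samples, H, x.size))       # first-layer weights
    h1 = np.maximum(W1 @ x, 0.0)                                  # ReLU hidden units
    w2 = rng.normal(0.0, 1.0 / np.sqrt(H), size=(n_samples, H))   # scaled 2nd-layer weights
    return np.sum(w2 * h1, axis=1)                                # 2nd-layer pre-activation

x = np.ones(5)
for H in (1, 10, 300):
    g2 = second_layer_unit(x, H)
    print(H, stats.kurtosis(g2))  # excess kurtosis → 0 as H grows
```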
Wide regime: extension to deep networks
Lee et al. [2018]; Matthews et al. [2018]
Wide regime: useful for developing new theory
Schoenholz et al. [2017]; Hayou et al. [2019]
Gaussian process approximation
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation.
In International Conference on Learning Representations
• Prior on weights: w ∼ N(0, σ²) iid
• Initialisation is a crucial step in deep NNs
• An "Edge of Chaos" initialization can lead to good performance
Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural
networks training.
In International Conference on Machine Learning
• Prior on weights: w ∼ N(0, σ²) iid
• Gaussian process approximation: u ≈ GP(0, K) marginally
• "Edge of Chaos" initialization
Results:
• Smooth activation functions (e.g. ELU) perform better than the ReLU activation, especially when the depth is large
• "Edge of Chaos" initialization accelerates training and improves performance
Sub-Weibull distributions
Distribution families with respect to tail behavior
For all k ∈ N, denote the k-th moment norm ||X||_k = (E|X|^k)^{1/k}.

Distribution        Tail bound                    Moment bound
Sub-Gaussian        F̄(x) ≤ exp(−λx²)              ||X||_k ≤ C√k
Sub-Exponential     F̄(x) ≤ exp(−λx)               ||X||_k ≤ Ck
Sub-Weibull         F̄(x) ≤ exp(−λx^{1/θ})         ||X||_k ≤ Ck^θ

The sub-Weibull class is denoted by subW(θ), where θ > 0 is called the tail parameter.
[Figure: survival functions P(X ≥ x) for tail parameters θ = 1, 2, 3, 4, 5: larger θ means heavier tails.]
See Kuchibhotla and Chakrabortty [2018]; Vladimirova et al. [2020] for sub-Weibull
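A small numeric check (my own sketch, not part of the slides): a Weibull random variable with survival function exp(−x^{1/θ}) has E|X|^k = Γ(1 + kθ), so the moment characterization ||X||_k ≤ C k^θ can be verified directly, the ratio ||X||_k / k^θ staying bounded.

```python
import numpy as np
from scipy.special import gammaln

# Weibull with survival P(X > x) = exp(-x^(1/theta)) has E|X|^k = Gamma(1 + k*theta),
# hence ||X||_k = Gamma(1 + k*theta)^(1/k); the ratio ||X||_k / k^theta stays bounded.
theta = 2.0
for k in (5, 10, 20, 40, 80):
    log_norm_k = gammaln(1.0 + k * theta) / k          # log ||X||_k
    print(k, np.exp(log_norm_k - theta * np.log(k)))   # decreases towards (theta/e)^theta ≈ 0.54
```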
Properties of sub-Weibull distributions
• Inclusion: θ1 ≤ θ2 implies subW(θ1) ⊂ subW(θ2).
In particular: subW(1/2) = sub-Gaussian, subW(1) = sub-Exponential.
• Optimality: the tail parameter θ is optimal if ||X||_k ≍ k^θ.
• Closure under multiplication and addition: the product and the sum of sub-Weibull random variables with tail parameters θ1 and θ2 are respectively subW(θ1 + θ2) and subW(max(θ1, θ2)); these parameters are optimal under independence (a simulation sketch of the product case follows).
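Simulation sketch of the product case (assumption: two independent standard Gaussians, which are subW(1/2)): their product should behave like subW(1/2 + 1/2) = subW(1), i.e. its moment norms grow roughly linearly in k.

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 10**6))   # two independent subW(1/2) variables
z = x * y                            # expected subW(1/2 + 1/2) = subW(1)

for k in (2, 4, 6, 8):
    norm_k = np.mean(np.abs(z) ** k) ** (1.0 / k)   # empirical ||Z||_k
    print(k, norm_k / k)             # roughly stabilizes (≈ 0.4-0.5), i.e. ||Z||_k ≍ k
```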
Weibull-tail versus sub-Weibull
The cumulative distribution function F (or the density f) of a random variable X is a Weibull-tail distribution if for x > 0 it satisfies

F̄(x) = 1 − F(x) = exp(−x^{1/θ} L(x)),   or   f(x) = exp(−x^{1/θ} L̃(x)),

where L(·), L̃(·) are normalised slowly-varying functions and the parameter θ > 0 is called the Weibull-tail coefficient.

• The Weibull-tail property provides a precise measure of the tails, but is of no help for the moments;
• the sub-Weibull property allows one to handle the moments, but is of no help for precise tail assessment.
Understanding Neural Networks Priors at the Units Level
Assumptions on neural network
To prove that Bayesian neural networks become heavier-tailed with depth, the following assumptions are required:
(A1) Parameters. The weights w have an i.i.d. Gaussian prior, w ∼ N(0, σ²).
(A2) Nonlinearity. ReLU-like with the envelope property: there exist c1, c2, d2 ≥ 0 and d1 > 0 such that
|φ(u)| ≥ c1 + d1|u| for all u ∈ R+ or for all u ∈ R−,
|φ(u)| ≤ c2 + d2|u| for all u ∈ R.
Examples: ReLU, ELU, PReLU, etc., but not bounded activations such as sigmoid and tanh.
Such a nonlinearity does not harm the distributional tail: ||φ(X)||_k ≍ ||X||_k for all k ∈ N.
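A tiny sketch (my own illustration, with assumed constants c1 = 0, d1 = 1 on R+ and c2 = 1, d2 = 1) checking the envelope property numerically for ReLU and ELU:

```python
import numpy as np

activations = {
    "ReLU": lambda u: np.maximum(u, 0.0),
    "ELU": lambda u: np.where(u > 0, u, np.expm1(u)),  # ELU with alpha = 1
}

u_pos = np.linspace(0.0, 10.0, 1001)    # lower envelope checked on R+
u_all = np.linspace(-10.0, 10.0, 2001)  # upper envelope checked on all of R

for name, phi in activations.items():
    lower_ok = np.all(np.abs(phi(u_pos)) >= 0.0 + 1.0 * np.abs(u_pos))  # c1 = 0, d1 = 1
    upper_ok = np.all(np.abs(phi(u_all)) <= 1.0 + 1.0 * np.abs(u_all))  # c2 = 1, d2 = 1
    print(name, lower_ok, upper_ok)      # both True for these constants
```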
Main theorem
Theorem (Vladimirova et al., 2019)
Consider a Bayesian neural network with
(A1) iid Gaussian priors on the weights and
(A2) a nonlinearity satisfying the envelope property.
Then, conditional on the input x, the marginal prior distribution of a unit u^(ℓ) of the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2: π^(ℓ)(u) ∼ subW(ℓ/2).
[Diagram: input x → hidden layers u^(1), u^(2), u^(3), ..., u^(ℓ); their units are respectively subW(1/2), subW(1), subW(3/2), ..., subW(ℓ/2).]
[Figure: survival functions P(X ≥ x) of a unit in the 1st, 2nd, 3rd, 10th and 100th layers: tails become heavier with depth.]
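An illustrative simulation (a sketch with assumed settings: iid N(0, 1/fan-in) weight priors, width 25, a fixed normalized input; not the experiment of the paper): the sample excess kurtosis of a unit's prior distribution grows rapidly with depth, a simple diagnostic of the increasingly heavy tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_samples, width, depth = 20_000, 25, 10
x = np.ones(20) / np.sqrt(20)              # fixed, normalized input (assumed)

h = np.tile(x, (n_samples, 1))             # one copy of the input per prior sample
for layer in range(1, depth + 1):
    fan_in = h.shape[1]
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(n_samples, width, fan_in))
    g = np.einsum("swi,si->sw", W, h)      # pre-nonlinearity g^(l) = W^(l) h^(l-1)
    h = np.maximum(g, 0.0)                 # ReLU post-nonlinearity h^(l)
    print(layer, stats.kurtosis(g[:, 0]))  # excess kurtosis of one unit, grows with depth
```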
Proof sketch I
Recall. X ∼ subW(θ) ⟺ ∃ C > 0 such that ||X||_k = (E|X|^k)^{1/k} ≤ C k^θ for all k ∈ N.

Notations. φ(·) is the nonlinearity, g the pre-nonlinearity, h the post-nonlinearity:
g^(1)(x) = W^(1) x,   h^(1)(x) = φ(g^(1)),
g^(ℓ)(x) = W^(ℓ) h^(ℓ−1)(x),   h^(ℓ)(x) = φ(g^(ℓ)),   ℓ ∈ {2, ..., L}.

Goal. By induction with respect to the hidden layer depth, we want to show that ||h^(ℓ)||_k ≍ k^{ℓ/2}.
Proof sketch II
1. Base step: the weights w_i^(1) are iid Gaussian ⇒ ||w||_k ≍ k^{1/2}; for the 1st layer,
||g^(1)||_k = || Σ_{i=1}^{H_1} w_i^(1) x_i ||_k ≍ k^{1/2}.
From the nonlinearity assumption on φ,
||h^(1)||_k = ||φ(g^(1))||_k ≍ ||g^(1)||_k ≍ k^{1/2}.
2. Induction step: if g^(ℓ−1), h^(ℓ−1) ∼ subW((ℓ−1)/2), then for the ℓ-th layer,
||g^(ℓ)||_k = || Σ_{i=1}^{H_ℓ} w_i^(ℓ) h_i^(ℓ−1) ||_k ≍ k^{1/2} · k^{(ℓ−1)/2} = k^{ℓ/2}.   (*)
2.1 Lower bound in (*) by a positive covariance result: for all s, t, Cov((h^(ℓ−1))^s, (h̃^(ℓ−1))^t) ≥ 0.
2.2 Upper bound in (*) by Hölder's inequality.
From the nonlinearity assumption on φ,
||h^(ℓ)||_k = ||φ(g^(ℓ))||_k ≍ ||g^(ℓ)||_k ≍ k^{ℓ/2}.
Estimation procedure
The CDF of a Weibull random variable with tail parameter θ > 0 and scale parameter λ > 0 is
F(y) = 1 − exp(−(y/λ)^{1/θ}) for all y ≥ 0, with corresponding quantile function q(t) = λ(−log(1 − t))^θ.
Taking the logarithm,
log q(t) = θ log log(1/(1 − t)) + log λ,
which suggests a simple estimation procedure: regress the largest order statistics against their orders, in log scale.
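A minimal implementation sketch of this regression estimator (my own illustration; the fraction of upper order statistics used is an assumed tuning choice):

```python
import numpy as np

def estimate_tail_parameter(samples, top_frac=0.05):
    """Estimate theta from log q(t) = theta * log log(1/(1-t)) + log lambda,
    using the largest order statistics of |samples|."""
    x = np.sort(np.abs(samples))
    n = x.size
    k = max(int(top_frac * n), 10)
    t = (np.arange(n - k, n) + 0.5) / n        # empirical orders of the top-k statistics
    y = np.log(x[n - k:])                      # log of the largest order statistics
    z = np.log(np.log(1.0 / (1.0 - t)))
    theta_hat, _ = np.polyfit(z, y, 1)         # slope of the regression line
    return theta_hat

rng = np.random.default_rng(4)
print(estimate_tail_parameter(rng.weibull(0.5, size=100_000)))  # true theta = 2, estimate ≈ 2
```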
Illustration
Neural networks of 10 hidden layers, with H hidden units on each layer, where H varies in {1, 3, 10}, and a fixed input x of size 10^4 (which can be thought of as an image of dimension 100 × 100).
Quantile-quantile plot of the largest order statistics in log-log scale.
[Figure: QQ-plots of the largest order statistics, in log-log scale, for each (layer ℓ, width H) configuration, with the fitted slope θ̂ reported on each panel.]

Estimated tail parameters θ̂:

Layer      H = 1    H = 3    H = 10
ℓ = 1      0.58     0.66     0.67
ℓ = 3      1.71     1.18     0.89
ℓ = 10     4.37     2.16     1.35
Interpretation: shrinkage effect
The Maximum a Posteriori (MAP) estimator solves a regularized problem:

max_W π(W | D) ∝ L(D | W) π(W)
⟺ min_W −log L(D | W) − log π(W)
⟺ min_W L(W) + λ R(W)

where L(W) is a loss function and R(W), typically a norm on R^p, is the regularizer.
[Figure: balls of the unit-level penalties in the (U1, U2) plane for layers 1, 2, 3 and 10.]
MAP on weights W is weight decay
Gaussian prior on the weights:
π(W) = ∏_{ℓ=1}^{L} ∏_{i,j} exp(−½ (W_{i,j}^{(ℓ)})²).

This is equivalent to the weight decay penalty (L2):
R(W) = Σ_{ℓ=1}^{L} Σ_{i,j} (W_{i,j}^{(ℓ)})² = ||W||₂².
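A one-line numerical check (a sketch): the negative log-density of a standard Gaussian prior equals the weight decay penalty ½||W||₂², up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=100)                                    # a flattened weight vector
neg_log_prior = np.sum(0.5 * W**2 + 0.5 * np.log(2 * np.pi))
weight_decay = 0.5 * np.sum(W**2)
const = 0.5 * W.size * np.log(2 * np.pi)                    # does not depend on W
print(np.allclose(neg_log_prior, weight_decay + const))     # True
```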
MAP on units U: regularization scheme
Marginal distributions:
weight distribution π(w) ≈ exp(−w²)   ⇒   ℓ-th layer unit distribution π^(ℓ)(u) ≈ exp(−u^{2/ℓ}).

Sklar's representation theorem:
π(U) = ∏_{ℓ=1}^{L} ∏_{m=1}^{H_ℓ} π_m^{(ℓ)}(U_m^{(ℓ)}) · C(F(U)),
where C represents the copula of U (which characterizes all the dependence between the units).

R(U) = − Σ_{ℓ=1}^{L} Σ_{m=1}^{H_ℓ} log π_m^{(ℓ)}(U_m^{(ℓ)}) − log C(F(U))
     ≈ Σ_{ℓ=1}^{L} Σ_{m=1}^{H_ℓ} |U_m^{(ℓ)}|^{2/ℓ} − log C(F(U))
     ≈ ||U^{(1)}||₂² + ||U^{(2)}||₁ + ··· + ||U^{(L)}||_{2/L}^{2/L} − log C(F(U)).
MAP on units U: regularization scheme
Regularizer:
R(U) ≈ ||U^{(1)}||₂² + ||U^{(2)}||₁ + ··· + ||U^{(L)}||_{2/L}^{2/L} − log C(F(U)).

Layer   Penalty on W              Penalty on U
1       ||W^{(1)}||₂² (L2)        ||U^{(1)}||₂² (L2, weight decay)
2       ||W^{(2)}||₂² (L2)        ||U^{(2)}||₁ (L1, Lasso)
ℓ       ||W^{(ℓ)}||₂² (L2)        ||U^{(ℓ)}||_{2/ℓ}^{2/ℓ} (L_{2/ℓ})
[Figure: balls of the unit-level penalties in the (U1, U2) plane for layers 1, 2, 3 and 10: with depth they become increasingly star-shaped and sparsity-inducing.]
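A small sketch (my own illustration) of the layer-wise unit penalty ||U^{(ℓ)}||_{2/ℓ}^{2/ℓ}: for two hypothetical unit vectors with the same L2 energy, the spread-out one is penalized increasingly more than the sparse one as the layer index grows, which is the sparsity-inducing effect illustrated by the penalty balls above.

```python
import numpy as np

def unit_penalty(u, layer):
    """Layer-wise unit penalty ||u||_{2/l}^{2/l} = sum_m |u_m|^(2/l)."""
    return np.sum(np.abs(u) ** (2.0 / layer))

dense = np.full(10, 0.1)          # many small coordinates
sparse = np.zeros(10)
sparse[0] = np.sqrt(10) * 0.1     # one large coordinate, same sum of squares as `dense`

for layer in (1, 2, 3, 10):
    print(layer, unit_penalty(dense, layer), unit_penalty(sparse, layer))
# layer 1: equal penalties; deeper layers: the dense vector is penalized much more.
```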
Conclusion
(i) We defined the notions of sub-Weibull and Weibull-tail distributions, which are characterized by tails lighter than (or as light as) Weibull distributions.
(ii) We proved that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offered an interpretation from a regularization viewpoint.
Main references:
• Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian
Neural Networks at the Unit Level.
ICML
https://arxiv.org/abs/1810.05193
• Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions:
generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed distributions.
Stat
https://arxiv.org/abs/1905.04955
References
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. International Conference on Machine Learning.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural
Information Processing Systems, pages 8789–8798.
Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20(182):1–46.
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. (2018). Reliable uncertainty estimates in deep neural networks using noise contrastive priors.
Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning.
Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on
Computational learning theory, pages 5–13.
Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019). Subspace inference for bayesian deep learning. arXiv preprint arXiv:1907.07504.
Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint
arXiv:1806.04854.
Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. (2019). Approximate inference turns deep networks into gaussian processes. In Advances in Neural Information Processing
Systems, pages 3088–3098.
Kuchibhotla, A. K. and Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv
preprint arXiv:1804.02605.
Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Machine
Learning.
Louizos, C., Shi, X., Schutte, K., and Welling, M. (2019). The functional neural process. In Advances in Neural Information Processing Systems, pages 8743–8754.
MacKay, D. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information
Processing Systems, pages 13132–13143.
Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning
Representations, volume 1804.11271.
Neal, R. (1995). Bayesian learning for neural networks. Springer.
Polson, N. G. and Ročková, V. (2018). Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941.
Pradier, M. F., Pan, W., Yao, J., Ghosh, S., and Doshi-Velez, F. (2018). Latent projection bnns: Avoiding weight-space pathologies by learning latent representations of neural
network weights. arXiv preprint arXiv:1811.07006.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779.
Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions: generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed
distributions. Stat.
Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian Neural Networks at the Unit Level. ICML.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning
(ICML-11), pages 681–688.
  • 8. 6/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Link between Bayesian learning and regularized MLE The Maximum a Posteriori (MAP) is a penalized Maximum Likelihood Estimator (MLE)
  • 9. 6/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Link between Bayesian learning and regularized MLE The Maximum a Posteriori (MAP) is a penalized Maximum Likelihood Estimator (MLE) It is akin to a Bayesian estimator under the 0-1 loss, but isn’t Bayesian: it is obtained by optimization, not marginalization.
  • 10. 6/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Link between Bayesian learning and regularized MLE The Maximum a Posteriori (MAP) is a penalized Maximum Likelihood Estimator (MLE) It is akin to a Bayesian estimator under the 0-1 loss, but isn’t Bayesian: it is obtained by optimization, not marginalization. max W π(W|D) ∝ L(D|W)π(W) min W − log L(D|W)− log π(W) min W L(W) + λR(W) L(W) is a loss function, R(W) is typically a norm on Rp , regularizer.
  • 11. 6/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Link between Bayesian learning and regularized MLE. The Maximum a Posteriori (MAP) is a penalized Maximum Likelihood Estimator (MLE). It is akin to a Bayesian estimator under the 0-1 loss, but isn't Bayesian: it is obtained by optimization, not marginalization. max_W π(W|D) ∝ L(D|W) π(W), i.e. min_W −log L(D|W) − log π(W), i.e. min_W L(W) + λR(W), where L(W) is a loss function and R(W) is typically a norm on R^p (a regularizer). [Figure: (U1, U2) plots at layers 1, 2, 3 and 10.]
  • 12. 7/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Challenges of Bayesian Deep Learning • Striving to build more interpretable parameter priors Vague priors such as Gaussian [Neal, 1995] over parameters are usually the default choice for deep neural networks, and they represent an acceptable description of a priori beliefs. Recent works have considered more elaborate priors such as spike and slab [Polson and Roˇckov´a, 2018] and horseshoe priors [Ghosh et al., 2019], and more informative parameter priors at the level of function spaces [Vladimirova et al., 2019; Sun et al., 2019; Yang et al., 2019; Louizos et al., 2019; Hafner et al., 2018].
  • 13. 7/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Challenges of Bayesian Deep Learning • Striving to build more interpretable parameter priors Vague priors such as Gaussian [Neal, 1995] over parameters are usually the default choice for deep neural networks, and they represent an acceptable description of a priori beliefs. Recent works have considered more elaborate priors such as spike and slab [Polson and Roˇckov´a, 2018] and horseshoe priors [Ghosh et al., 2019], and more informative parameter priors at the level of function spaces [Vladimirova et al., 2019; Sun et al., 2019; Yang et al., 2019; Louizos et al., 2019; Hafner et al., 2018]. • Scaling-up algorithms for Bayesian deep learning
  • 14. 7/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Challenges of Bayesian Deep Learning • Striving to build more interpretable parameter priors Vague priors such as Gaussian [Neal, 1995] over parameters are usually the default choice for deep neural networks, and they represent an acceptable description of a priori beliefs. Recent works have considered more elaborate priors such as spike and slab [Polson and Roˇckov´a, 2018] and horseshoe priors [Ghosh et al., 2019], and more informative parameter priors at the level of function spaces [Vladimirova et al., 2019; Sun et al., 2019; Yang et al., 2019; Louizos et al., 2019; Hafner et al., 2018]. • Scaling-up algorithms for Bayesian deep learning • Gaining theoretical insight and principled uncertainty quantification for deep learning
  • 15. 8/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Early works Works by Radford Neal [Neal, 1995] and David MacKay [MacKay, 1992].
  • 16. 8/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Early works Works by Radford Neal [Neal, 1995] and David MacKay [MacKay, 1992].
  • 17. 8/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Early works Works by Radford Neal [Neal, 1995] and David MacKay [MacKay, 1992]. They have shown in particular the infinite width Gaussian process property of 1 hidden layer neural networks.
  • 18. Scaling-up algorithms for Bayesian deep learning How to deal with the huge dimensionality of Bayesian model averaging? There are a variety of [scalable] approximate inference techniques available: • Hamiltonian Monte Carlo (not scalable) [Neal, 1995] • mean-field variational inference [Hinton and Van Camp, 1993; Blundell et al., 2015] • Monte Carlo dropout [Gal and Ghahramani, 2016] • exploring the link between deep networks and Gaussian processes [Lee et al., 2018; Matthews et al., 2018; Khan et al., 2019], • iterative learning from small mini-batches [Welling and Teh, 2011], • using weight-perturbation approaches [Khan et al., 2018], • investigating the information contained in the stochastic gradient descent trajectory [Maddox et al., 2019], • exploiting properties of the loss landscape [Garipov et al., 2018], by focusing on subspaces of low dimensionality that capture a large amount of the variability of the posterior distribution [Izmailov et al., 2019], • applying non-linear transformations for dimensionality reduction [Pradier et al., 2018] 9/32
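To make one of these concrete, here is a minimal NumPy sketch of Monte Carlo dropout in the spirit of Gal and Ghahramani [2016]: dropout is kept active at prediction time and stochastic forward passes are averaged as a crude surrogate for Bayesian model averaging. The toy two-layer ReLU network, its sizes and the dropout rate are illustrative assumptions, not the setup of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network with dropout on the hidden layer (illustrative sizes).
d_in, n_hidden, p_drop = 5, 50, 0.2
W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(n_hidden, d_in))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(1, n_hidden))

def stochastic_forward(x):
    """One forward pass with a fresh dropout mask, kept active at prediction time."""
    h = np.maximum(W1 @ x, 0.0)              # ReLU hidden units
    mask = rng.random(n_hidden) > p_drop     # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)            # inverted-dropout rescaling
    return (W2 @ h)[0]

x = rng.normal(size=d_in)
preds = np.array([stochastic_forward(x) for _ in range(200)])

# Averaging stochastic passes mimics the Bayesian model average; the spread is a
# rough proxy for epistemic uncertainty at this input.
print("predictive mean:", preds.mean())
print("predictive std :", preds.std())
```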
  • 19. 10/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Outline Introduction to Bayesian Deep Learning Wide limit behavior of Bayesian Neural Networks Sub-Weibull distributions Understanding Neural Networks Priors at the Units Level
  • 20. 11/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors
  • 21. 12/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Wide regime: infinite number of hidden units in the layer. Theorem (Neal [1995]). Consider a Bayesian neural network with (A1) iid Gaussian priors on the weights, (A2) with properly scaled variances, and (A3) ReLU activation function. Then, conditional on the input x, the marginal prior distribution of a unit u^(2) of the 2nd hidden layer converges to a Gaussian process in the wide regime. Proof sketch: • u^(1) ∼ N. • Components of u^(1) are iid ⇒ CLT. • u^(2) ∼ GP (from the CLT). • But components of u^(2) are dependent. [Figure: one-hidden-layer network with input x, units u^(1) ∼ N and u^(2) ∼ GP, weight matrices W^(1), W^(2).]
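Neal's wide-limit statement is easy to probe numerically: with first-layer pre-activations distributed as N(0, ‖x‖²) and second-layer weights scaled as N(0, 1/H), a second-layer pre-activation should look increasingly Gaussian as the width H grows. A rough NumPy sketch; the input dimension, widths and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)                     # fixed input (arbitrary dimension)
s = np.linalg.norm(x)                       # 1st-layer pre-activations are iid N(0, s^2)

def second_layer_unit(H, n_draws=10000):
    """Prior samples of one 2nd-layer pre-activation, for hidden width H."""
    g1 = rng.normal(0.0, s, size=(n_draws, H))                   # 1st-layer pre-activations
    h1 = np.maximum(g1, 0.0)                                     # ReLU units
    w2 = rng.normal(0.0, 1.0 / np.sqrt(H), size=(n_draws, H))    # variance scaled by 1/H
    return (w2 * h1).sum(axis=1)

for H in (1, 5, 50, 500):
    u2 = second_layer_unit(H)
    z = (u2 - u2.mean()) / u2.std()
    print(f"H = {H:4d}   excess kurtosis ≈ {(z ** 4).mean() - 3.0: .2f}")  # tends to 0 in the wide limit
```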
  • 23. 13/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Wide regime: extension to deep networks Lee et al. [2018]; Matthews et al. [2018]
  • 24. 14/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Wide regime: useful for developing new theory Schoenholz et al. [2017]; Hayou et al. [2019]
  • 25. 15/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Gaussian process approximation Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations • Prior on weights, w ∼ N(0, σ²) iid • Initialisation is a crucial step in deep NN • ”Edge of Chaos” initialization can lead to good performance Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning • Prior on weights, w ∼ N(0, σ²) iid • Gaussian process approximation u ≈ GP(0, K) marginally • ”Edge of Chaos” initialization Results: • Smooth activation functions (e.g. ELU) are better than the ReLU activation, especially for deep networks • ”Edge of Chaos” initialization accelerates training and improves performance
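These information-propagation analyses track how a unit's variance evolves with depth under the Gaussian process approximation, via the recursion q^(ℓ) = σ_w² E_{z∼N(0,1)}[φ(√q^(ℓ−1) z)²] + σ_b². Below is a rough Monte Carlo sketch of this recursion for ReLU, with illustrative values of σ_w² and σ_b² (for ReLU, σ_w² = 2 with σ_b² = 0 is the critical choice at which the variance neither vanishes nor explodes).

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=200000)            # Monte Carlo nodes for the Gaussian expectation

def propagate_variance(sigma_w2, sigma_b2, q0=1.0, depth=20, phi=lambda u: np.maximum(u, 0.0)):
    """Iterate q_l = sigma_w^2 * E[phi(sqrt(q_{l-1}) z)^2] + sigma_b^2."""
    q = q0
    qs = [q]
    for _ in range(depth):
        q = sigma_w2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b2
        qs.append(q)
    return qs

for sigma_w2 in (1.0, 2.0, 3.0):       # below, at, and above the ReLU edge of chaos
    qs = propagate_variance(sigma_w2, 0.0)
    print(f"sigma_w^2 = {sigma_w2}: variance after 20 layers ≈ {qs[-1]:.3e}")
```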
  • 26. 16/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Outline Introduction to Bayesian Deep Learning Wide limit behavior of Bayesian Neural Networks Sub-Weibull distributions Understanding Neural Networks Priors at the Units Level
  • 27. 17/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Distribution families with respect to tail behavior. For all k ∈ N, the k-th moment norm is ‖X‖_k = (E|X|^k)^{1/k}.
Sub-Gaussian: tail F̄(x) ≤ e^{−λx²}, moments ‖X‖_k ≤ C√k.
Sub-Exponential: tail F̄(x) ≤ e^{−λx}, moments ‖X‖_k ≤ Ck.
Sub-Weibull: tail F̄(x) ≤ e^{−λx^{1/θ}}, moments ‖X‖_k ≤ Ck^θ.
Denoted subW(θ), where θ > 0 is called the tail parameter. [Figure: survival functions P(X ≥ x) of sub-Weibull distributions for tail parameters θ = 1, …, 5.] See Kuchibhotla and Chakrabortty [2018]; Vladimirova et al. [2020] for sub-Weibull.
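The moment column of this table can be checked directly: for a law with known absolute moments, the slope of log ‖X‖_k against log k recovers the tail parameter. A small sketch using exact moment formulas; the three reference laws and the range of k are illustrative choices.

```python
import numpy as np
from math import lgamma, log, pi

ks = np.arange(50, 501, 10)          # large k, where the k^theta growth is visible

def slope(log_norms):
    """Least-squares slope of log ||X||_k against log k, i.e. the tail parameter."""
    return np.polyfit(np.log(ks), log_norms, 1)[0]

# Exact moment norms log ||X||_k = log(E|X|^k) / k for three reference laws.
gauss = [(0.5 * k * log(2) + lgamma((k + 1) / 2) - 0.5 * log(pi)) / k for k in ks]  # |N(0,1)|
expo  = [lgamma(k + 1) / k for k in ks]                                             # Exp(1)
weib2 = [lgamma(2 * k + 1) / k for k in ks]                                         # Weibull, theta = 2

print("Gaussian    slope ≈", round(slope(gauss), 2))   # ≈ 0.5  (sub-Gaussian)
print("Exponential slope ≈", round(slope(expo), 2))    # ≈ 1    (sub-Exponential)
print("Weibull     slope ≈", round(slope(weib2), 2))   # ≈ 2    (subW(2))
```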
  • 29. 18/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Properties of sub-Weibull distributions • Inclusion: θ1 ≤ θ2 implies subW(θ1) ⊂ subW(θ2). In particular: subW(1/2) = sub-Gaussian, subW(1) = sub-Exponential. • Optimality: the tail parameter θ is optimal if ‖X‖_k ≍ k^θ. • Closure under multiplication and addition: the product and sum of sub-Weibull r.v. with tail parameters θ1 and θ2 are resp. subW(θ1 + θ2) and subW(max(θ1, θ2)); optimal under independence.
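A quick sanity check of the multiplication rule: the product of two independent standard Gaussians, each subW(1/2), should have moment norms growing like k, i.e. be subW(1). A short continuation of the previous sketch, again using exact Gaussian moments.

```python
import numpy as np
from math import lgamma, log, pi

ks = np.arange(50, 501, 10)

# log ||Z||_k for |N(0,1)| (exact); for a product of two independent copies,
# E|Z1 Z2|^k = (E|Z|^k)^2, so the log moment norm simply doubles.
log_norm_Z  = np.array([(0.5 * k * log(2) + lgamma((k + 1) / 2) - 0.5 * log(pi)) / k for k in ks])
log_norm_ZZ = 2.0 * log_norm_Z

print("theta(Z)     ≈", round(np.polyfit(np.log(ks), log_norm_Z, 1)[0], 2))   # ≈ 0.5
print("theta(Z1 Z2) ≈", round(np.polyfit(np.log(ks), log_norm_ZZ, 1)[0], 2))  # ≈ 1: subW(1/2) times subW(1/2) is subW(1)
```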
  • 32. 19/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Weibull-tail versus sub-Weibull. The cumulative distribution function F (or probability density function f) of a random variable X is a Weibull-tail distribution if for x > 0 it satisfies F̄(x) = 1 − F(x) = e^{−x^{1/θ} L(x)}, or f(x) = e^{−x^{1/θ} L̃(x)}, where L(·), L̃(·) are normalised slowly-varying functions and the parameter θ > 0 is called the Weibull-tail coefficient. • The Weibull-tail property provides a precise description of the tails but gives no handle on the moments; the sub-Weibull property handles the moments but gives no precise tail assessment.
  • 34. 20/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Outline Introduction to Bayesian Deep Learning Wide limit behavior of Bayesian Neural Networks Sub-Weibull distributions Understanding Neural Networks Priors at the Units Level
  • 35. 21/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Assumptions on the neural network. To prove that Bayesian neural networks become heavier-tailed with depth, the following assumptions are required: (A1) Parameters. The weights w have an i.i.d. Gaussian prior w ∼ N(0, σ²). (A2) Nonlinearity. ReLU-like with the envelope property: there exist c1, c2, d2 ≥ 0 and d1 > 0 such that |φ(u)| ≥ c1 + d1|u| for all u ∈ R+ or all u ∈ R−, and |φ(u)| ≤ c2 + d2|u| for all u ∈ R. Examples: ReLU, ELU, PReLU, etc., but not bounded nonlinearities such as sigmoid and tanh. The nonlinearity does not harm the distributional tail: ‖φ(X)‖_k ≍ ‖X‖_k, k ∈ N.
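The envelope condition in (A2) is easy to verify numerically on a grid for concrete activations; below is a small check for ReLU and ELU, using the illustrative constants c1 = c2 = 0, d1 = d2 = 1 and α = 1 for ELU.

```python
import numpy as np

u = np.linspace(-20, 20, 100001)

relu = np.maximum(u, 0.0)
alpha = 1.0
elu = np.where(u >= 0, u, alpha * (np.exp(u) - 1.0))

# Upper envelope |phi(u)| <= c2 + d2 |u| on all of R (here c2 = 0, d2 = 1 works for both).
print("ReLU upper bound holds:", np.all(np.abs(relu) <= np.abs(u) + 1e-12))
print("ELU  upper bound holds:", np.all(np.abs(elu)  <= np.abs(u) + 1e-12))

# Lower envelope |phi(u)| >= c1 + d1 |u| on R+ (c1 = 0, d1 = 1: both equal the identity there).
pos = u >= 0
print("ReLU lower bound on R+:", np.all(np.abs(relu[pos]) >= np.abs(u[pos]) - 1e-12))
print("ELU  lower bound on R+:", np.all(np.abs(elu[pos])  >= np.abs(u[pos]) - 1e-12))
```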
  • 38. 22/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Main theorem. Theorem (Vladimirova et al., 2019). Consider a Bayesian neural network with (A1) iid Gaussian priors on the weights and (A2) a nonlinearity satisfying the envelope property. Then, conditional on the input x, the marginal prior distribution of a unit u^(ℓ) of the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2: π^(ℓ)(u) ∼ subW(ℓ/2). [Figure: network diagram with units u^(1), u^(2), u^(3), …, u^(ℓ) in the 1st, 2nd, 3rd, …, ℓ-th hidden layers, marked subW(1/2), subW(1), subW(3/2), …, subW(ℓ/2); and survival functions P(X ≥ x) of the unit priors at layers 1, 2, 3, 10 and 100, showing heavier tails with depth.]
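The heavier-tails-with-depth phenomenon can be seen in a small simulation: draw many prior realisations of a ReLU network with He-scaled Gaussian weights on a fixed input and track tail statistics of a single unit per layer. A rough NumPy sketch; the network sizes, the weight scaling and the statistics reported are illustrative choices, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(4)
n_draws, width, depth = 30_000, 25, 5
x = rng.normal(size=25)                                  # fixed input (arbitrary)

h = np.tile(x, (n_draws, 1))                             # layer-0 "units" = the input, one copy per prior draw
for layer in range(1, depth + 1):
    fan_in = h.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(n_draws, width, fan_in))  # He-scaled Gaussian prior
    h = np.maximum(np.einsum('nij,nj->ni', W, h), 0.0)   # ReLU units of the current layer
    u = h[:, 0]                                          # prior samples of one unit in this layer
    z = (u - u.mean()) / u.std()
    print(f"layer {layer}: excess kurtosis ≈ {(z ** 4).mean() - 3.0:10.1f}, "
          f"99.9% quantile / std ≈ {np.quantile(z, 0.999):5.1f}")   # both grow with depth: heavier tails
```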
  • 43. 23/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Proof sketch I. Recall: X ∼ subW(θ) ⟺ ∃ C > 0 such that ‖X‖_k = (E|X|^k)^{1/k} ≤ C k^θ for all k ∈ N. Notations: φ(·) is the nonlinearity, g the pre-nonlinearity and h the post-nonlinearity: g^(1)(x) = W^(1) x, h^(1)(x) = φ(g^(1)); g^(ℓ)(x) = W^(ℓ) h^(ℓ−1)(x), h^(ℓ)(x) = φ(g^(ℓ)), ℓ ∈ {2, . . . , L}. Goal: by induction on the hidden layer depth, show that ‖h^(ℓ)‖_k ≍ k^{ℓ/2}.
  • 46. 24/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Proof sketch II. 1. Base step: the weights w^(1)_i are iid Gaussian ⇒ ‖w‖_k ≍ k^{1/2}; for the 1st layer, ‖g^(1)‖_k = ‖Σ_{i=1}^{H_1} w^(1)_i x_i‖_k ≍ k^{1/2}, and from the assumption on the nonlinearity φ, ‖h^(1)‖_k = ‖φ(g^(1))‖_k ≍ ‖g^(1)‖_k ≍ k^{1/2}. 2. Induction step: if g^(ℓ−1), h^(ℓ−1) ∼ subW((ℓ−1)/2), then for the ℓ-th layer, ‖g^(ℓ)‖_k = ‖Σ_{i=1}^{H_ℓ} w^(ℓ)_i h^(ℓ−1)_i‖_k ≍ k^{1/2} · k^{(ℓ−1)/2} = k^{ℓ/2}. 2.1 The lower bound (for the ℓ-th layer equivalence) follows from a positive covariance result: ∀ s, t, Cov(h^(ℓ−1)_s, h̃^(ℓ−1)_t) ≥ 0. 2.2 The upper bound follows from Hölder's inequality. From the assumption on the nonlinearity φ, ‖h^(ℓ)‖_k = ‖φ(g^(ℓ))‖_k ≍ ‖g^(ℓ)‖_k ≍ k^{ℓ/2}.
  • 52. 25/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Estimation procedure. The CDF of a Weibull random variable with tail parameter θ > 0 and scale parameter λ > 0 is F(y) = 1 − e^{−(y/λ)^{1/θ}} for all y ≥ 0, with corresponding quantile function q(t) = λ(−log(1 − t))^θ. Taking the logarithm, log q(t) = θ log log(1/(1 − t)) + log λ, which suggests a simple estimation procedure by regressing the largest order statistics against their orders, in log scale.
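A minimal implementation of this regression estimator, checked on exact Weibull samples with known tail parameter θ = 2; the sample size and the number of retained order statistics are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def estimate_theta(samples, n_top=500):
    """Regress the log of the largest order statistics on log log(1/(1-t)) to estimate theta."""
    x = np.sort(samples)
    n = x.size
    i = np.arange(n - n_top, n)                     # indices of the n_top largest values
    t = (i + 1) / (n + 1)                           # plotting positions of their orders
    y = np.log(x[i])                                # log order statistics
    z = np.log(np.log(1.0 / (1.0 - t)))             # regressor log log(1/(1-t))
    theta_hat, _ = np.polyfit(z, y, 1)
    return theta_hat

theta_true = 2.0
samples = rng.weibull(1.0 / theta_true, size=200_000)    # Weibull with tail parameter theta_true
print("theta_hat ≈", round(estimate_theta(samples), 2))  # ≈ 2
```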
  • 54. 26/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Illustration. Neural networks of 10 hidden layers, with H_ℓ = H hidden units in each layer, where H varies in {1, 3, 10}, and a fixed input x of size 10^4, which can be thought of as an image of dimension 100 × 100. Quantile-quantile plots of the largest order statistics in log-log scale. [Figure: QQ plots and fitted tail parameters per layer ℓ and width H: (ℓ = 1, H = 1): θ̂ = 0.58; (ℓ = 3, H = 1): θ̂ = 1.71; (ℓ = 10, H = 1): θ̂ = 4.37; (ℓ = 1, H = 3): θ̂ = 0.66; (ℓ = 3, H = 3): θ̂ = 1.18; (ℓ = 10, H = 3): θ̂ = 2.16; (ℓ = 1, H = 10): θ̂ = 0.67; (ℓ = 3, H = 10): θ̂ = 0.89; (ℓ = 10, H = 10): θ̂ = 1.35.]
  • 59. 27/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Interpretation: shrinkage effect. The Maximum a Posteriori (MAP) is a regularized problem: max_W π(W|D) ∝ L(D|W) π(W), i.e. min_W −log L(D|W) − log π(W), i.e. min_W L(W) + λR(W), where L(W) is a loss function and R(W) is typically a norm on R^p (a regularizer). [Figure: (U1, U2) plots at layers 1, 2, 3 and 10.]
  • 60. 28/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors MAP on weights W is weight decay. Gaussian prior on the weights: π(W) = ∏_{ℓ=1}^{L} ∏_{i,j} e^{−(W^(ℓ)_{i,j})²/2}. Equivalent to the weight decay penalty (L2): R(W) = Σ_{ℓ=1}^{L} Σ_{i,j} (W^(ℓ)_{i,j})² = ‖W‖_2^2.
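The MAP/weight-decay correspondence is easiest to verify in a linear-Gaussian toy model rather than a neural network: gradient descent on −log L(D|w) − log π(w) with a Gaussian prior should land on the ridge (weight-decay) solution with λ = σ²/τ². A small sketch under these illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
sigma, tau = 0.5, 1.0                                    # noise std and prior std (illustrative)
y = X @ w_true + sigma * rng.normal(size=n)

# Weight-decay (ridge) solution in closed form, with lambda = sigma^2 / tau^2.
lam = sigma ** 2 / tau ** 2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# MAP by gradient descent on -log L(D|w) - log pi(w) (up to additive constants).
w = np.zeros(p)
lr = 5e-4
for _ in range(5000):
    grad_nll = X.T @ (X @ w - y) / sigma ** 2            # from the Gaussian likelihood
    grad_prior = w / tau ** 2                            # from the Gaussian prior: the weight-decay term
    w -= lr * (grad_nll + grad_prior)

print("MAP equals weight decay:", np.allclose(w, w_ridge, atol=1e-6))
```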
  • 61. 29/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors MAP on units U: regularization scheme. Marginal distributions: the weight distribution π(w) ≈ e^{−w²} implies the ℓ-th layer unit distribution π^(ℓ)(u) ≈ e^{−u^{2/ℓ}}. Sklar's representation theorem: π(U) = [∏_{ℓ=1}^{L} ∏_{m=1}^{H_ℓ} π^(ℓ)_m(U^(ℓ)_m)] · C(F(U)), where C represents the copula of U (which characterizes all the dependence between the units). Hence R(U) = −Σ_{ℓ=1}^{L} Σ_{m=1}^{H_ℓ} log π^(ℓ)_m(U^(ℓ)_m) − log C(F(U)) ≈ Σ_{ℓ=1}^{L} Σ_{m=1}^{H_ℓ} |U^(ℓ)_m|^{2/ℓ} − log C(F(U)) ≈ ‖U^(1)‖_2^2 + ‖U^(2)‖_1 + · · · + ‖U^(L)‖_{2/L}^{2/L} − log C(F(U)).
  • 64. 30/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors MAP on units U: regularization scheme. Regularizer: R(U) ≈ ‖U^(1)‖_2^2 + ‖U^(2)‖_1 + · · · + ‖U^(L)‖_{2/L}^{2/L} − log C(F(U)). Layer-by-layer penalties: layer 1: ‖W^(1)‖_2^2 (L2) on the weights and ‖U^(1)‖_2^2 (L2, weight decay) on the units; layer 2: ‖W^(2)‖_2^2 (L2) on the weights and ‖U^(2)‖_1 (L1, Lasso) on the units; layer ℓ: ‖W^(ℓ)‖_2^2 (L2) on the weights and ‖U^(ℓ)‖_{2/ℓ}^{2/ℓ} (L_{2/ℓ}) on the units. [Figure: (U1, U2) plots at layers 1, 2, 3 and 10.]
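Ignoring the copula term, the induced unit-level regularizer can be evaluated directly; a toy sketch of R(U) ≈ Σ_ℓ ‖U^(ℓ)‖_{2/ℓ}^{2/ℓ} with arbitrary unit values, showing how the exponent 2/ℓ decays with depth.

```python
import numpy as np

def unit_penalty(units_per_layer):
    """R(U) approx sum_l ||U^(l)||_{2/l}^{2/l}, dropping the copula term."""
    return sum(np.sum(np.abs(u) ** (2.0 / l)) for l, u in enumerate(units_per_layer, start=1))

rng = np.random.default_rng(7)
units = [rng.normal(size=10) for _ in range(4)]          # toy unit values for layers 1..4 (arbitrary)
print("R(U) ≈", round(unit_penalty(units), 2))

# The exponent 2/l shrinks with depth: L2 on layer 1, L1 on layer 2, then L_{2/3}, L_{1/2}, ...
for l in range(1, 5):
    print(f"layer {l}: exponent 2/l = {2.0 / l:.2f}")
```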
  • 65. 31/32 Introduction to BDL Wide limit behavior of BNN Sub-Weibull distributions Understanding NN Priors Conclusion (i) We defined the notion of sub-Weibull and Weibull-tail distributions, which are characterized by tails lighter than (or as light as) Weibull distributions. (ii) We proved that the marginal prior distributions of the units are heavier-tailed as depth increases. (iii) We offered an interpretation from a regularization viewpoint. Main references: • Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian Neural Networks at the Unit Level. ICML https://arxiv.org/abs/1810.05193 • Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions: generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed distributions. Stat https://arxiv.org/abs/1905.04955
  • 66. 32/32 References
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In International Conference on Machine Learning.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, pages 8789–8798.
Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20(182):1–46.
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. (2018). Reliable uncertainty estimates in deep neural networks using noise contrastive priors.
Hayou, S., Doucet, A., and Rousseau, J. (2019). On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning.
Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13.
Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. (2019). Subspace inference for Bayesian deep learning. arXiv preprint arXiv:1907.07504.
Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and scalable Bayesian deep learning by weight-perturbation in Adam. arXiv preprint arXiv:1806.04854.
Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. (2019). Approximate inference turns deep networks into Gaussian processes. In Advances in Neural Information Processing Systems, pages 3088–3098.
Kuchibhotla, A. K. and Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605.
Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Machine Learning.
Louizos, C., Shi, X., Schutte, K., and Welling, M. (2019). The functional neural process. In Advances in Neural Information Processing Systems, pages 8743–8754.
MacKay, D. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143.
Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations. arXiv:1804.11271.
Neal, R. (1995). Bayesian learning for neural networks. Springer.
Polson, N. G. and Ročková, V. (2018). Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941.
Pradier, M. F., Pan, W., Yao, J., Ghosh, S., and Doshi-Velez, F. (2018). Latent projection BNNs: Avoiding weight-space pathologies by learning latent representations of neural network weights. arXiv preprint arXiv:1811.07006.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779.
Vladimirova, M., Girard, S., Nguyen, H. D., and Arbel, J. (2020). Sub-Weibull distributions: generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed distributions. Stat.
Vladimirova, M., Verbeek, J., Mesejo, P., and Arbel, J. (2019). Understanding Priors in Bayesian Neural Networks at the Unit Level. ICML.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.