The document outlines the PAC-Bayesian bound for deep learning. It discusses how the PAC-Bayesian bound provides a generalization guarantee that depends on the KL divergence between the prior and posterior distributions over hypotheses. This allows the bound to account for factors like model complexity and noise in the training data, avoiding some limitations of other generalization bounds. The document also explains how the PAC-Bayesian bound can be applied to stochastic neural networks by placing distributions over the network weights.
4. Learning Theory
• Over-fitting: $\hat{\epsilon}(h)$ is small, but $\epsilon(h)$ is large
• What causes over-fitting?
[Figure: three fits of a hypothesis $h$ on training data vs. testing data, illustrating under-fitting (too few parameters), appropriate fitting (adequate parameters), and over-fitting (too many parameters); $\hat{\epsilon}(h)$ denotes the training error and $\epsilon(h)$ the testing error.]
6. Learning Theory
• With probability $1-\delta$, the following inequality (VC bound) is satisfied:
$$\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$$
where $n$ is the number of training instances and $d$ is the VC dimension (model complexity). A small numeric example follows below.
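A minimal Python sketch that evaluates the right-hand side of the VC bound for arbitrary illustrative values of $n$, $d$, and $\delta$ (none of these numbers come from the slides); the log term is expanded algebraically so that large $d$ does not overflow a float.

```python
import math

def vc_bound(train_err, n, d, delta=0.05):
    # VC bound from the slide above:
    #   epsilon(h) <= hat_epsilon(h) + sqrt((8/n) * log(4 * (2n)^d / delta))
    # The log is expanded as log(4) + d*log(2n) - log(delta) to avoid overflow.
    log_term = math.log(4.0) + d * math.log(2 * n) - math.log(delta)
    return train_err + math.sqrt(8.0 / n * log_term)

# Illustrative values only: the complexity term shrinks with n and grows with d.
for n in (1_000, 100_000, 10_000_000):
    print(n, round(vc_bound(train_err=0.05, n=n, d=50), 3))
```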
7. VC Dimension
• Model complexity
• Categorizes the hypothesis set into finite groups
• Example: 1D linear model — the infinite hypothesis set collapses into finite groups of hypotheses, and VC dim $= \log_2 4 = 2$
8. VC Dimension
• Growth function:
$$\tau_H(n) = \max_{(x_1,\dots,x_n)} \tau(x_1,\dots,x_n), \qquad \tau(x_1,\dots,x_n) = \Big|\big\{\,(h(x_1),\dots,h(x_n)) : h \in H \,\big\}\Big|$$
where $(x_1,\dots,x_n)$ are data samples and $H$ is the hypothesis set.
• VC dimension (a brute-force check for the 1D linear model follows below):
$$d(H) = \max\{\, n : \tau_H(n) = 2^n \,\}$$
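The definitions above can be checked by brute force for the 1D linear model of slide 7. The sketch below is illustrative (the function names and the threshold-sweep enumeration are my own, not from the slides); it enumerates the labelings realizable by $h(x)=\mathrm{sign}(wx+b)$ and confirms $d(H)=2$.

```python
def dichotomies_1d_linear(xs):
    """All labelings (h(x1), ..., h(xn)) realizable by h(x) = sign(w*x + b):
    for a 1D linear model these are the labelings induced by a threshold placed
    between consecutive sorted points, in either orientation."""
    xs_sorted = sorted(xs)
    cuts = [xs_sorted[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(xs_sorted, xs_sorted[1:])]
    cuts += [xs_sorted[-1] + 1.0]
    labelings = set()
    for t in cuts:
        for sign in (+1, -1):
            labelings.add(tuple(sign if x > t else -sign for x in xs))
    return labelings

def growth_function(n):
    """tau_H(n) for the 1D linear model; any n distinct points achieve the max."""
    return len(dichotomies_1d_linear(list(range(n))))

for n in (1, 2, 3, 4):
    print(n, growth_function(n), 2 ** n)
# tau_H(2) = 4 = 2^2 but tau_H(3) = 6 < 2^3, so d(H) = 2 (VC dim = log2(4) = 2).
```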
30. PAC-Bayesian Bound
• Original PAC bound:
• PAC-Bayesian Model Averaging (COLT 1999)
• PAC-Bayesian bound for deep learning (stochastic):
• Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data (UAI 2017)
• PAC-Bayesian bound for deep learning (deterministic):
• A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (ICLR 2018)
32. PAC-Bayesian Bound
• Deterministic model: data → hypothesis $h$ → error $\epsilon(h)$
• Stochastic model (Gibbs classifier): data → distribution $Q$ over hypotheses → sample a hypothesis $h \sim Q$ → Gibbs error $\epsilon(Q) = \mathbb{E}_{h\sim Q}[\epsilon(h)]$ (a toy Monte-Carlo sketch follows below)
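To illustrate the distinction, here is a toy Python sketch of the Gibbs error. All names, the Gaussian posterior over weights, and the synthetic data are my own illustrative assumptions, not part of the slides.

```python
import numpy as np

def gibbs_error(sample_hypothesis, error_of, num_samples=1000):
    """Monte-Carlo estimate of the Gibbs error epsilon(Q) = E_{h~Q}[epsilon(h)]:
    sample hypotheses h ~ Q and average their individual errors."""
    return float(np.mean([error_of(sample_hypothesis()) for _ in range(num_samples)]))

rng = np.random.default_rng(0)

# Toy stochastic model: Q is a Gaussian over the weights of a linear classifier.
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -0.5]))                 # labels from a "true" weight vector
mean_w, std_w = np.array([1.0, -0.5]), 0.3             # posterior Q = N(mean_w, std_w^2 I)

sample_h = lambda: rng.normal(mean_w, std_w)           # draw one hypothesis h ~ Q
err_of = lambda w: float(np.mean(np.sign(X @ w) != y)) # empirical error of that h
print(gibbs_error(sample_h, err_of))
```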
37. PAC-Bayesian Bound
• Learning is feasible when $\hat{\epsilon}(Q)$ is small → $\epsilon(Q)$ is small
• With probability $1-\delta$, the following inequality (PAC-Bayesian bound) is satisfied:
$$\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$$
where $n$ is the number of training instances and $KL(Q\|P)$ is the KL divergence between the posterior $Q$ and the prior $P$. A numeric example of the bound follows below.
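A minimal numeric sketch of this bound, using illustrative values for $\hat{\epsilon}(Q)$, $KL(Q\|P)$, $n$, and $\delta$ (none of these numbers come from the slides).

```python
import math

def pac_bayes_bound(emp_gibbs_err, kl_qp, n, delta=0.05):
    # PAC-Bayesian bound from the slide above:
    #   epsilon(Q) <= hat_epsilon(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1))
    gap = math.sqrt((kl_qp + math.log(n / delta) + 2.0) / (2.0 * n - 1.0))
    return emp_gibbs_err + gap

# Illustrative numbers only: the gap grows with KL(Q||P) and shrinks with n.
for kl in (10.0, 1_000.0, 100_000.0):
    print(kl, round(pac_bayes_bound(emp_gibbs_err=0.05, kl_qp=kl, n=60_000), 3))
```

The three KL values roughly mirror the low / moderate / high KL regimes discussed on the next slide.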
38. PAC-Bayesian Bound
$$\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$$
• Low $KL(Q\|P)$: $\hat{\epsilon}(Q)$ high, $\epsilon(Q)$ high, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ low → under-fitting
• Moderate $KL(Q\|P)$: $\hat{\epsilon}(Q)$ moderate, $\epsilon(Q)$ moderate, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ moderate → appropriate fitting
• High $KL(Q\|P)$: $\hat{\epsilon}(Q)$ low, $\epsilon(Q)$ high, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ high → over-fitting
[Figure: three panels (under-fitting, appropriate fitting, over-fitting) showing the prior $P$, the posterior $Q$, and the training-data vs. testing-data error curves for each regime.]
44. Estimating the PAC-Bayesian Bound
• The bound in KL form:
$$KL\big(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)\big) \le \frac{KL(Q\|P) + \log\frac{n}{\delta}}{n-1}$$
• $\hat{\epsilon}(Q) = \mathbb{E}_{h\sim Q}[\hat{\epsilon}(h)]$ is intractable
• Estimate it by sampling $m$ hypotheses from $Q$:
$$\hat{\epsilon}(\hat{Q}) = \frac{1}{m}\sum_{i=1}^{m}\hat{\epsilon}(h_i), \qquad h_i \sim Q$$
• With probability $1-\delta'$, the following inequality is satisfied (a sketch of this estimation step follows below):
$$KL\big(\hat{\epsilon}(\hat{Q})\,\|\,\hat{\epsilon}(Q)\big) \le \frac{\log\frac{2}{\delta'}}{m}$$
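A sketch of the estimation step, assuming illustrative sampled errors $\hat{\epsilon}(h_i)$. The bisection inversion of the binary KL (used to turn the KL inequality into an upper bound on $\hat{\epsilon}(Q)$) is a standard numerical choice on my part, not something specified on the slide.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence KL(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, bound, tol=1e-9):
    """Largest p >= q_hat with KL(q_hat || p) <= bound, found by bisection.
    Turns KL(hat_eps(Q_hat) || hat_eps(Q)) <= log(2/delta')/m into an upper
    bound on hat_eps(Q)."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Monte-Carlo estimate of hat_eps(Q): average the errors of m sampled hypotheses.
errors_of_sampled_h = [0.04, 0.06, 0.05, 0.07, 0.05]   # illustrative hat_eps(h_i), h_i ~ Q
m, delta_prime = len(errors_of_sampled_h), 0.05
eps_hat_Qhat = sum(errors_of_sampled_h) / m
eps_hat_Q_upper = kl_upper_inverse(eps_hat_Qhat, math.log(2 / delta_prime) / m)
print(eps_hat_Qhat, eps_hat_Q_upper)
```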
51. PAC-Bayesian Bound for Deep Learning (Deterministic)
• Constraint on sharpness: given a margin $\gamma > 0$, if $Q$ satisfies
$$P_{h'\sim Q}\Big[\,\sup_{x\in X}\|h'(x) - h(x)\|_\infty \le \tfrac{\gamma}{4}\,\Big] \ge \tfrac{1}{2}$$
• then with probability $1-\delta$, the following inequality is satisfied:
$$L_0(h) \le \hat{L}_\gamma(h) + 4\sqrt{\frac{KL(Q\|P) + \log\frac{6n}{\delta}}{n-1}}$$
• Margin loss $L_\gamma(h)$ (an example computation of its empirical version follows below):
$$L_\gamma(h) = P_{(x,y)\sim D}\Big[\,h(x)[y] \le \gamma + \max_{j\ne y} h(x)[j]\,\Big]$$
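An example computation of the empirical margin loss $\hat{L}_\gamma(h)$ on a few made-up logits; the data and the function name are illustrative and not taken from the paper.

```python
import numpy as np

def margin_loss(logits, labels, gamma):
    """Empirical margin loss hat_L_gamma(h): fraction of samples where the true-class
    score beats the best other class by less than gamma, i.e.
    h(x)[y] <= gamma + max_{j != y} h(x)[j]."""
    n = len(labels)
    true_scores = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf       # exclude the true class from the max
    runner_up = masked.max(axis=1)
    return float(np.mean(true_scores <= gamma + runner_up))

# Illustrative logits for 3 samples and 4 classes.
logits = np.array([[2.0, 0.1, 0.3, -1.0],
                   [0.2, 0.9, 0.8,  0.1],
                   [1.5, 1.4, 0.0,  0.2]])
labels = np.array([0, 1, 0])
print(margin_loss(logits, labels, gamma=0.0))    # standard 0-1 error
print(margin_loss(logits, labels, gamma=0.5))    # stricter: requires a margin of 0.5
```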
52. How to Overcome Over-fitting?
Traditional Machine Learning
• reduce the number of parameters
• weight decay
• early stopping
• data augmentation
Modern Deep Learning
• reduce the number of parameters
• weight decay ?
• early stopping ?
• data augmentation ?
• improving data quality
• starting from a good prior P (transfer learning)
VC bound: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$
PAC-Bayesian bound: $KL\big(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)\big) \le \frac{KL(Q\|P) + \log\frac{n}{\delta}}{n-1}$