The document outlines the PAC-Bayesian bound for deep learning. It discusses how the PAC-Bayesian bound provides a generalization guarantee that depends on the KL divergence between the prior and posterior distributions over hypotheses. This allows the bound to account for factors like model complexity and noise in the training data, avoiding some limitations of other generalization bounds. The document also explains how the PAC-Bayesian bound can be applied to stochastic neural networks by placing distributions over the network weights.
4. Learning Theory
• Over-fitting: $\hat{\epsilon}(h)$ is small, but $\epsilon(h)$ is large
• What causes over-fitting?
[Figure: three fits of a hypothesis $h$ on training data vs. testing data, illustrating under-fitting (too few parameters), appropriate fitting (adequate parameters), and over-fitting (too many parameters); $\hat{\epsilon}(h)$ denotes the training error and $\epsilon(h)$ the testing error.]
6. Learning Theory
• With probability $1-\delta$, the following inequality (VC bound) is satisfied:
$$\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$$
where $n$ is the number of training instances and $d$ is the VC dimension (model complexity). A small numeric example follows below.
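A minimal Python sketch that evaluates the right-hand side of the VC bound for arbitrary illustrative values of $n$, $d$, and $\delta$ (none of these numbers come from the slides); the log term is expanded algebraically so that large $d$ does not overflow a float.

```python
import math

def vc_bound(train_err, n, d, delta=0.05):
    # VC bound from the slide above:
    #   epsilon(h) <= hat_epsilon(h) + sqrt((8/n) * log(4 * (2n)^d / delta))
    # The log is expanded as log(4) + d*log(2n) - log(delta) to avoid overflow.
    log_term = math.log(4.0) + d * math.log(2 * n) - math.log(delta)
    return train_err + math.sqrt(8.0 / n * log_term)

# Illustrative values only: the complexity term shrinks with n and grows with d.
for n in (1_000, 100_000, 10_000_000):
    print(n, round(vc_bound(train_err=0.05, n=n, d=50), 3))
```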
7. VC Dimension
• Model complexity
• Categorizes the hypothesis set into finite groups
• Example: 1D linear model — the infinite hypothesis set collapses into finite groups of hypotheses, and VC dim $= \log_2 4 = 2$
8. VC Dimension
• Growth function:
$$\tau_H(n) = \max_{(x_1,\dots,x_n)} \tau(x_1,\dots,x_n), \qquad \tau(x_1,\dots,x_n) = \Big|\big\{\,(h(x_1),\dots,h(x_n)) : h \in H \,\big\}\Big|$$
where $(x_1,\dots,x_n)$ are data samples and $H$ is the hypothesis set.
• VC dimension (a brute-force check for the 1D linear model follows below):
$$d(H) = \max\{\, n : \tau_H(n) = 2^n \,\}$$
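The definitions above can be checked by brute force for the 1D linear model of slide 7. The sketch below is illustrative (the function names and the threshold-sweep enumeration are my own, not from the slides); it enumerates the labelings realizable by $h(x)=\mathrm{sign}(wx+b)$ and confirms $d(H)=2$.

```python
def dichotomies_1d_linear(xs):
    """All labelings (h(x1), ..., h(xn)) realizable by h(x) = sign(w*x + b):
    for a 1D linear model these are the labelings induced by a threshold placed
    between consecutive sorted points, in either orientation."""
    xs_sorted = sorted(xs)
    cuts = [xs_sorted[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(xs_sorted, xs_sorted[1:])]
    cuts += [xs_sorted[-1] + 1.0]
    labelings = set()
    for t in cuts:
        for sign in (+1, -1):
            labelings.add(tuple(sign if x > t else -sign for x in xs))
    return labelings

def growth_function(n):
    """tau_H(n) for the 1D linear model; any n distinct points achieve the max."""
    return len(dichotomies_1d_linear(list(range(n))))

for n in (1, 2, 3, 4):
    print(n, growth_function(n), 2 ** n)
# tau_H(2) = 4 = 2^2 but tau_H(3) = 6 < 2^3, so d(H) = 2 (VC dim = log2(4) = 2).
```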
30. PAC-Bayesian Bound
• Original PAC bound:
• PAC-Bayesian Model Averaging (COLT 1999)
• PAC-Bayesian bound for deep learning (stochastic):
• Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data (UAI 2017)
• PAC-Bayesian bound for deep learning (deterministic):
• A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (ICLR 2018)
32. PAC-Bayesian Bound
• Deterministic model: data → hypothesis $h$ → error $\epsilon(h)$
• Stochastic model (Gibbs classifier): data → distribution $Q$ over hypotheses → sample a hypothesis $h \sim Q$ → Gibbs error $\epsilon(Q) = \mathbb{E}_{h\sim Q}[\epsilon(h)]$ (a toy Monte-Carlo sketch follows below)
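To illustrate the distinction, here is a toy Python sketch of the Gibbs error. All names, the Gaussian posterior over weights, and the synthetic data are my own illustrative assumptions, not part of the slides.

```python
import numpy as np

def gibbs_error(sample_hypothesis, error_of, num_samples=1000):
    """Monte-Carlo estimate of the Gibbs error epsilon(Q) = E_{h~Q}[epsilon(h)]:
    sample hypotheses h ~ Q and average their individual errors."""
    return float(np.mean([error_of(sample_hypothesis()) for _ in range(num_samples)]))

rng = np.random.default_rng(0)

# Toy stochastic model: Q is a Gaussian over the weights of a linear classifier.
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -0.5]))                 # labels from a "true" weight vector
mean_w, std_w = np.array([1.0, -0.5]), 0.3             # posterior Q = N(mean_w, std_w^2 I)

sample_h = lambda: rng.normal(mean_w, std_w)           # draw one hypothesis h ~ Q
err_of = lambda w: float(np.mean(np.sign(X @ w) != y)) # empirical error of that h
print(gibbs_error(sample_h, err_of))
```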
37. PAC-Bayesian Bound
• Learning is feasible when $\hat{\epsilon}(Q)$ is small → $\epsilon(Q)$ is small
• With probability $1-\delta$, the following inequality (PAC-Bayesian bound) is satisfied:
$$\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$$
where $n$ is the number of training instances and $KL(Q\|P)$ is the KL divergence between the posterior $Q$ and the prior $P$. A numeric example of the bound follows below.
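A minimal numeric sketch of this bound, using illustrative values for $\hat{\epsilon}(Q)$, $KL(Q\|P)$, $n$, and $\delta$ (none of these numbers come from the slides).

```python
import math

def pac_bayes_bound(emp_gibbs_err, kl_qp, n, delta=0.05):
    # PAC-Bayesian bound from the slide above:
    #   epsilon(Q) <= hat_epsilon(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1))
    gap = math.sqrt((kl_qp + math.log(n / delta) + 2.0) / (2.0 * n - 1.0))
    return emp_gibbs_err + gap

# Illustrative numbers only: the gap grows with KL(Q||P) and shrinks with n.
for kl in (10.0, 1_000.0, 100_000.0):
    print(kl, round(pac_bayes_bound(emp_gibbs_err=0.05, kl_qp=kl, n=60_000), 3))
```

The three KL values roughly mirror the low / moderate / high KL regimes discussed on the next slide.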
38. PAC-Bayesian Bound
$$\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$$
• Low $KL(Q\|P)$: $\hat{\epsilon}(Q)$ high, $\epsilon(Q)$ high, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ low → under-fitting
• Moderate $KL(Q\|P)$: $\hat{\epsilon}(Q)$ moderate, $\epsilon(Q)$ moderate, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ moderate → appropriate fitting
• High $KL(Q\|P)$: $\hat{\epsilon}(Q)$ low, $\epsilon(Q)$ high, $|\epsilon(Q) - \hat{\epsilon}(Q)|$ high → over-fitting
[Figure: three panels (under-fitting, appropriate fitting, over-fitting) showing the prior $P$, the posterior $Q$, and the training-data vs. testing-data error curves for each regime.]
44. Estimating the PAC-Bayesian Bound
• The bound in KL form:
$$KL\big(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)\big) \le \frac{KL(Q\|P) + \log\frac{n}{\delta}}{n-1}$$
• $\hat{\epsilon}(Q) = \mathbb{E}_{h\sim Q}[\hat{\epsilon}(h)]$ is intractable
• Estimate it by sampling $m$ hypotheses from $Q$:
$$\hat{\epsilon}(\hat{Q}) = \frac{1}{m}\sum_{i=1}^{m}\hat{\epsilon}(h_i), \qquad h_i \sim Q$$
• With probability $1-\delta'$, the following inequality is satisfied (a sketch of this estimation step follows below):
$$KL\big(\hat{\epsilon}(\hat{Q})\,\|\,\hat{\epsilon}(Q)\big) \le \frac{\log\frac{2}{\delta'}}{m}$$
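A sketch of the estimation step, assuming illustrative sampled errors $\hat{\epsilon}(h_i)$. The bisection inversion of the binary KL (used to turn the KL inequality into an upper bound on $\hat{\epsilon}(Q)$) is a standard numerical choice on my part, not something specified on the slide.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence KL(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, bound, tol=1e-9):
    """Largest p >= q_hat with KL(q_hat || p) <= bound, found by bisection.
    Turns KL(hat_eps(Q_hat) || hat_eps(Q)) <= log(2/delta')/m into an upper
    bound on hat_eps(Q)."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Monte-Carlo estimate of hat_eps(Q): average the errors of m sampled hypotheses.
errors_of_sampled_h = [0.04, 0.06, 0.05, 0.07, 0.05]   # illustrative hat_eps(h_i), h_i ~ Q
m, delta_prime = len(errors_of_sampled_h), 0.05
eps_hat_Qhat = sum(errors_of_sampled_h) / m
eps_hat_Q_upper = kl_upper_inverse(eps_hat_Qhat, math.log(2 / delta_prime) / m)
print(eps_hat_Qhat, eps_hat_Q_upper)
```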
51. PAC-Bayesian Bound for Deep Learning (Deterministic)
• Constraint on sharpness: given a margin $\gamma > 0$, if $Q$ satisfies
$$P_{h'\sim Q}\Big[\,\sup_{x\in X}\|h'(x) - h(x)\|_\infty \le \tfrac{\gamma}{4}\,\Big] \ge \tfrac{1}{2}$$
• then with probability $1-\delta$, the following inequality is satisfied:
$$L_0(h) \le \hat{L}_\gamma(h) + 4\sqrt{\frac{KL(Q\|P) + \log\frac{6n}{\delta}}{n-1}}$$
• Margin loss $L_\gamma(h)$ (an example computation of its empirical version follows below):
$$L_\gamma(h) = P_{(x,y)\sim D}\Big[\,h(x)[y] \le \gamma + \max_{j\ne y} h(x)[j]\,\Big]$$
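An example computation of the empirical margin loss $\hat{L}_\gamma(h)$ on a few made-up logits; the data and the function name are illustrative and not taken from the paper.

```python
import numpy as np

def margin_loss(logits, labels, gamma):
    """Empirical margin loss hat_L_gamma(h): fraction of samples where the true-class
    score beats the best other class by less than gamma, i.e.
    h(x)[y] <= gamma + max_{j != y} h(x)[j]."""
    n = len(labels)
    true_scores = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf       # exclude the true class from the max
    runner_up = masked.max(axis=1)
    return float(np.mean(true_scores <= gamma + runner_up))

# Illustrative logits for 3 samples and 4 classes.
logits = np.array([[2.0, 0.1, 0.3, -1.0],
                   [0.2, 0.9, 0.8,  0.1],
                   [1.5, 1.4, 0.0,  0.2]])
labels = np.array([0, 1, 0])
print(margin_loss(logits, labels, gamma=0.0))    # standard 0-1 error
print(margin_loss(logits, labels, gamma=0.5))    # stricter: requires a margin of 0.5
```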
52. How to Overcome Over-fitting?
Traditional Machine Learning
• reduce the number of parameters
• weight decay
• early stopping
• data augmentation
Modern Deep Learning
• reduce the number of parameters
• weight decay ?
• early stopping ?
• data augmentation ?
• improving data quality
• starting from a good prior P (transfer learning)
VC bound: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$
PAC-Bayesian bound: $KL\big(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)\big) \le \frac{KL(Q\|P) + \log\frac{n}{\delta}}{n-1}$