2. Overview
• Emergence of Invariance and Disentangling in Deep Representations
• Authors: Achille & Soatto (UCLA)
• Appeared in ICML 2017 Workshop
• Contribution
• Investigate the relations between desirable properties of a representation
• Propose a complexity measure for neural networks
4. Properties for Representation
• A representation z is a stochastic function of the data x that should be useful for a given task y, while a nuisance n also affects the data
• A "good representation" should satisfy
• sufficient: I(z; y) = I(x; y)
• minimal: minimize I(z; x) among sufficient z
• invariant: minimize I(z; n)
• disentangled: minimize TC(z) = KL(p(z) ∥ ∏_i p(z_i)) (a small numerical sketch of TC follows the footnotes below)
• However, we will show that only minimal sufficiency is essential; i.e. invariance and disentanglement are obtained automatically under a certain model assumption
* TC: total correlation
** Actually, the assumption is not mild; still, the result is quite interesting
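• As an illustration of the disentanglement term above (our own sketch, not from the paper): for a jointly Gaussian z, TC(z) has the closed form (1/2)(Σ_i log Σ_ii − log det Σ). The helper name gaussian_total_correlation is ours.

    import numpy as np

    def gaussian_total_correlation(cov):
        # TC(z) = KL(p(z) || prod_i p(z_i)) for a zero-mean Gaussian with covariance `cov`, in nats
        sign, logdet = np.linalg.slogdet(cov)
        assert sign > 0, "covariance must be positive definite"
        return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))
    cov_entangled = A @ A.T + 1e-3 * np.eye(4)            # correlated components -> TC > 0
    cov_disentangled = np.diag(np.diag(cov_entangled))    # independent components -> TC = 0

    print(gaussian_total_correlation(cov_entangled))      # strictly positive
    print(gaussian_total_correlation(cov_disentangled))   # ~0

• A perfectly disentangled (diagonal-covariance) z has TC(z) = 0; correlations between components increase it.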
(Graphical model: the task y and the nuisance n generate the data x, which is encoded into the representation z)
5. Minimal Sufficiency ⇒ IB Lagrangian
• Information Bottleneck (IB) Lagrangian
ℒ = H(y | z) + β ⋅ I(z; x)
• Minimizing the IB Lagrangian yields a minimal sufficient representation (a variational training sketch follows this slide)
• By the Data Processing Inequality (DPI), a deep network
x → z_1 → ⋯ → z_L
• satisfies I(z_L; x) ≤ I(z_1; x); i.e. stacking layers increases minimality
• Q. In real scenarios we do not optimize the IB Lagrangian. Does the argument still apply?
• A. SGD implicitly enforces minimality
* ResNet also satisfies this Markov chain if we define each z_k at the level of a residual "block"
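• A minimal training sketch of the IB Lagrangian above, assuming a Gaussian encoder and the standard variational upper bound I(z; x) ≤ E_x KL(q(z|x) ∥ N(0, I)); this follows the common Deep Variational IB recipe rather than necessarily the authors' exact setup, and IBClassifier / ib_loss are our own names.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IBClassifier(nn.Module):
        def __init__(self, in_dim=784, z_dim=32, n_classes=10):
            super().__init__()
            self.enc = nn.Linear(in_dim, 2 * z_dim)   # mean and log-variance of q(z | x)
            self.dec = nn.Linear(z_dim, n_classes)    # classifier head q(y | z)

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)            # reparameterized sample of z
            kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()  # KL(q(z|x) || N(0, I))
            return self.dec(z), kl

    def ib_loss(model, x, y, beta=1e-3):
        logits, kl = model(x)
        return F.cross_entropy(logits, y) + beta * kl   # H(y | z) term + beta * upper bound on I(z; x)

    model = IBClassifier()
    x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
    ib_loss(model, x, y).backward()

• A larger beta trades task accuracy for a more minimal (and hence more invariant) z.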
6. Model Assumption
• Now we will show that, under a certain model assumption,
1. minimality implies invariance and disentanglement
2. SGD implicitly enforces minimality
• Model Assumption
• Assume a log-uniform prior on w; i.e. p(w_i) ∝ 1/|w_i|
• Assume the posterior w_i | D = ε_i ⋅ ŵ_i, where ε_i ∼ log N(−α_i/2, α_i)
• α_i is also optimized (Variational Dropout; Kingma 2015)
• Then the weight information is
I(w; D) = −(1/2) Σ_{i=1}^{dim w} log α_i + C
(a small layer sketch follows this slide)
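• A small layer sketch of this assumption (ours, not the authors' code; VDLinear and weight_information are illustrative names): the posterior is w_i | D = ε_i ⋅ ŵ_i with ε_i ∼ log N(−α_i/2, α_i), and I(w; D) = −(1/2) Σ_i log α_i up to an additive constant.

    import torch
    import torch.nn as nn

    class VDLinear(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.w_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
            self.log_alpha = nn.Parameter(torch.full((out_dim, in_dim), -4.0))  # alpha_i is optimized too

        def forward(self, x):
            alpha = self.log_alpha.exp()
            # eps ~ logNormal(-alpha/2, alpha): multiplicative noise with mean 1
            eps = torch.exp(-0.5 * alpha + alpha.sqrt() * torch.randn_like(alpha))
            return x @ (self.w_hat * eps).t()

        def weight_information(self):
            # I(w; D) up to an additive constant
            return -0.5 * self.log_alpha.sum()

    layer = VDLinear(784, 10)
    x = torch.randn(8, 784)
    print(layer(x).shape, layer.weight_information().item())

• Both w_hat and log_alpha are trained, matching the Variational Dropout setup referenced above.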
7. Minimality ⇒ Invariance & Disentanglement
• Proposition 1. For a single layer z = W ⋅ x,
g(I(W; D)) ≤ I(z; x) + TC(z) ≤ g(I(W; D)) + c
• where g is some strictly increasing function and c = O(1/dim(x))
• Corollary 1. For an MLP x = z_0 → z_1 → ⋯ → z_L with z_k = φ(W_k z_{k−1}),
I(z_L; x) ≤ min_{k≤L} I(z_k; z_{k−1}) ≤ min_{k≤L} I(W_k z_{k−1}; z_{k−1})
• In the multi-layer case we only obtain this upper bound (no matching lower bound as in Proposition 1); a short derivation sketch follows this slide
• ⇒ minimality implies invariance and disentanglement
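• Derivation sketch for the upper bound in Corollary 1 (our own addition, assuming z_k = φ(W_k z_{k−1}) with z_0 = x, possibly with independent noise per layer); repeated applications of the DPI along the Markov chain x → z_{k−1} → z_k → z_L give:

    % assumes \usepackage{amsmath}
    % DPI along x -> z_k -> z_L, then x -> z_{k-1} -> z_k, then z_{k-1} -> W_k z_{k-1} -> z_k
    \begin{align*}
      I(z_L; x) \;\le\; I(z_k; x) \;\le\; I(z_k; z_{k-1})
        \;=\; I\bigl(\phi(W_k z_{k-1}); z_{k-1}\bigr) \;\le\; I(W_k z_{k-1}; z_{k-1})
        \qquad \text{for every } k \le L .
    \end{align*}

• Taking the minimum over k ≤ L yields the stated bound.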
8. SGD ⇒ Minimize Weight Information
• Proposition 2. Let H be the Hessian of the loss at the local minimum ŵ. Assume the posterior parameters (ŵ, α) are the optimal solution of the IB Lagrangian ℒ = H_{p,q}(D | w) + β ⋅ I(w; D). Then
I(w; D) ≤ K [ log ‖H‖_* + log ‖ŵ‖_2^2 − log βK^2 ]
• where K = dim(w) and ‖⋅‖_* is the nuclear norm
• Empirical evidence: SGD converges to flat minima; i.e. ‖H‖_* = tr(H) is small (toy numerical sketch below)
• ⇒ SGD implicitly minimizes the weight information
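• Toy numerical sketch (ours, not from the paper) of the "flat minima ⇒ small ‖H‖_*" point: for a positive semi-definite Hessian the nuclear norm equals the trace, so smaller curvature directly shrinks the log ‖H‖_* term in the bound above.

    import numpy as np

    def quadratic_hessian(curvatures):
        # Hessian of the toy loss L(w) = 0.5 * sum_i c_i * w_i^2 at its minimum w = 0
        return np.diag(np.asarray(curvatures, dtype=float))

    H_sharp = quadratic_hessian([50.0, 40.0, 30.0])   # sharp minimum: large curvature
    H_flat  = quadratic_hessian([0.5, 0.4, 0.3])      # flat minimum: small curvature

    for name, H in [("sharp", H_sharp), ("flat", H_flat)]:
        nuclear = np.abs(np.linalg.eigvalsh(H)).sum()            # nuclear norm = sum of singular values
        print(name, nuclear, np.isclose(nuclear, np.trace(H)))   # equals tr(H) for a PSD Hessian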
10. Revisit Overfitting
• Let p_θ(x, y) be the data distribution and q( ⋅ | x; w) be the neural network
• Decompose the cross-entropy loss:
H_{p,q}(D | w) = H(D | θ) + I(θ; D | w) + E KL(p ∥ q) − I(D; w | θ)
• The four terms measure intrinsic error, sufficiency, model efficiency, and overfitting, respectively
• Since I(D; w | θ) is intractable (θ is unknown), we use I(w; D) as a regularizer; i.e. we solve the IB Lagrangian
ℒ = H_{p,q}(D | w) + β ⋅ I(w; D)
• We will also use I(w; D) as a measure of model complexity (a minimal loss sketch follows this slide)
• I(w; D) is small when the model underfits and large when it overfits
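• Minimal loss sketch for this slide (assumed setup, ours rather than the authors' code; weight_information and ib_objective are illustrative names): use −(1/2) Σ_i log α_i as a stand-in for I(w; D), add it to the cross-entropy scaled by β, and log it as the complexity / overfitting measure.

    import torch
    import torch.nn.functional as F

    def weight_information(log_alpha):
        return -0.5 * log_alpha.sum()          # I(w; D) up to an additive constant

    def ib_objective(logits, targets, log_alpha, beta=1e-5):
        return F.cross_entropy(logits, targets) + beta * weight_information(log_alpha)

    # toy usage: pretend `logits` came from a network whose noise parameters are `log_alpha`
    logits = torch.randn(8, 10, requires_grad=True)
    targets = torch.randint(0, 10, (8,))
    log_alpha = torch.full((10, 784), -4.0, requires_grad=True)
    ib_objective(logits, targets, log_alpha).backward()
    print(weight_information(log_alpha).item())   # track during training; growth suggests memorization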
15. Revisit Rethinking Generalization
• [Zhang' 2017] claimed that we need a new generalization theory for deep learning
• Random Label Test: deep networks easily fit random labels
• With random labels the network overfits, but the weight information I(w; D) increases, so the proposed complexity measure detects the overfitting
• ...and this recovers the bias-variance tradeoff
* [Zhang’ 2017] Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
16. Effect of β
• As β increases, I(w; D) decreases, and z becomes more invariant (i.e. it discards more information)
17. Conclusion
• Conclusion
1. The authors proposed properties of a "good representation" and showed that minimal sufficiency suffices for invariance and disentanglement
2. The authors proposed a complexity measure for neural networks (the weight information), which resolves the paradox raised by the rethinking-generalization paper
• Research Questions
1. Minimality ⇒ invariance holds in general, but what about minimality ⇒ disentanglement?
Under which assumptions can we guarantee disentanglement?
2. Weight information seems to be a promising alternative measure for generalization.
How can we estimate I(w; D) efficiently for a general neural network?