1) The document discusses several papers on deep learning models that use information theory concepts like mutual information and variational information bottleneck.
2) It describes the deep variational information bottleneck model, which learns representations that maximize mutual information with the labels while minimizing information about the inputs.
3) Other models discussed aim to learn disentangled and invariant representations by regularizing the information contained in the weights through techniques like information dropout.
2. DEEP VARIATIONAL INFORMATION BOTTLENECK
• Tishby et al. (1999): the information bottleneck (IB); deep robustness
• VIB
• Entropy regularization (Pereyra et al., 2017)
• Markov chain X → Z → Y (encoder X → Z, decoder Z → Y)
Alexander A. Alemi et al. (Google Research), ICLR 2017
The second term encourages Z to forget X: it forces Z to act
like a minimal sufficient statistic of X for predicting Y.
Variational approximation and re-parameterization trick
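As a concrete sketch of the reparameterization trick and the KL ("forget X") term of the variational bound, here is a NumPy toy with a hypothetical linear encoder (not the paper's architecture; the weights are random stand-ins for a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_sigma):
    """Hypothetical linear encoder producing the mean and std of p(z|x)."""
    mu = x @ W_mu
    sigma = np.log1p(np.exp(x @ W_sigma))  # softplus keeps the std positive
    return mu, sigma

def reparameterize(mu, sigma):
    """z = mu + sigma * eps with eps ~ N(0, I), so the sample is a
    deterministic, differentiable function of mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Per-example KL(N(mu, diag(sigma^2)) || N(0, I)) -- the beta-weighted
    compression term of the variational bound."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma), axis=1)

# Toy usage with random weights in place of a trained encoder.
x = rng.standard_normal((4, 8))
W_mu = 0.1 * rng.standard_normal((8, 2))
W_sigma = 0.1 * rng.standard_normal((8, 2))
mu, sigma = encode(x, W_mu, W_sigma)
z = reparameterize(mu, sigma)
kl = kl_to_standard_normal(mu, sigma)
```

In training, the KL term would be added to the classification loss with weight β, and gradients flow through mu and sigma because the noise eps is sampled independently of them.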
Permutation-invariant MNIST: learned feature mappings and error for different values of β
Target maximization function
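The target objective referred to above is, in the paper's notation, the IB objective together with the practical variational bound that is minimized during training (β weights the compression term):

```latex
% IB objective to be maximized
R_{IB}(\theta) = I(Z, Y; \theta) - \beta\, I(Z, X; \theta)

% variational bound minimized in practice (reparameterized)
L = \frac{1}{N} \sum_{n=1}^{N}
    \mathbb{E}_{\epsilon \sim p(\epsilon)}
      \big[ -\log q\big(y_n \mid f(x_n, \epsilon)\big) \big]
    + \beta\, \mathrm{KL}\big( p(Z \mid x_n) \,\Vert\, r(Z) \big)
```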
Alemi, A., Fischer, I., Dillon, J., Murphy, K. (2016). Deep Variational Information Bottleneck. arXiv preprint, cs.LG.
3. DEEP VARIATIONAL INFORMATION BOTTLENECK (cont.)
A low β allows a large I(Z, X), and a large I(Z, X) causes overfitting on the test set, with I(Z, Y) decreasing.
• Future directions: open-universe classification, sequence prediction
• Connection to VAE
Considering an unsupervised version of the IB objective recovers the VAE loss.
The aim is to take our data X and maximize the mutual information contained in some encoding Z,
while restricting how much information we allow the representation to contain about the identity of each
data element in our sample (i).
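Schematically, that unsupervised objective becomes a β-weighted VAE-style loss, with β = 1 recovering the standard VAE evidence lower bound and general β the β-VAE:

```latex
\min \; \mathbb{E}_{x}\Big[
    \mathbb{E}_{z \sim p(z \mid x)}\big[ -\log q(x \mid z) \big]
    + \beta\, \mathrm{KL}\big( p(z \mid x) \,\Vert\, r(z) \big)
\Big]
```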
• Relationship between I(Z, X) and I(Z, Y), and between β and I(Z, X)
4. Information Dropout: Learning Optimal Representations Through Noisy Computation
• IB and dropout
• Information Dropout (a VIB-style dropout) + TC term
• TC-VAE
Alessandro Achille and Stefano Soatto, IEEE TPAMI 2018
• IB Lagrangian
• Approximation of noise injection (log-normal distribution)
• Disentanglement by measuring the total correlation
Minimizing the TC term is intractable in general, but choosing β = γ makes it easy to solve.
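For reference, the total correlation of a representation z is the KL divergence between the joint and the product of its marginals; it is zero exactly when the components z_i are independent, which is the sense of disentanglement measured here:

```latex
\mathrm{TC}(z) = \mathrm{KL}\Big( p(z) \,\Big\Vert\, \prod_i p(z_i) \Big)
```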
Stochastic dropout
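A minimal NumPy sketch of the multiplicative log-normal noise injection (in the paper the noise scale α is a learned function of the input; here it is a fixed constant for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def information_dropout(h, alpha):
    """Multiply activations by log-normal noise: eps = exp(alpha * n), n ~ N(0, I).

    A larger alpha injects more noise, reducing the information the layer
    output carries about its input -- the IB-style bottleneck effect.
    """
    noise = np.exp(alpha * rng.standard_normal(h.shape))
    return h * noise

h = np.ones((3, 5))
out_clean = information_dropout(h, alpha=0.0)  # no noise: output equals input
out_noisy = information_dropout(h, alpha=0.5)  # stochastic output
```

Unlike Bernoulli dropout, the noise is always strictly positive, and its variance (controlled by α) is what implements the information penalty.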
Achille, A., Soatto, S. (2017). Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2897-2905.
8. Emergence of Invariance and Disentanglement in Deep Representations
• Information decomposition of cross entropy
Alessandro Achille and Stefano Soatto, Journal of Machine Learning Research 2018
• To prevent overfitting, a constraint on the information in the weights is added
Intrinsic error: the error in predicting the label that remains even if we knew the underlying data distribution
Sufficiency: how much information the dataset carries about the parameter theta, as measured from the weights
Efficiency: the efficiency of the model and of the class of functions with respect to which the loss is optimized
Overfitting: information that is uninformative about the underlying data distribution, memorized in the weights
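Schematically (suppressing the exact conditioning used in the paper), the four terms above decompose the expected cross-entropy loss as follows; note the minus sign on the last term, meaning the loss can be reduced by memorizing the dataset, which is precisely what the information constraint prevents:

```latex
H_{p,q}(y \mid x, w)
  = \underbrace{H(y \mid x, \theta)}_{\text{intrinsic error}}
  + \underbrace{I(\theta; \mathcal{D})}_{\text{sufficiency}}
  + \underbrace{\mathbb{E}\,\mathrm{KL}(p \,\Vert\, q)}_{\text{efficiency}}
  - \underbrace{I(\mathcal{D}; w \mid \theta)}_{\text{overfitting}}
```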
• Flat minima have low information
Since the second term is intractable, we use the general upper bound below.
Networks with low information in the weights realize invariant and disentangled representations.
Therefore, invariance and disentanglement emerge naturally when training a network with
implicit (SGD) or explicit (IB Lagrangian) regularization, and are related to flat minima.
Achille, A., Soatto, S. (2017). Emergence of Invariance and Disentanglement in Deep Representations. arXiv preprint, cs.LG.