IMSAT is an unsupervised learning method that uses information maximization and self-augmented training to learn discrete representations from data. It trains a deep neural network to map inputs to discrete outputs while maximizing the mutual information between them. Self-augmented training encourages the representations of augmented data points to be similar to those of the original points, imposing invariance to perturbations. Experiments on MNIST clustering and hashing show state-of-the-art results, demonstrating that IMSAT can learn representations that are robust to various transformations.
Information Maximizing Self-Augmented Training for Unsupervised Discrete Representation Learning
1.
Learning Discrete Representations via
Information Maximizing
Self-Augmented Training
Weihua Hu, Takeru Miyato, Seiya Tokui,
Eiichi Matsumoto, Masashi Sugiyama
Intelligent Information Processing II
Nov 20, 2017
University of Tokyo, RIKEN AIP, Preferred Networks, Inc.
Proceedings of the 34th International Conference on Machine Learning
Presented by Shunsuke KITADA
2. Why I chose this paper
● Achieves high accuracy (98%!) on MNIST classification
with unsupervised learning.
● From the University of Tokyo (Sugiyama lab)
and Preferred Networks.
● Uses VAT as an effective regularization term.
● Accepted at ICML 2017.
3. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
5. Introduction
● Unsupervised discrete representation learning
○ To obtain a function that maps similar (or dissimilar) data into
similar (or dissimilar) discrete representations.
○ The similarity of data is defined according to the application of
interest.
6. Introduction
● Clustering and Hash learning
○ Clustering
■ Widely applied to data-driven application
domains. [Berkhin 2006]
○ Hash learning
■ Popular for approximate nearest neighbor search in
large-scale information retrieval. [Wang+ 2016]
7. Introduction
● Development of deep neural networks
○ Scalability and flexibility
■ They can learn complex features and non-linear
decision boundaries.
○ Their model complexity is very large
■ Regularization of the networks is crucial for learning
meaningful representations of data.
8. Introduction
● In unsupervised representation learning
○ Target representations are not provided.
○ There are no constraining conditions.
➔ We need to regularize the networks in order to learn useful
representations that exhibit intended invariance for
applications of interest.
◆ e.g., invariance to small perturbations or affine transformations
9. Introduction | In this paper
● Use data augmentation to model the invariance of
learned data representations
○ Map data points into their discrete representations by a deep
neural network.
○ Regularize it by encouraging its prediction to be invariant to data
augmentation.
10.
● Self-Augmented Training
(SAT)
Encourage the predicted
representations of augmented data
points to be close to those of the original
data points in an end-to-end fashion.
● Regularized Information
Maximization (RIM)
Maximize information theoretic
dependency between inputs and their
mapped outputs, while regularizing the
mapping function.
Information Maximizing
Self-Augmented Training
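Putting the two together, IMSAT trains the classifier by minimizing the SAT penalty
while maximizing the mutual information between inputs and their discrete outputs.
In the notation used later in this deck, with trade-off weight λ, the combined
objective is

    min_θ  R_SAT(θ; T) − λ · I(X; Y)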
11. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
13. Related work | Clustering & Hash Learning
● The representative clustering and hashing methods
○ K-means clustering and hashing [He+ 2013]
○ Gaussian mixture model clustering, iterative quantization [Gong+ 2013]
○ Minimal-loss hashing [Norouzi & Fleet 2011]
These methods can only model linear boundaries between
different representations.
14. Related work | Clustering & Hash Learning
● Methods that can model the non-linearity of data
○ Kernel-based methods [Xu+ 2014; Kulis & Darrell 2009]
○ Spectral clustering [Xu+ 2014; Kulis & Darrell 2009]
They are difficult to scale to large datasets.
15. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Clustering
■ Jointly learn feature representations and
cluster assignments [Xie+ 2016]
■ Model the data generation process by using deep
generative models with Gaussian mixture models as
the prior distribution [Dilokthanakul+ 2016; Zheng+ 2016]
16. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Hash learning
■ Supervised hash learning
[Xia+ 2014; Lai+ 2015; Zhang+ 2015; Xu+2015; Li+ 2015]
■ Unsupervised hash learning
● Stacked RBM [Salakhutdinov & Hinton 2009]
● Use deep learning for the mapping function [Erin Liong+ 2015]
17. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Hash learning
■ These unsupervised methods did not explicitly
intend to impose invariance on the learned
representations.
■ The predicted representations may not be useful
for applications of interest.
18. Related work | Data Augmentation
● About data augmentation
○ In supervised and semi-supervised learning
■ Applying data augmentation to a supervised learning problem
is equivalent to adding a regularization to the original cost
function. [Leen 1995]
■ Applying data augmentation to semi-supervised learning
achieves state-of-the-art performance.
[Bachman+ 2014; Miyato+ 2016; Sajjadi+ 2016]
19. Related work | Data Augmentation
● About data augmentation
○ In unsupervised learning
■ Proposed to use data augmentation to model the invariance
of learned representations. [Dosovitskiy+ 2014]
20. Related work | Data Augmentation
● Difference between Dosovitskiy+ and IMSAT
○ IMSAT directly imposes the invariance on the learned
representations
■ Dosovitskiy+ impose invariance on surrogate classes, not
directly on the learned representations.
○ IMSAT focuses on learning discrete representations that are directly
usable for clustering and hash learning
■ Dosovitskiy+ focused on learning continuous representations
that are then used for other tasks such as classification and
clustering.
21. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
23. Method | about RIM
The RIM [Gomes+ 2010] learns a probabilistic classifier p_θ(y|x) such that the
mutual information [Cover & Thomas 2012] between inputs and cluster
assignments is maximized. At the same time, it regularizes the complexity of the
classifier. Let X and Y ∈ {0, …, K − 1} denote random variables for data and
cluster assignments, respectively, where K is the number of clusters.
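As a compact reference, the RIM objective in this notation (following Gomes+ 2010
as described in the IMSAT paper) is

    min_θ  R(θ) − λ · I(X; Y)

where R(θ) is a regularization penalty on the classifier, I(X; Y) is the mutual
information between data X and cluster assignments Y, and λ trades off the two
terms. IMSAT replaces R(θ) with the SAT penalty R_SAT(θ; T) introduced later.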
24. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
26. Method | about IMSAT
● Information maximization for learning discrete representations
Extend the RIM and consider learning M-dimensional discrete representations of
data. Let the output domain be Y = Y1 × … × YM, where Ym = {0, 1, …, Vm − 1}
and Vm is the number of values the m-th output can take. Let Y = (Y1, …, YM) be
a random variable for the discrete representation.
27. Method | about IMSAT
● Information maximization for learning discrete representations
The goal is to learn a multi-output probabilistic classifier p_θ(y1|x), …, p_θ(yM|x)
that maps similar inputs into similar representations, where the conditional
probability is modeled by a deep neural network.
Under the model, the outputs are conditionally independent given x:
p_θ(y1, …, yM | x) = ∏m p_θ(ym | x)
28. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
30. Method | about IMSAT
● Regularization of deep neural networks via SAT
SAT uses data augmentation to impose the intended invariance on the data
representation. Let T(x) denote a pre-defined data augmentation under
which the data representations should be invariant. The regularization of SAT made
on data point x is
R_SAT(θ; x, T(x)) = − Σm Σym p_θ̂(ym | x) log p_θ(ym | T(x)),
where p_θ̂(ym | x) is the prediction on the original data point x (with the current
parameters θ̂ held fixed) and p_θ(ym | T(x)) is the prediction on the augmented
data point T(x).
33. Method | about IMSAT
● Regularization of deep neural networks via SAT
The regularization by SAT is then the average of R_SAT(θ; x, T(x)) over all the
training data points:
R_SAT(θ; T) = (1/N) Σn R_SAT(θ; xn, T(xn))
When the augmentation adds a small perturbation r, the augmentation function T
can be expressed as T(x) = x + r.
34. Method | about IMSAT
● Regularization of deep neural networks via SAT
Two representative regularization methods based on local perturbations:
● Random Perturbation Training (RPT) [Bachman+ 2014]
● Virtual Adversarial Training (VAT) [Miyato+ 2016]
In VAT, the perturbation r is chosen in an adversarial direction:
r = argmax over { r' : ‖r'‖2 ≤ ε } of R_SAT(θ̂; x, x + r')
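A minimal PyTorch-style sketch of the SAT penalty and the RPT/VAT perturbations
described above (not the authors' code; the names `classifier`, `eps`, and `xi` are
illustrative, and VAT is shown with a simple one-step gradient approximation rather
than the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def _normalize(r, eps=1e-12):
    """Scale each row (one perturbation per sample, x of shape (batch, d)) to unit L2 norm."""
    return r / r.norm(dim=1, keepdim=True).clamp_min(eps)

def sat_penalty(classifier, x, x_aug):
    """R_SAT: cross-entropy between predictions on the original and augmented points.
    The prediction on the original point is treated as a fixed target."""
    with torch.no_grad():
        target = F.softmax(classifier(x), dim=1)        # p_theta_hat(y | x)
    log_pred = F.log_softmax(classifier(x_aug), dim=1)  # log p_theta(y | T(x))
    return -(target * log_pred).sum(dim=1).mean()

def rpt_augment(x, eps=1.0):
    """Random Perturbation Training: T(x) = x + r with a random unit direction r."""
    return x + eps * _normalize(torch.randn_like(x))

def vat_augment(classifier, x, eps=1.0, xi=10.0):
    """Virtual Adversarial Training: pick r along the direction that most increases
    the SAT penalty (one-step gradient approximation of the adversarial direction)."""
    d = xi * _normalize(torch.randn_like(x))
    d.requires_grad_(True)
    grad = torch.autograd.grad(sat_penalty(classifier, x, x + d), d)[0]
    return (x + eps * _normalize(grad)).detach()
```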
35. Method | for Clustering
In clustering, we can directly apply the RIM.
By representing mutual information as the difference between marginal entropy and
conditional entropy [Cover & Thomas 2012], we have the objective to minimize:
R_SAT(θ; T) − λ [ H(Y) − H(Y|X) ]
The two entropy terms can be calculated as
H(Y) = h( (1/N) Σi p_θ(y | xi) ),   H(Y|X) = (1/N) Σi h( p_θ(y | xi) )
36. Method | for Clustering
Here, h is the entropy function
h(p) = − Σy p(y) log p(y)
● Increasing the marginal entropy H(Y)
○ Encourages the cluster sizes to be uniform
● Decreasing the conditional entropy H(Y|X)
○ Encourages unambiguous cluster assignments [Bridle+ 1991]
Previous research shows that we can incorporate prior knowledge on the
cluster sizes by modifying H(Y). [Gomes+ 2010]
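A short sketch (illustrative names, not the authors' code) of how the two entropy
terms can be estimated from the softmax outputs of a mini-batch; this is also where
the mini-batch marginal approximation from slide 41 enters:

```python
import torch

def entropy(p, eps=1e-12):
    """h(p) = -sum_y p(y) log p(y), computed along the last dimension."""
    return -(p * (p + eps).log()).sum(dim=-1)

def clustering_entropy_terms(probs):
    """probs: (batch, K) softmax outputs p_theta(y | x).
    Returns mini-batch estimates of H(Y) and H(Y|X)."""
    p_marginal = probs.mean(dim=0)       # p_theta(y) estimated on the batch
    h_y = entropy(p_marginal)            # marginal entropy H(Y)
    h_y_given_x = entropy(probs).mean()  # conditional entropy H(Y|X)
    return h_y, h_y_given_x

# Clustering loss sketch: sat_loss - lam * (h_y - h_y_given_x)
```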
37. Method | for Clustering
H(Y) can be rewritten as follows:
H(Y) = − Σy p_θ(y) log p_θ(y) = log K − KL[ p_θ(y) ‖ U ],   U: uniform distribution
Maximization of H(Y) is therefore equivalent to minimization of the KL term, which
encourages the predicted cluster distribution p_θ(y) to be close to U.
We can replace U in the KL term with any specified class prior q(y), so that p_θ(y)
is encouraged to be close to q(y), and consider the following constrained
optimization problem:
min_θ R_SAT(θ; T) + λ H(Y|X)   s.t.   KL[ p_θ(y) ‖ q(y) ] ≤ δ
38. Method | for Hash Learning
Considering the output space of the augmented data, this gives us the SAT
regularization for hash learning.
It follows from the definition of interaction information and the conditional
independence of the hash bits given x that the pairwise interaction terms reduce
to the (negative) mutual information between pairs of hash bits.
39. Method | for Hash Learning
In hash learning, each data point is mapped into a D-bit binary code, so the
original RIM is not directly applicable:
the computation of the mutual information of a D-bit code is intractable for large
D because it involves a summation over an exponential number of terms.
[Brown 2009] shows that this mutual information can be expanded as a sum of
interaction information terms over subsets of the hash bits.
40. Method | for Hash Learning
In summary, the approximated objective to minimize is
R_SAT(θ; T) − λ [ Σd I(X; Yd) − Σd≠d' I(Yd; Yd') ]
● First term
○ Regularizes the neural network
● Second term
○ Maximizes the mutual information between data and each hash bit
● Third term
○ Removes the redundancy among the hash bits
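A sketch of how the second and third terms could be estimated from the sigmoid
outputs of a mini-batch, using the conditional independence of the hash bits given x
(an illustrative snippet, not the authors' implementation):

```python
import torch

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) variable, element-wise."""
    return -(p * (p + eps).log() + (1 - p) * (1 - p + eps).log())

def per_bit_mi(q):
    """q: (batch, D) sigmoid outputs q[i, d] = p_theta(y_d = 1 | x_i).
    Returns I(X; Y_d) = H(Y_d) - H(Y_d | X) for each bit d (batch estimates)."""
    return binary_entropy(q.mean(dim=0)) - binary_entropy(q).mean(dim=0)

def pairwise_bit_mi(q, eps=1e-12):
    """I(Y_d; Y_d') for all bit pairs, with p(y_d, y_d' | x) = p(y_d | x) p(y_d' | x)."""
    n, D = q.shape
    p1 = q.mean(dim=0)                               # marginal p(y_d = 1)
    mi = torch.zeros(D, D)
    for a, qa in ((1, q), (0, 1 - q)):
        for b, qb in ((1, q), (0, 1 - q)):
            joint = (qa.t() @ qb) / n                # p(y_d = a, y_d' = b)
            marg = (p1 if a else 1 - p1)[:, None] * (p1 if b else 1 - p1)[None, :]
            mi = mi + joint * ((joint + eps).log() - (marg + eps).log())
    return mi  # only the off-diagonal entries (d != d') enter the objective

# Hash loss sketch:
# loss = sat_loss - lam * (per_bit_mi(q).sum()
#                          - (pairwise_bit_mi(q).sum() - pairwise_bit_mi(q).diag().sum()))
```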
41. Method | Marginal Distribution
Computing the mutual information requires the marginal distribution p_θ(y), which
is defined over the entire dataset and is therefore not suited to mini-batch SGD.
We instead use the following approximation, estimating the marginal within each
mini-batch B:
p_θ(y) ≈ (1/|B|) Σx∈B p_θ(y | x)
In the case of clustering, the approximated objective that we actually minimize is an
upper bound of the exact objective that we try to minimize.
42. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
45. Experiments | about the implementation
● Clustering
○ Set the network dimensionality to d-1200-1200-M
○ Use softmax as the output layer
● Hash learning
○ Use smaller network sizes to ensure fast computation of mapping
data into hash codes (shown later).
○ Use sigmoid as the output layer
● Use Adam, ReLU, BatchNorm
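A possible PyTorch rendering of the d–1200–1200–M network described above; the
layer sizes come from the slide, while details such as the exact placement of
BatchNorm and ReLU are assumptions:

```python
import torch
import torch.nn as nn

def build_imsat_net(d, out_dim):
    """d-1200-1200-M multilayer perceptron with ReLU and BatchNorm.
    out_dim is the number of clusters (softmax head) or hash bits (sigmoid head)."""
    return nn.Sequential(
        nn.Linear(d, 1200), nn.ReLU(), nn.BatchNorm1d(1200),
        nn.Linear(1200, 1200), nn.ReLU(), nn.BatchNorm1d(1200),
        nn.Linear(1200, out_dim),  # logits; apply softmax or sigmoid downstream
    )

# Example: MNIST clustering with K = 10 clusters
net = build_imsat_net(d=784, out_dim=10)
optimizer = torch.optim.Adam(net.parameters())
```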
52. Experiments | hash learning
● About dataset
○ MNIST / CIFAR-10
● About baseline models
○ Spectral hashing [Weiss+ 2009]
○ PCA-ITQ [Gong+ 2013]
○ Deep Hash [Erin Liong+ 2015]
○ Linear RIM / Deep RIM / IMSAT(VAT)
53. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
55. Experiments | hash learning
● About evaluation metrics
○ Mean Average Precision (mAP)
○ Precision at N = 500 samples
○ Hamming distance
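For reference, a generic way to compute precision at N under Hamming-distance
ranking of binary codes (an illustrative snippet, not the authors' evaluation script):

```python
import numpy as np

def precision_at_n(query_codes, db_codes, query_labels, db_labels, n=500):
    """query_codes, db_codes: 0/1 arrays of shape (num_queries, D) and (num_db, D)."""
    precisions = []
    for code, label in zip(query_codes, query_labels):
        dist = (db_codes != code).sum(axis=1)        # Hamming distance to every DB code
        top_n = np.argsort(dist, kind="stable")[:n]  # N nearest items
        precisions.append(float((db_labels[top_n] == label).mean()))
    return float(np.mean(precisions))
```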
56. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
57. Conclusion | IMSAT
● Proposed “IMSAT”
○ Information theoretic method for unsupervised discrete
representation learning using deep neural networks
● Directly introduces invariance to data augmentation in
an end-to-end fashion
○ Learns discrete representations that are robust to small
perturbations and affine transformations