IMSAT is an unsupervised learning method that uses information maximization and self-augmented training to learn discrete representations from data. It trains a deep neural network to map inputs to discrete outputs while maximizing the mutual information between them. Self-augmented training encourages the representations of augmented data points to be similar to those of the original points, imposing invariance to perturbations. Experiments on MNIST clustering and hashing show state-of-the-art results, demonstrating that IMSAT can learn representations that are robust to various transformations.
Information Maximizing Self-Augmented Training for Unsupervised Discrete Representation Learning
1.
Learning Discrete Representations via
Information Maximizing
Self-Augmented Training
Weihua Hu, Takeru Miyato, Seiya Tokui,
Eiichi Matsumoto, Masashi Sugiyama
Intelligent Information Processing II
Nov 20, 2017
University of Tokyo, RIKEN AIP, Preferred Networks, Inc.
Proceedings of the 34th International Conference on Machine Learning
Presented by Shunsuke KITADA
2. Why I chose this paper
● Achieves high accuracy (98%!) on MNIST classification
with unsupervised learning.
● From the University of Tokyo (Sugiyama lab)
and Preferred Networks.
● Uses VAT as an effective regularization term.
● Accepted at ICML 2017.
3. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
5. Introduction
● Unsupervised discrete representation learning
○ To obtain a function that maps similar (or dissimilar) data into
similar (or dissimilar) discrete representations.
○ The similarity of data is defined according to the application of
interest.
6. Introduction
● Clustering and Hash learning
○ Clustering
■ Widely applied to data-driven application
domains. [Berkhin 2006]
○ Hash learning
■ Popular for approximate nearest neighbor search in
large-scale information retrieval. [Wang+ 2016]
7. Introduction
● Development of deep neural networks
○ Scalability and flexibility
■ They can learn complex features and non-linear
decision boundaries.
○ Their model complexity is very large
■ Regularization of the networks is crucial for learning
meaningful representations of data.
8. Introduction
● In unsupervised representation learning
○ Target representations are not provided.
○ There are no constraining conditions.
➔ We need to regularize the networks in order to learn useful
representations that exhibit intended invariance for
applications of interest.
◆ e.g., invariance to small perturbations or affine transformations
9. Introduction | In this paper
● Use data augmentation to model the invariance of
learned data representations
○ Map data points into their discrete representations by a deep
neural network.
○ Regularize it by encouraging its prediction to be invariant to data
augmentation.
10.
● Self-Augmented Training
(SAT)
Encourage the predicted
representations of augmented data
points to be close to those of the original
data points in an end-to-end fashion.
● Regularized Information
Maximization (RIM)
Maximize information theoretic
dependency between inputs and their
mapped outputs, while regularizing the
mapping function.
Information Maximizing
Self-Augmented Training
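Putting the two together, IMSAT trains the classifier by minimizing the SAT penalty
while maximizing the mutual information between inputs and their discrete outputs.
In the notation used later in this deck, with trade-off weight λ, the combined
objective is

    min_θ  R_SAT(θ; T) − λ · I(X; Y)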
11. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
13. Related work | Clustering & Hash Learning
● The representative clustering and hashing methods
○ K-means clustering and hashing [He+ 2013]
○ Gaussian mixture model clustering, iterative quantization [Gong+ 2013]
○ Minimal-loss hashing [Norouzi & Fleet 2011]
These methods can only model linear boundaries between
different representations.
14. Related work | Clustering & Hash Learning
● Methods that can model the non-linearity of data
○ Kernel-based methods [Xu+ 2014; Kulis & Darrell 2009]
○ Spectral clustering [Xu+ 2014; Kulis & Darrell 2009]
They are difficult to scale to large datasets.
15. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Clustering
■ Jointly learn feature representations and
cluster assignments [Xie+ 2016]
■ Model the data generation process by using deep
generative models with Gaussian mixture models as
the prior distribution [Dilokthanakul+ 2016; Zheng+ 2016]
16. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Hash learning
■ Supervised hash learning
[Xia+ 2014; Lai+ 2015; Zhang+ 2015; Xu+2015; Li+ 2015]
■ Unsupervised hash learning
● Stacked RBM [Salakhutdinov & Hinton 2009]
● Use deep learning for the mapping function [Erin Liong+ 2015]
17. Related work | Clustering & Hash Learning
● Deep learning based approach
○ Hash learning
■ These unsupervised methods did not explicitly
intend to impose invariance on the learned
representations.
■ The predicted representations may not be useful
for applications of interest.
18. Related work | Data Augmentation
● About data augmentation
○ In supervised and semi-supervised learning
■ Applying data augmentation to a supervised learning problem
is equivalent to adding a regularization to the original cost
function. [Leen 1995]
■ Applying data augmentation to semi-supervised learning
achieves state-of-the-art performance.
[Bachman+ 2014; Miyato+ 2016; Sajjadi+ 2016]
19. Related work | Data Augmentation
● About data augmentation
○ In unsupervised learning
■ Proposed to use data augmentation to model the invariance
of learned representations. [Dosovitskiy+ 2014]
20. Related work | Data Augmentation
● Difference between Dosovitskiy+ and IMSAT
○ IMSAT directly imposes the invariance on the learned
representations
■ Dosovitskiy+ impose invariance on surrogate classes, not
directly on the learned representations.
○ IMSAT focuses on learning discrete representations that are directly
usable for clustering and hash learning
■ Dosovitskiy+ focused on learning continuous representations
that are then used for other tasks such as classification and
clustering.
21. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
23. Method | about RIM
The RIM [Gomes+ 2010] learns a probabilistic classifier p_θ(y|x) such that the
mutual information [Cover & Thomas 2012] between inputs and cluster
assignments is maximized. At the same time, it regularizes the complexity of the
classifier. Let X and Y ∈ {0, …, K − 1} denote random variables for data and
cluster assignments, respectively, where K is the number of clusters.
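As a compact reference, the RIM objective in this notation (following Gomes+ 2010
as described in the IMSAT paper) is

    min_θ  R(θ) − λ · I(X; Y)

where R(θ) is a regularization penalty on the classifier, I(X; Y) is the mutual
information between data X and cluster assignments Y, and λ trades off the two
terms. IMSAT replaces R(θ) with the SAT penalty R_SAT(θ; T) introduced later.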
24. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
26. Method | about IMSAT
● Information maximization for learning discrete representations
Extend the RIM and consider learning M-dimensional discrete representations of
data. Let the output domain be Y = Y1 × … × YM, where Ym = {0, 1, …, Vm − 1}
and Vm is the number of values the m-th output can take. Let Y = (Y1, …, YM) be
a random variable for the discrete representation.
27. Method | about IMSAT
● Information maximization for learning discrete representations
The goal is to learn a multi-output probabilistic classifier p_θ(y1|x), …, p_θ(yM|x)
that maps similar inputs into similar representations, where the conditional
probability is modeled by a deep neural network.
Under the model, the outputs are conditionally independent given x:
p_θ(y1, …, yM | x) = ∏m p_θ(ym | x)
28. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
30. Method | about IMSAT
● Regularization of deep neural networks via SAT
SAT uses data augmentation to impose the intended invariance on the data
representation. Let T(x) denote a pre-defined data augmentation under
which the data representations should be invariant. The regularization of SAT made
on data point x is
R_SAT(θ; x, T(x)) = − Σm Σym p_θ̂(ym | x) log p_θ(ym | T(x)),
where p_θ̂(ym | x) is the prediction on the original data point x (with the current
parameters θ̂ held fixed) and p_θ(ym | T(x)) is the prediction on the augmented
data point T(x).
33. Method | about IMSAT
● Regularization of deep neural networks via SAT
The regularization by SAT is then the average of R_SAT(θ; x, T(x)) over all the
training data points:
R_SAT(θ; T) = (1/N) Σn R_SAT(θ; xn, T(xn))
When the augmentation adds a small perturbation r, the augmentation function T
can be expressed as T(x) = x + r.
34. Method | about IMSAT
● Regularization of deep neural networks via SAT
Two representative regularization methods based on local perturbations:
● Random Perturbation Training (RPT) [Bachman+ 2014]
● Virtual Adversarial Training (VAT) [Miyato+ 2016]
In VAT, the perturbation r is chosen in an adversarial direction:
r = argmax over { r' : ‖r'‖2 ≤ ε } of R_SAT(θ̂; x, x + r')
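A minimal PyTorch-style sketch of the SAT penalty and the RPT/VAT perturbations
described above (not the authors' code; the names `classifier`, `eps`, and `xi` are
illustrative, and VAT is shown with a simple one-step gradient approximation rather
than the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def _normalize(r, eps=1e-12):
    """Scale each row (one perturbation per sample, x of shape (batch, d)) to unit L2 norm."""
    return r / r.norm(dim=1, keepdim=True).clamp_min(eps)

def sat_penalty(classifier, x, x_aug):
    """R_SAT: cross-entropy between predictions on the original and augmented points.
    The prediction on the original point is treated as a fixed target."""
    with torch.no_grad():
        target = F.softmax(classifier(x), dim=1)        # p_theta_hat(y | x)
    log_pred = F.log_softmax(classifier(x_aug), dim=1)  # log p_theta(y | T(x))
    return -(target * log_pred).sum(dim=1).mean()

def rpt_augment(x, eps=1.0):
    """Random Perturbation Training: T(x) = x + r with a random unit direction r."""
    return x + eps * _normalize(torch.randn_like(x))

def vat_augment(classifier, x, eps=1.0, xi=10.0):
    """Virtual Adversarial Training: pick r along the direction that most increases
    the SAT penalty (one-step gradient approximation of the adversarial direction)."""
    d = xi * _normalize(torch.randn_like(x))
    d.requires_grad_(True)
    grad = torch.autograd.grad(sat_penalty(classifier, x, x + d), d)[0]
    return (x + eps * _normalize(grad)).detach()
```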
35. Method | for Clustering
In clustering, we can directly apply the RIM.
By representing mutual information as the difference between marginal entropy and
conditional entropy [Cover & Thomas 2012], we have the objective to minimize:
R_SAT(θ; T) − λ [ H(Y) − H(Y|X) ]
The two entropy terms can be calculated as
H(Y) = h( (1/N) Σi p_θ(y | xi) ),   H(Y|X) = (1/N) Σi h( p_θ(y | xi) )
36. Method | for Clustering
Here, h is the entropy function
h(p) = − Σy p(y) log p(y)
● Increasing the marginal entropy H(Y)
○ Encourages the cluster sizes to be uniform
● Decreasing the conditional entropy H(Y|X)
○ Encourages unambiguous cluster assignments [Bridle+ 1991]
Previous research shows that we can incorporate prior knowledge on the
cluster sizes by modifying H(Y). [Gomes+ 2010]
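A short sketch (illustrative names, not the authors' code) of how the two entropy
terms can be estimated from the softmax outputs of a mini-batch; this is also where
the mini-batch marginal approximation from slide 41 enters:

```python
import torch

def entropy(p, eps=1e-12):
    """h(p) = -sum_y p(y) log p(y), computed along the last dimension."""
    return -(p * (p + eps).log()).sum(dim=-1)

def clustering_entropy_terms(probs):
    """probs: (batch, K) softmax outputs p_theta(y | x).
    Returns mini-batch estimates of H(Y) and H(Y|X)."""
    p_marginal = probs.mean(dim=0)       # p_theta(y) estimated on the batch
    h_y = entropy(p_marginal)            # marginal entropy H(Y)
    h_y_given_x = entropy(probs).mean()  # conditional entropy H(Y|X)
    return h_y, h_y_given_x

# Clustering loss sketch: sat_loss - lam * (h_y - h_y_given_x)
```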
37. Method | for Clustering
H(Y) can be rewritten as follows:
H(Y) = − Σy p_θ(y) log p_θ(y) = log K − KL[ p_θ(y) ‖ U ],   U: uniform distribution
Maximization of H(Y) is therefore equivalent to minimization of the KL term, which
encourages the predicted cluster distribution p_θ(y) to be close to U.
We can replace U in the KL term with any specified class prior q(y), so that p_θ(y)
is encouraged to be close to q(y), and consider the following constrained
optimization problem:
min_θ R_SAT(θ; T) + λ H(Y|X)   s.t.   KL[ p_θ(y) ‖ q(y) ] ≤ δ
38. Method | for Hash Learning
Considering the output space of the augmented data, this gives us the SAT
regularization for hash learning.
It follows from the definition of interaction information and the conditional
independence of the hash bits given x that the pairwise interaction terms reduce
to the (negative) mutual information between pairs of hash bits.
39. Method | for Hash Learning
In hash learning, each data point is mapped into a D-bit binary code, so the
original RIM is not directly applicable:
the computation of the mutual information of a D-bit code is intractable for large
D because it involves a summation over an exponential number of terms.
[Brown 2009] shows that this mutual information can be expanded as a sum of
interaction information terms over subsets of the hash bits.
40. Method | for Hash Learning
In summary, the approximated objective to minimize is
R_SAT(θ; T) − λ [ Σd I(X; Yd) − Σd≠d' I(Yd; Yd') ]
● First term
○ Regularizes the neural network
● Second term
○ Maximizes the mutual information between data and each hash bit
● Third term
○ Removes the redundancy among the hash bits
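A sketch of how the second and third terms could be estimated from the sigmoid
outputs of a mini-batch, using the conditional independence of the hash bits given x
(an illustrative snippet, not the authors' implementation):

```python
import torch

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) variable, element-wise."""
    return -(p * (p + eps).log() + (1 - p) * (1 - p + eps).log())

def per_bit_mi(q):
    """q: (batch, D) sigmoid outputs q[i, d] = p_theta(y_d = 1 | x_i).
    Returns I(X; Y_d) = H(Y_d) - H(Y_d | X) for each bit d (batch estimates)."""
    return binary_entropy(q.mean(dim=0)) - binary_entropy(q).mean(dim=0)

def pairwise_bit_mi(q, eps=1e-12):
    """I(Y_d; Y_d') for all bit pairs, with p(y_d, y_d' | x) = p(y_d | x) p(y_d' | x)."""
    n, D = q.shape
    p1 = q.mean(dim=0)                               # marginal p(y_d = 1)
    mi = torch.zeros(D, D)
    for a, qa in ((1, q), (0, 1 - q)):
        for b, qb in ((1, q), (0, 1 - q)):
            joint = (qa.t() @ qb) / n                # p(y_d = a, y_d' = b)
            marg = (p1 if a else 1 - p1)[:, None] * (p1 if b else 1 - p1)[None, :]
            mi = mi + joint * ((joint + eps).log() - (marg + eps).log())
    return mi  # only the off-diagonal entries (d != d') enter the objective

# Hash loss sketch:
# loss = sat_loss - lam * (per_bit_mi(q).sum()
#                          - (pairwise_bit_mi(q).sum() - pairwise_bit_mi(q).diag().sum()))
```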
41. Method | Marginal Distribution
Computing the mutual information requires the marginal distribution p_θ(y), which
is defined over the entire dataset and is therefore not suited to mini-batch SGD.
We instead use the following approximation, estimating the marginal within each
mini-batch B:
p_θ(y) ≈ (1/|B|) Σx∈B p_θ(y | x)
In the case of clustering, the approximated objective that we actually minimize is an
upper bound of the exact objective that we try to minimize.
42. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
45. Experiments | about the implementation
● Clustering
○ Set the network dimensionality to d-1200-1200-M
○ Use softmax as the output layer
● Hash learning
○ Use smaller network sizes to ensure fast computation of mapping
data into hash codes (shown later).
○ Use sigmoid as the output layer
● Use Adam, ReLU, BatchNorm
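A possible PyTorch rendering of the d–1200–1200–M network described above; the
layer sizes come from the slide, while details such as the exact placement of
BatchNorm and ReLU are assumptions:

```python
import torch
import torch.nn as nn

def build_imsat_net(d, out_dim):
    """d-1200-1200-M multilayer perceptron with ReLU and BatchNorm.
    out_dim is the number of clusters (softmax head) or hash bits (sigmoid head)."""
    return nn.Sequential(
        nn.Linear(d, 1200), nn.ReLU(), nn.BatchNorm1d(1200),
        nn.Linear(1200, 1200), nn.ReLU(), nn.BatchNorm1d(1200),
        nn.Linear(1200, out_dim),  # logits; apply softmax or sigmoid downstream
    )

# Example: MNIST clustering with K = 10 clusters
net = build_imsat_net(d=784, out_dim=10)
optimizer = torch.optim.Adam(net.parameters())
```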
52. Experiments | hash learning
● About dataset
○ MNIST / CIFAR-10
● About baseline models
○ Spectral hashing [Weiss+ 2009]
○ PCA-ITQ [Gong+ 2013]
○ Deep Hash [Erin Liong+ 2015]
○ Linear RIM / Deep RIM / IMSAT(VAT)
53. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
55. Experiments | hash learning
● About evaluation metrics
○ Mean Average Precision (mAP)
○ Precision at N = 500 samples
○ Hamming distance
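For reference, a generic way to compute precision at N under Hamming-distance
ranking of binary codes (an illustrative snippet, not the authors' evaluation script):

```python
import numpy as np

def precision_at_n(query_codes, db_codes, query_labels, db_labels, n=500):
    """query_codes, db_codes: 0/1 arrays of shape (num_queries, D) and (num_db, D)."""
    precisions = []
    for code, label in zip(query_codes, query_labels):
        dist = (db_codes != code).sum(axis=1)        # Hamming distance to every DB code
        top_n = np.argsort(dist, kind="stable")[:n]  # N nearest items
        precisions.append(float((db_labels[top_n] == label).mean()))
    return float(np.mean(precisions))
```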
56. Contents
● Introduction
● Related work
● Method : IMSAT = IM + SAT
○ Information Maximization (IM)
○ Self-Augmented Training (SAT)
● Experiments
● Conclusion
57. Conclusion | IMSAT
● Proposed “IMSAT”
○ Information theoretic method for unsupervised discrete
representation learning using deep neural networks
● Directly introduces invariance to data augmentation in
an end-to-end fashion
○ Learns discrete representations that are robust to small
perturbations and affine transformations