2. (Goodfellow 2016)
Structure of an Autoencoder
Autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation, but is rarely used for machine learning applications.
Figure 14.1: The general structure of an autoencoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping the input x to the hidden layer, or code, h) and the decoder g (mapping h to the reconstruction r).
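A minimal sketch of this structure, assuming PyTorch and illustrative layer sizes (784-dimensional inputs, a 32-dimensional code); none of these choices come from the slides:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Deterministic autoencoder: r = g(f(x))."""
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        # encoder f: input x -> code h
        self.f = nn.Sequential(nn.Linear(n_in, n_code), nn.ReLU())
        # decoder g: code h -> reconstruction r
        self.g = nn.Linear(n_code, n_in)

    def forward(self, x):
        h = self.f(x)      # internal representation (code)
        return self.g(h)   # reconstruction r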
3. (Goodfellow 2016)
Stochastic Autoencoders
The output variables are often treated as conditionally independent given h, so that this probability distribution is inexpensive to evaluate, but some techniques, such as mixture density outputs, allow tractable modeling of outputs with correlations.
Figure 14.2: The structure of a stochastic autoencoder, in which both the encoder and the decoder are not simple functions but instead involve some noise injection, meaning that their output can be seen as sampled from a distribution: p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.
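One way such an autoencoder might look in code, under the assumption of a Gaussian encoder distribution and a factorial Bernoulli decoder distribution (both choices are illustrative, not prescribed by the slide):

import torch
import torch.nn as nn

class StochasticAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.enc_mu = nn.Linear(n_in, n_code)      # mean of p_encoder(h | x)
        self.enc_logvar = nn.Linear(n_in, n_code)  # log-variance of p_encoder(h | x)
        self.dec_logits = nn.Linear(n_code, n_in)  # logits of p_decoder(x | h)

    def forward(self, x):
        # sample h ~ p_encoder(h | x): noise enters through randn_like
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        h = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # return the distribution p_decoder(x | h); a reconstruction loss
        # would be -dist.log_prob(x).sum(-1).mean()
        return torch.distributions.Bernoulli(logits=self.dec_logits(h))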
4. (Goodfellow 2016)
Avoiding Trivial Identity
• Undercomplete autoencoders
• h has lower dimension than x
• f or g has low capacity (e.g., linear g)
• Must discard some information in h
• Overcomplete autoencoders
• h has higher dimension than x
• Must be regularized
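In code, the two regimes differ only in the width of the code layer; a sketch assuming a 784-dimensional input:

import torch.nn as nn

# Undercomplete: 32 < 784, so the code cannot store everything and
# some information must be discarded.
undercomplete_f = nn.Linear(784, 32)

# Overcomplete: 1024 > 784, so the encoder could learn to copy its
# input; useful only with regularization (e.g., the sparsity or
# contractive penalties on the following slides).
overcomplete_f = nn.Linear(784, 1024)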
6. (Goodfellow 2016)
Sparse Autoencoders
• Limit capacity of autoencoder by adding a term to the cost function penalizing the code for being large
• Special case of variational autoencoder
• Probabilistic model
• Laplace prior corresponds to L1 sparsity penalty
• Dirac variational posterior
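Spelling out the last two bullets (a standard derivation, written here rather than on the slide): a factorial Laplace prior p(h_i) = (λ/2) exp(−λ|h_i|) contributes

−log p(h) = Σ_i (λ|h_i| − log(λ/2)) = λ||h||₁ + const,

which is exactly an L1 sparsity penalty on the code. Choosing the Dirac variational posterior q(h | x) = δ(h − f(x)) then reduces the variational bound to reconstruction error plus this penalty evaluated at h = f(x).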
7. (Goodfellow 2016)
Denoising Autoencoder
C: corruption process (introduce noise)
Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.
The corruption process C(x̃ | x) is a conditional distribution over corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃).
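A minimal sketch of one training step for this graph, assuming Gaussian corruption and a Gaussian p_decoder (so that −log p_decoder reduces, up to constants, to squared error); model is any autoencoder computing g(f(·)), such as the Autoencoder class sketched earlier:

import torch
import torch.nn.functional as F

def dae_step(model, optimizer, x, sigma=0.1):
    x_tilde = x + sigma * torch.randn_like(x)  # sample C(x_tilde | x)
    r = model(x_tilde)                         # reconstruct from corrupted input
    loss = F.mse_loss(r, x)                    # compare to the *clean* x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()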
8. (Goodfellow 2016)
Denoising Autoencoders Learn
a Manifold
Figure 14.4: A denoising autoencoder is trained to map a corrupted data point x̃ back to the original data point x. We illustrate training examples x as red crosses lying near a low-dimensional manifold illustrated with the bold black line. We illustrate the corruption process C(x̃ | x) with a gray circle of equiprobable corruptions; the autoencoder learns a vector field g(f(x)) − x that points back toward the manifold.
9. (Goodfellow 2016)
Score Matching
• Score: ∇_x log p(x)
• Fit a density model by matching score of model to
score of data
• Some denoising autoencoders are equivalent to score
matching applied to some density models
Estimating the Score
Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).   (14.15)

Score matching is discussed further in section 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

An important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in figure 14.4. Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) with Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model.
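This estimate can be made quantitative (a detail from Alain and Bengio (2013), added here because it explains the arrows in figure 14.5): for squared-error training with small Gaussian corruption of variance σ²,

g(f(x)) − x ≈ σ² ∇_x log p_data(x),

so the reconstruction-minus-input vector is proportional to the estimated score.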
10. (Goodfellow 2016)
Vector Field Learned by a
Denoising Autoencoder
A generic encoder-decoder architecture can be made to estimate the score by training with the squared error criterion

||g(f(x̃)) − x||²   (14.16)

and corruption

C(x̃ = x̃ | x) = N(x̃; µ = x, Σ = σ²I)   (14.17)

with noise variance σ². See figure 14.5 for an illustration of how this works.
Figure 14.5: Vector field learned by a denoising autoencoder around a 1-D curved manifold near which the data concentrates in a 2-D space. Each arrow is proportional to the reconstruction-minus-input vector of the autoencoder and points toward higher probability according to the implicitly estimated probability distribution. The vector field has zeros at both maxima of the estimated density function (on the data manifold) and at minima of that density function.
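A sketch of how such a plot could be produced, assuming a trained two-dimensional autoencoder model (a hypothetical module mapping R² to R²): each arrow is just g(f(x)) − x evaluated on a grid.

import torch

def reconstruction_field(model, lo=-1.0, hi=1.0, n=20):
    xs = torch.linspace(lo, hi, n)
    grid = torch.cartesian_prod(xs, xs)   # (n*n, 2) query points
    with torch.no_grad():
        arrows = model(grid) - grid       # points toward higher density
    return grid, arrows                   # e.g., feed to matplotlib's quiver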
12. (Goodfellow 2016)
Learning a Collection of 0-D
Manifolds by Resisting Perturbation
Figure 14.7: If the autoencoder learns a reconstruction function r(x) that is invariant to small perturbations near the data points, it captures the manifold structure of the data. Here the manifold structure is a collection of 0-dimensional manifolds (the data points x₀, x₁, x₂). The dashed diagonal line indicates the identity function target for reconstruction; the optimal reconstruction function crosses the identity function wherever there is a data point.
13. (Goodfellow 2016)
Non-Parametric Manifold Learning
with Nearest-Neighbor Graphs
Figure 14.8: Non-parametric manifold learning procedures build a nearest neighbor graph in which nodes represent training examples and directed edges indicate nearest neighbor relationships. Various procedures can thereby obtain the tangent plane associated with a neighborhood of the graph, as well as a coordinate system that associates each training example with a real-valued vector position, or embedding.
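The first step of such procedures is simple to write down; a sketch using scikit-learn on synthetic data (the data, dimensionality, and k = 5 are all arbitrary illustrative choices):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.randn(100, 3)                 # 100 "training examples" in R^3
knn = NearestNeighbors(n_neighbors=6).fit(X)
_, idx = knn.kneighbors(X)                  # idx[i] holds i and its 5 neighbors
# directed edges of the nearest-neighbor graph, dropping each self-edge
edges = [(i, j) for i, row in enumerate(idx) for j in row[1:]]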
14. (Goodfellow 2016)
Tiling a Manifold with Local
Coordinate Systems
Figure 14.9: If the tangent planes (see figure 14.6) at each location are known, then they can be tiled to form a global coordinate system or a density function. Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake," with very small variance in the directions orthogonal to the pancake.
15. (Goodfellow 2016)
Contractive Autoencoders
The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = ||∂f(x)/∂x||²_F.   (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.

There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input.
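Equation 14.18 can be transcribed directly, if inefficiently, by computing the per-example Jacobian of the encoder; a sketch assuming an encoder f mapping a 1-D input tensor to a 1-D code tensor:

import torch

def contractive_penalty(f, x):
    """Omega(h) = squared Frobenius norm of df/dx, averaged over a batch."""
    penalty = 0.0
    for xi in x:  # one example at a time; clear but slow
        J = torch.autograd.functional.jacobian(f, xi)  # (code_dim, in_dim)
        penalty = penalty + (J ** 2).sum()             # sum of squared elements
    return penalty / x.shape[0]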
Figure 14.10: Illustration of tangent vectors of the manifold estimated by local PCA (no sharing across regions) and by a contractive autoencoder. The location on the manifold is defined by the input point, an image drawn from the CIFAR-10 dataset.