1) Linear factor models represent observed data vectors as a linear combination of latent factors plus noise. They include probabilistic principal component analysis (PCA) and factor analysis.
2) Independent component analysis learns components that are closer to statistically independent than the raw features, and can separate signals like voices or EEG signals.
3) Sparse coding finds a sparse representation of data by solving an optimization problem that trades off a sparsity penalty on the code (its L1 norm) against reconstruction error, producing sparse codes.
1. Linear Factor Models
Lecture slides for Chapter 13 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
2. (Goodfellow 2016)
Linear Factor Models
The noise is typically Gaussian and diagonal (independent across dimensions), as illustrated in figure 13.1.

x = W h + b + noise

Figure 13.1: The directed graphical model describing the linear factor model family, in which we assume that an observed data vector x is obtained by a linear combination of latent factors h, plus some noise. Different models, such as probabilistic PCA, factor analysis and ICA, make different choices about the form of the noise and of the prior p(h).
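As a concrete illustration of the generative process x = W h + b + noise, here is a minimal numpy sketch; the dimensions, loading matrix, noise scales and the standard Gaussian prior over h are toy assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 3 observed variables, 2 latent factors.
n_obs, n_latent, n_samples = 3, 2, 1000

W = rng.normal(size=(n_obs, n_latent))   # factor loadings
b = rng.normal(size=n_obs)               # bias / mean offset

# Latent factors drawn from the prior p(h); here a standard Gaussian.
h = rng.normal(size=(n_samples, n_latent))

# Diagonal (per-dimension independent) Gaussian noise.
noise_std = np.array([0.1, 0.2, 0.15])
noise = rng.normal(size=(n_samples, n_obs)) * noise_std

# x = W h + b + noise
x = h @ W.T + b + noise
print(x.shape)  # (1000, 3)
```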
3. (Goodfellow 2016)
Probabilistic PCA and Factor Analysis
• Linear factor model
• Gaussian prior
• Extends PCA
• Given an input, yields a distribution over codes, rather than a single code
• Estimates a probability density function
• Can generate samples
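To make the "distribution over codes" point concrete, below is a small numpy sketch of the standard probabilistic PCA posterior p(h | x) = N(M⁻¹Wᵀ(x − b), σ²M⁻¹) with M = WᵀW + σ²I (the Tipping & Bishop closed form); this formula is not shown on the slide, and the toy values are assumptions.

```python
import numpy as np

def ppca_posterior(x, W, b, sigma2):
    """Posterior p(h | x) for probabilistic PCA:
    mean = M^{-1} W^T (x - b), cov = sigma^2 M^{-1}, with M = W^T W + sigma^2 I."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    M_inv = np.linalg.inv(M)
    mean = M_inv @ W.T @ (x - b)
    cov = sigma2 * M_inv
    return mean, cov

# Toy example (all values are arbitrary assumptions).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
b = np.zeros(5)
x = rng.normal(size=5)
mean, cov = ppca_posterior(x, W, b, sigma2=0.1)
print(mean, cov)   # a full distribution over codes, not a single code
```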
4. (Goodfellow 2016)
Independent Component Analysis
• Factorial but non-Gaussian prior
• Learns components that are closer to statistically independent than the raw features
• Can be used to separate voices of n speakers recorded by n microphones, or to separate multiple EEG signals
• Many variants, some more probabilistic than others
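A minimal sketch of the "n speakers, n microphones" example using scikit-learn's FastICA (one of the many ICA variants the slide mentions); the synthetic sources and mixing matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two "source" signals (stand-ins for two speakers).
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Two "microphones", each recording a different linear mixture.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T

# Recover components that are close to statistically independent.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # estimated sources (up to permutation and scale)
```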
5. (Goodfellow 2016)
Slow Feature Analysis
• Learn features that change gradually over time
• SFA algorithm does so in closed form for a linear model
• Deep SFA by composing many models with fixed feature expansions, like quadratic feature expansion
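A sketch of the closed-form linear SFA step the slide refers to: whiten the signal, then find the directions along which the whitened signal changes most slowly (an eigendecomposition of the covariance of temporal differences). The toy time series is an assumption; applying a fixed quadratic feature expansion before linear_sfa would give one layer of the "deep SFA" construction mentioned above.

```python
import numpy as np

def linear_sfa(X, n_features):
    """Closed-form linear Slow Feature Analysis (sketch).
    X: (T, d) time series. Returns the slowest-varying projected features."""
    # 1) Center and whiten the data.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    Z = Xc @ whitener
    # 2) Temporal differences approximate each feature's rate of change.
    dZ = np.diff(Z, axis=0)
    # 3) Directions where dZ has the smallest variance change most slowly.
    dcov = np.cov(dZ, rowvar=False)
    dval, dvec = np.linalg.eigh(dcov)     # eigenvalues in ascending order
    P = dvec[:, :n_features]
    return Z @ P                          # slow features over time

# Toy usage: a noisy 3-D signal whose slowest component is a low-frequency sine.
t = np.linspace(0, 10, 1000)
X = np.c_[np.sin(0.5 * t), np.sin(5 * t),
          np.random.default_rng(0).normal(size=1000)]
slow = linear_sfa(X, n_features=1)
```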
6. (Goodfellow 2016)
Sparse Coding
Like most other linear factor models, it uses a linear decoder plus noise to obtain a reconstruction of x, as specified in equation 13.2. More specifically, sparse coding models typically assume that the linear factors have Gaussian noise with isotropic precision β:

p(x | h) = N(x; W h + b, (1/β) I).    (13.12)

The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized Student-t distributions. For example, the Laplace prior parametrized in terms of the sparsity penalty coefficient λ is given by

p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) exp(−(1/2) λ |h_i|),    (13.13)

and the Student-t prior by

p(h_i) ∝ 1 / (1 + h_i²/ν)^((ν+1)/2).    (13.14)

Sparse coding does not use a parametric encoder. Instead, the encoder is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value:

h* = f(x) = arg max_h p(h | x).    (13.15)

Combined with equation 13.13 and equation 13.12, this yields the following optimization problem:

arg max_h p(h | x)    (13.16)
= arg max_h log p(h | x)    (13.17)
= arg min_h λ ||h||_1 + β ||x − W h||²_2,    (13.18)

where we have dropped terms not depending on h and divided by positive scaling factors.
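The arg-min in equation 13.18 has no closed form in general; below is a minimal numpy sketch that solves it with ISTA (iterative shrinkage-thresholding), one common choice of solver rather than anything prescribed by the text. For simplicity the β weighting and the factor of 1/2 are folded into the single penalty lam, and the dictionary W and signal x are toy assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, W, lam=0.1, n_iter=200):
    """ISTA sketch for the MAP inference
    h* = arg min_h  lam * ||h||_1 + 0.5 * ||x - W h||_2^2."""
    L = np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the gradient
    step = 1.0 / L
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = -W.T @ (x - W @ h)          # gradient of the quadratic term
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Toy usage (dictionary and signal are arbitrary assumptions).
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
x = rng.normal(size=20)
h_star = sparse_code_ista(x, W)
print(np.count_nonzero(h_star), "nonzero code entries out of", h_star.size)
```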
7. (Goodfellow 2016)
Sparse Coding
Figure 13.2: Example samples and weights from a spike and slab sparse coding model trained on the MNIST dataset. (Left) Samples from the model, which do not resemble the training examples. (Right) Weights learned by the model.
8. (Goodfellow 2016)
Manifold Interpretation of
PCA
The encoder computes a low-dimensional representation h. With the autoencoder view, we have a decoder computing the reconstruction

x̂ = g(h) = b + V h.    (13.20)

Figure 13.3: Flat Gaussian capturing probability concentration near a low-dimensional manifold. The figure shows the upper half of the "pancake" above the "manifold plane" which goes through its middle. The variance in the direction orthogonal to the manifold is very small and can be thought of as noise, while the variances within the manifold plane are large and correspond to signal.
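A small numpy sketch of the autoencoder view of PCA: the decoder is equation 13.20, and the encoder used here (projection of x − b onto the top principal directions) is the standard PCA encoder, included as an assumption since only the decoder appears above. The toy data are generated near a 2-D plane, so the reconstruction error stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data concentrated near a 2-D plane in 5-D space (arbitrary assumption).
X = (rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5))
     + 0.01 * rng.normal(size=(500, 5)))

b = X.mean(axis=0)
# Principal directions from the SVD of the centered data.
_, _, Vt = np.linalg.svd(X - b, full_matrices=False)
V = Vt[:2].T                      # decoder weights: top-2 principal directions

# Encoder: low-dimensional representation h (assumed PCA projection).
H = (X - b) @ V
# Decoder, as in eq. 13.20: x_hat = g(h) = b + V h
X_hat = b + H @ V.T

print(np.mean((X - X_hat) ** 2))  # small reconstruction error near the manifold
```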