This document provides an overview of structured probabilistic models for deep learning. It discusses using graphs to describe model structure, sampling from graphical models, and the advantages of structured modeling over unstructured approaches. Key topics covered include directed and undirected models, separation properties, converting between graph representations, learning model structure, and using latent variables. The document serves as lecture slides outlining concepts in structured probabilistic modeling.
1. Structured Probabilistic Models for Deep Learning
Lecture slides for Chapter 16 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-10-04
2. (Goodfellow 2017)
Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
3. Tasks for Generative Models
• Density estimation
• Denoising
• Sample generation
• Missing value imputation
• Conditional sample generation
• Conditional density estimation
4. Samples from a BEGAN
(Berthelot et al, 2017)
Images are 128 pixels wide and 128 pixels tall, with R, G, and B values at each location.
5. Cost of Tabular Approach
Modeling P(x) by storing a lookup table requires k^n parameters, where k is the number of values per variable and n is the number of variables; this is not feasible for such high-dimensional data.
For BEGAN faces: k = 256 values per variable, and n = 128 × 128 = 16384 variables.
There are roughly ten to the power of forty thousand times more points in the discretized domain of the BEGAN face model than there are atoms in the universe.
6. Tabular Approach is Infeasible
• Memory: cannot store that many parameters
• Runtime: inference and sampling are both slow
• Statistical efficiency: extremely high number of
parameters requires extremely high number of
training examples
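To make the memory bullet concrete, the order of magnitude of the table size can be checked in a few lines (a sketch assuming the k = 256, n = 128 × 128 figures from the previous slide):

```python
import math

# Tabular model: one parameter per joint configuration of the variables.
k = 256          # values per variable (8-bit pixel intensities)
n = 128 * 128    # number of variables (one per pixel location)

# k**n is astronomically large; compute its order of magnitude instead.
digits = n * math.log10(k)
print(f"k**n has about 10^{digits:.0f} configurations")  # about 10^39457
```

This confirms the "ten to the power of forty thousand" figure on the previous slide.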
7. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
8. Insight of Model Structure
• Most variables influence each other
• Most variables do not influence each other directly
• Describe influence with a graph
• Edges represent direct influence
• Paths represent indirect influence
• Computational and statistical savings come from omitting edges
9. Directed Models
Figure 16.2: A directed graphical model depicting the relay race example. Alice's finishing time t0 influences Bob's finishing time t1, because Bob does not get to start running until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so Bob's finishing time t1 directly influences Carol's finishing time t2.

In the relay race example from section 16.1, suppose we name Alice's finishing time t0, Bob's finishing time t1, and Carol's finishing time t2. Our estimate of t1 depends on t0. Our estimate of t2 depends directly on t1, but only indirectly on t0. We can draw this relationship in a directed graphical model, illustrated in figure 16.2.

Formally, a directed graphical model defined on variables x is defined by a directed acyclic graph G whose vertices are the random variables in the model, and a set of local conditional probability distributions p(xi | PaG(xi)), where PaG(xi) gives the parents of xi in G. The probability distribution over x is given by

p(x) = ∏i p(xi | PaG(xi)). (16.1)

In our example, this means that, using the graph drawn in figure 16.2,

p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1). (16.2)

This is our first time seeing a structured probabilistic model in action. We can examine the cost of using it, to observe how structured modeling has many advantages over unstructured modeling.

Directed models work best when influence clearly flows in one direction.
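Equation 16.2 can be sketched with hypothetical conditional probability tables; the fast/slow discretization of the finishing times and all probability values below are invented for illustration:

```python
# Hypothetical CPTs for the relay race, each time discretized to fast/slow.
p_t0 = {"fast": 0.7, "slow": 0.3}
p_t1 = {"fast": {"fast": 0.8, "slow": 0.2},   # p(t1 | t0)
        "slow": {"fast": 0.4, "slow": 0.6}}
p_t2 = {"fast": {"fast": 0.9, "slow": 0.1},   # p(t2 | t1)
        "slow": {"fast": 0.3, "slow": 0.7}}

def joint(t0, t1, t2):
    # Equation 16.2: p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)
    return p_t0[t0] * p_t1[t0][t1] * p_t2[t1][t2]

print(joint("fast", "fast", "fast"))  # ≈ 0.504 (0.7 * 0.8 * 0.9)
```

Because each conditional table is normalized, the factorized joint sums to 1 automatically, with only 1 + 2 + 2 parameters instead of 2^3 − 1.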
10. Undirected Models
An undirected graph representing how your roommate's health hr, your health hy, and your work colleague's health hc affect each other ("Do you have a cold?", "Does your roommate have a cold?", "Does your work colleague have a cold?"). You and your roommate can infect each other with a cold, and you and your work colleague can infect each other, but your roommate and your colleague do not interact directly.

Undirected models work best when influence has no clear direction or is best modeled as flowing in both directions.
11. Undirected Models
An undirected graphical model is defined on an undirected graph G. For each clique C in the graph, a factor φ(C) (also called a clique potential) measures the affinity of the variables in that clique for each of their possible joint states. The factors are constrained to be nonnegative. Together they define an unnormalized probability distribution

p̃(x) = ∏(C∈G) φ(C). (16.3)

The unnormalized probability distribution is efficient to work with so long as all the cliques are small. It encodes the idea that states with higher affinity are more likely. However, unlike in a Bayesian network, there is little structure to the definition of the cliques, so there is nothing to guarantee that multiplying them together yields a valid probability distribution. See figure 16.4 for an example of reading factorization information from an undirected graph.

The Partition Function: while the unnormalized probability distribution is guaranteed to be nonnegative everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution, we must use the corresponding normalized probability distribution

p(x) = (1/Z) p̃(x), (16.4)

where Z is the value that results in the probability distribution summing or integrating to 1:

Z = ∫ p̃(x) dx. (16.5)

Here p̃(x) is the unnormalized probability and Z is the partition function.
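For a discrete model, Z in equation 16.5 is a sum over all joint states rather than an integral. A brute-force sketch for the three-variable cold example (the factor values φ below are invented, favoring agreement between connected people's health states):

```python
import itertools

# Hypothetical pairwise factors for the cold example: hr (roommate), hy (you),
# hc (colleague); 0 = healthy, 1 = cold. Higher affinity for matching states.
def phi(a, b):
    return 2.0 if a == b else 1.0

def p_tilde(hr, hy, hc):
    # Unnormalized probability (eq. 16.3): one factor per edge hr—hy and hy—hc.
    return phi(hr, hy) * phi(hy, hc)

# Partition function (eq. 16.5, discrete case): sum over all joint states.
Z = sum(p_tilde(*s) for s in itertools.product([0, 1], repeat=3))

def p(hr, hy, hc):
    return p_tilde(hr, hy, hc) / Z   # normalized probability (eq. 16.4)

print(Z)  # 18.0
```

Brute-force summation only works for tiny models; with n binary variables the sum has 2^n terms, which is exactly why the partition function is hard in general.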
12. Separation
Figure 16.6: (a) The path between random variable a and random variable b through s is active, because s is not observed. This means that a and b are not separated. (b) Here s is shaded in, to indicate that it is observed. Because the only path between a and b is through s, and that path is inactive, we can conclude that a and b are separated given s.

When s is not observed, influence can flow from a to b and vice versa through s. When s is observed, it blocks the flow of influence between a and b: they are separated.
13. Separation example
Figure 16.7: An example of reading separation properties from an undirected graph. Here b is shaded to indicate that it is observed. Because observing b blocks the only path from a to c, we say that a and c are separated from each other given b. The observation of b also blocks one path between a and d, but there is a second, active path between a and d. Therefore, a and d are not separated given b.

The nodes a and c are separated given b. One path between a and d is still active, though the other path is blocked, so these two nodes are not separated.
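Separation in an undirected graph reduces to graph reachability: two nodes are separated given the observed set exactly when every path between them passes through an observed node. A minimal sketch (the edge list used for figure 16.7 is an assumption about the figure's layout):

```python
from collections import deque

def separated(edges, start, goal, observed):
    """Breadth-first search that refuses to pass through observed nodes.

    In an undirected model, start and goal are separated given `observed`
    iff no path between them avoids every observed node.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier, seen = deque([start]), {start}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return False          # found an active path
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in observed:
                seen.add(nxt)
                frontier.append(nxt)
    return True

# Edges assumed for the graph in figure 16.7 (an assumption, not from the text).
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("a", "d")]
print(separated(edges, "a", "c", {"b"}))  # True: b blocks the only path
print(separated(edges, "a", "d", {"b"}))  # False: the direct edge stays active
```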
14. d-separation
The flow of influence is more complicated for directed models
The path between a and b is active for all of these graphs:
Figure 16.8: All the kinds of active paths of length two that can exist between random variables a and b. (a) Any path with arrows proceeding directly from a to b or vice versa. This kind of path becomes blocked if s is observed. We have already seen this kind of path in the relay race example. (b) Variables a and b are connected by a common cause s. This path is active if s is not observed, and blocked when s is observed. (c) Variables a and b are both parents of s (a V-structure). The path is blocked by default, but observing s activates it: this is the explaining away effect. (d) The path in a V-structure is also activated if any descendant c of s is observed.
15. d-separation example
Figure 16.9: From this graph, we can read out several d-separation properties. Examples include:
• a and b are d-separated given the empty set
• a and e are d-separated given c
• d and e are d-separated given c
We can also see that some variables are no longer d-separated when we observe some variables:
• a and b are not d-separated given c
• a and b are not d-separated given d
Observing variables can activate paths!
16. A complete graph can represent any probability distribution
Figure 16.10: Examples of complete graphs, which can describe any probability distribution. Here we show examples with four random variables. (Left) The complete undirected graph. In the undirected case, the complete graph is unique. (Right) A complete directed graph. In the directed case, there is not a unique complete graph. We choose an ordering of the variables and draw an arc from each variable to every variable that comes after it in the ordering.
The benefits of graphical models come from omitting edges
17. Converting between graphs
• Any specific probability distribution can be
represented by either an undirected or a directed
graph
• Some probability distributions have conditional
independences that one kind of graph fails to imply
(the distribution is simpler than the graph
describes; need to know the conditional probability
distributions to see the independences)
18. Converting directed to undirected
Figure 16.11: Examples of converting directed models (top row) to undirected models (bottom row), shown for graphs over variables a, b, c and for layered models with hidden units h1, h2, h3 and visible units v1, v2, v3.

Must add an edge between unconnected coparents.
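The conversion rule above ("moralization") can be sketched as two steps: connect every pair of unconnected coparents, then drop edge directions. The tiny DAG below is a hypothetical example:

```python
from itertools import combinations

def moralize(parents):
    """Convert a DAG (child -> list of parents) to its moralized undirected graph.

    Step 1: marry coparents, i.e. add an edge between every pair of parents
    that share a child. Step 2: drop edge directions (frozensets are unordered).
    """
    edges = set()
    for child, pas in parents.items():
        for pa in pas:
            edges.add(frozenset((pa, child)))   # original edge, now undirected
        for pa1, pa2 in combinations(pas, 2):
            edges.add(frozenset((pa1, pa2)))    # marry the coparents
    return edges

# Hypothetical DAG: a -> c <- b (a and b are unconnected coparents of c).
dag = {"c": ["a", "b"]}
und = moralize(dag)
print(frozenset(("a", "b")) in und)  # True: the coparents gain an edge
```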
19. Converting undirected to directed
Figure 16.12: Converting an undirected model to a directed model. (Left) This undirected model cannot be converted directly, because it has a loop of length four with no chords. Specifically, the undirected model encodes two different independences that no directed model can capture simultaneously: a⊥c | {b, d} and b⊥d | {a, c}. (Center) To convert the undirected model to a directed model, we must triangulate the graph, ensuring that all loops of length greater than three have a chord. (Right) Finally, we assign directions to the edges without creating any directed cycles.

No loops of length greater than three allowed! Add edges to triangulate long loops. Assign directions to edges; no directed cycles allowed.
20. Factor graphs are less ambiguous
Figure 16.13: An example of how a factor graph can resolve ambiguity in the interpretation of undirected networks. (Left) An undirected network with a clique involving three variables: a, b, and c. (Center) A factor graph corresponding to the same undirected model. This factor graph has one factor, f1, over all three variables. (Right) Another valid factor graph for the same undirected model. This factor graph has three factors, f1, f2, and f3, each over only two variables. Representation, inference, and learning are all asymptotically cheaper in this factor graph than in the factor graph depicted in the center, even though both correspond to the same undirected graph.

Undirected graph: is this three pairwise potentials or one potential over three variables? Factor graphs disambiguate by placing each potential explicitly in the graph.
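The asymptotic savings mentioned in the caption are easy to see by counting table entries, assuming k values per variable:

```python
# Tabular cost of each factorization of the clique {a, b, c},
# with k values per variable.
k = 10
one_big_factor = k ** 3       # a single factor f1(a, b, c)
three_pairwise = 3 * k ** 2   # f1(a, b), f2(b, c), f3(a, c)
print(one_big_factor, three_pairwise)  # 1000 300
```

The gap widens as k grows: the big factor costs k^3 while the pairwise version stays at 3k^2.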
21. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
22. Sampling from directed models
• Easy and fast to draw fair samples from the whole
model
• Ancestral sampling: pass through the graph in
topological order. Sample each node given its
parents.
• Harder to sample some nodes given other nodes, unless the observed nodes come first in the topological ordering
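Ancestral sampling for the relay-race chain t0 → t1 → t2 can be sketched in a few lines; the Gaussian conditionals and their parameters below are invented for illustration:

```python
import random

def sample_relay():
    """Ancestral sampling for the chain t0 -> t1 -> t2.

    Visit nodes in topological order, sampling each given its already-sampled
    parents. The Gaussian conditionals here are hypothetical.
    """
    t0 = random.gauss(10.0, 1.0)            # Alice's finishing time
    t1 = t0 + abs(random.gauss(10.0, 1.0))  # Bob finishes after Alice
    t2 = t1 + abs(random.gauss(10.0, 1.0))  # Carol finishes after Bob
    return t0, t1, t2

t0, t1, t2 = sample_relay()
print(t0 < t1 < t2)  # True by construction
```

Sampling t0 given an observed t2 would instead require inference, which is why the third bullet above matters.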
23. Sampling from undirected models
• Usually requires Markov chains
• Usually cannot be done exactly
• Usually requires multiple iterations even to
approximate
• Described in Chapter 17
24. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
25. Tabular Case
• Assume each node has a tabular distribution given its parents
• Memory, sampling, inference are now exponential in number of
variables in factor with largest scope
• For many interesting models, this is very small
• e.g., RBMs: all factor scopes are size 2 or 1
• Previously, these costs were exponential in total number of nodes
• Statistically, much easier to estimate this manageable number of
parameters
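The statistical-efficiency point can be illustrated by counting parameters for a binary RBM versus a full lookup table over the same variables (the 16-unit sizes below are chosen arbitrarily):

```python
# Visible and hidden binary units; sizes chosen only for illustration.
n_v, n_h = 16, 16

# An RBM's energy uses pairwise factors (one weight per (v_i, h_j) pair)
# plus per-unit biases, so every factor scope has size 2 or 1.
rbm_params = n_v * n_h + n_v + n_h

# A full tabular joint over all 32 binary variables.
table_params = 2 ** (n_v + n_h) - 1

print(rbm_params, table_params)  # 288 4294967295
```

288 parameters can be estimated from a realistic dataset; four billion cannot.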
26. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
27. Learning about dependencies
• Suppose we have thousands of variables
• Maybe gene expression data
• Some interact
• Some do not
• We do not know which ahead of time
28. Structure learning strategy
• Try out several graphs
• See which graph does best job of some criterion
• Fitting training set with small model complexity
• Fitting validation set
• Iterative search, propose new graphs similar to best
graph so far (remove edge / add edge / flip edge)
29. Latent variable strategy
• Use one graph structure
• Many latent variables
• Dense connections of latent variables to observed variables
• Parameters learn that each latent variable interacts
strongly with only a small subset of observed variables
• Trainable just with gradient descent; no discrete search
over graphs
30. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
31. Inference and Approximate Inference
• Inferring the marginal distribution over some nodes, or the conditional distribution of some nodes given other nodes, is #P-hard
• NP-hardness describes decision problems. #P-hardness describes counting problems, e.g., how many solutions are there to a problem where finding one solution is NP-hard?
• We usually rely on approximate inference, described in
chapter 19
32. Roadmap
• Challenges of Unstructured Modeling
• Using Graphs to Describe Model Structure
• Sampling from Graphical Models
• Advantages of Structured Modeling
• Structure Learning and Latent Variables
• Inference and Approximate Inference
• The Deep Learning Approach to Structured Probabilistic Modeling
33. Deep Learning Stylistic Tendencies
• Nodes organized into layers
• High amount of connectivity between layers
• Examples: RBMs, DBMs, GANs, VAEs
Figure 16.14: An RBM drawn as a Markov network, with hidden units h1, h2, h3, h4 fully connected to visible units v1, v2, v3 and no edges within a layer.