Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf
Presented at ETH Zürich.
1. LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS
Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani
Presented by Justinas Mišeikis
Supervisor: Alexander Vezhnevets
2. DEEP BELIEF NETWORKS
• Deep belief networks consist of multiple layers
• They contain visible and hidden nodes
• Visible nodes appear only in the outermost layer and represent the output
• Nodes are linked by directed edges
• A DBN is a graphical model
4. DEEP BELIEF NETWORKS
Properties:
• Number of layers
• Number of nodes in each layer
• Network connectivity: connections are allowed only between consecutive layers
• Node types: binary or continuous?
9. THE PROBLEM - DBN STRUCTURE
• What is the best structure for a DBN?
- Number of hidden units in each layer
- Number of hidden layers
- Types of unit behaviour
- Connectivity
• The paper presents a nonparametric Bayesian approach for learning the structure of a layered DBN
10. FINITE SINGLE LAYER NETWORK
Network connectivity is represented using binary matrices:
• Columns and rows represent nodes
• Zero (non-filled) - no connection
• One (filled) - a connection
[Figure: binary connectivity matrix between the hidden and visible units]
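To make the matrix representation concrete, here is a small illustrative example in Python (the specific matrix is made up, not taken from the paper): rows index visible units, columns index hidden units, and a 1 marks an edge.

```python
import numpy as np

# Hypothetical 4-visible x 3-hidden connectivity matrix Z;
# Z[i, k] = 1 means hidden unit k is a parent of visible unit i.
Z = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

print("parents of visible unit 1:", np.flatnonzero(Z[1]))      # -> [0 1]
print("children of hidden unit 1:", np.flatnonzero(Z[:, 1]))   # -> [1 2]
```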
15. FINITE SINGLE LAYER NETWORK
• The dimensions of the network prior have to be defined in advance
• How many hidden units should there be?
- Not sure
• Can we have an infinite number of hidden units?
• Solution: the Indian buffet process
16. THE INDIAN BUFFET PROCESS
The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. *
• Rows - customers (visible layer), a finite number of units
• Columns - dishes (hidden layer), an unbounded, countable number of units
• The IBP creates sparse matrices whose posterior has a finite number of non-zero columns. However, during the learning process, column-wise growth of the matrix is unlimited.
* Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. JMLR, 2011. http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf
17. THE INDIAN BUFFET PROCESS
[Figure: binary matrix built up one customer (row) at a time, with dishes as columns]
• 1st customer tries 2 new dishes
• 2nd customer tries 1 old dish + 2 new
• 3rd customer tries 2 old dishes + 1 new
• 4th customer tries 2 old dishes + 2 new
• 5th customer tries 4 old dishes + 2 new
...
Parameters: α and β
ηk - number of previous customers that have tried dish k
The jth customer tries:
• a previously tasted dish k with probability ηk / (j + β - 1)
• a number of new dishes drawn from a Poisson distribution with parameter αβ / (j + β - 1)
22. THE INDIAN BUFFET PROCESS
If no more customers arrive, the resulting binary matrix defines the structure of the deep belief network.
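A minimal sketch of the generative process above, assuming exactly the two-parameter IBP stated on slide 17 (the function and variable names are mine):

```python
import numpy as np

def sample_ibp(num_customers, alpha, beta, rng):
    """Draw a binary matrix (customers x dishes) from the two-parameter IBP."""
    dish_counts = []   # eta_k: number of previous customers who tried dish k
    rows = []
    for j in range(1, num_customers + 1):
        # Old dish k is tried with probability eta_k / (j + beta - 1).
        row = [rng.random() < eta / (j + beta - 1) for eta in dish_counts]
        for k, tried in enumerate(row):
            dish_counts[k] += tried
        # Number of new dishes ~ Poisson(alpha * beta / (j + beta - 1)).
        n_new = rng.poisson(alpha * beta / (j + beta - 1))
        row.extend([True] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for j, row in enumerate(rows):
        Z[j, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
print(sample_ibp(5, alpha=2.0, beta=1.0, rng=rng))
```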
23. MULTI LAYER NETWORK
• Single-layer: hidden units are independent
• Multi-layer: hidden units can be dependent
• Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth
While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This “shallowness” is a strong assumption that weakens the model in practice and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure.
24. CASCADING IBP
• The cascading Indian buffet process (CIBP) builds a prior on belief networks that are unbounded in both width and depth
• The prior has the following properties:
- Each of the “dishes” in the restaurant of layer m is also a “customer” in the restaurant of layer m+1
- Columns in the layer-m binary matrix correspond to rows in the layer-(m+1) binary matrix
• The matrices in the CIBP are constructed in a sequence starting with m = 0, the visible layer
• The number of non-zero columns in matrix m+1 is determined entirely by the active non-zero columns in the previous matrix m
26. CASCADING IBP
• Layer 1 has 5 customers who tasted 5 dishes in total
• Layer 2 ‘inherits’ 5 customers <- 5 dishes in the previous layer
• These 5 customers in layer 2 taste 7 dishes in total
• Layer 3 ‘inherits’ 7 customers <- 7 dishes in the previous layer
• The cascade continues until, in some layer, the customers taste zero dishes
[Figure: binary matrices for Layer 1, Layer 2 and Layer 3]
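The cascade can be sketched by chaining draws of sample_ibp from above; the max_layers argument is only a safety cap I added, since (as slide 31 shows) the process terminates on its own:

```python
def sample_cibp(num_visible, alpha, beta, rng, max_layers=50):
    """Chain IBP draws: the dishes of layer m become the customers
    of layer m+1; stop when a layer introduces no dishes."""
    matrices = []
    customers = num_visible          # layer 0: the visible units
    for _ in range(max_layers):
        Z = sample_ibp(customers, alpha, beta, rng)
        if Z.shape[1] == 0:          # absorbing state: no dishes tasted
            break
        matrices.append(Z)
        customers = Z.shape[1]       # dishes here are customers one layer up
    return matrices

rng = np.random.default_rng(1)
for m, Z in enumerate(sample_cibp(5, alpha=1.0, beta=1.0, rng=rng), start=1):
    print(f"layer {m}: {Z.shape[0]} customers x {Z.shape[1]} dishes")
```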
30. CIBP PARAMETERS
• Two main parameters: α and β
• α - defines the expected in-degree of each unit, i.e. its number of parents
• β - controls the expected out-degree, i.e. the number of children, via the equation below
• K(m) is the number of columns in layer m
• α and β are layer-specific; they are not constant across the whole network and can be written as α(m) and β(m)
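The equation itself is an image on the original slide. A plausible form, consistent with the Poisson mean λ(K(m); α, β) named on the next slide and with the two-parameter IBP of Griffiths and Ghahramani, is the expected number of dishes generated by K(m) customers:

λ(K(m); α, β) = α · Σ_{j=1}^{K(m)} β / (β + j − 1)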
31. CIBP CONVERGENCE
• Does the CIBP eventually converge to create a finite-depth DBN?
- Yes!
• How?
- By applying the transition distribution of the Markov chain over layer widths: K(m+1) given K(m) is simply a Poisson distribution with mean λ(K(m); α, β)
• The absorbing state, where no ‘dishes’ are tasted, will always be reached
• The full mathematical proof of the convergence is given in the appendix of the paper
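An illustrative (not rigorous) check of the absorbing state, using the sample_cibp sketch from above; every draw terminates at a finite depth:

```python
# Illustrative only: each sampled cascade reaches the absorbing state
# (a layer in which zero dishes are tasted) at some finite depth.
rng = np.random.default_rng(2)
depths = [len(sample_cibp(5, alpha=1.0, beta=1.0, rng=rng))
          for _ in range(1000)]
print("max depth over 1000 draws:", max(depths))
print("mean depth over 1000 draws:", sum(depths) / len(depths))
```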
34. NODE TYPES
• The nonlinear Gaussian belief network (NLGBN) framework is used: a unit's pre-nonlinearity value u is Gaussian, with mean equal to the weighted activation sum y of its parents and precision ν
• The noisy sum is then transformed with the sigmoid function σ(∙)
• Black line shows the zero-mean output distribution
• Blue line shows a pre-sigmoid mean of -1
• Red line shows a pre-sigmoid mean of +1
[Figure: output densities; the limiting cases give Binary, Gaussian and Deterministic units]
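A small sketch of one NLGBN unit under the description above (names are mine), showing how the precision ν moves the unit between the nearly binary and nearly deterministic regimes in the figure:

```python
import numpy as np

def nlgbn_unit(y, nu, rng, n_samples=10000):
    """Sample x = sigmoid(u) with u ~ Normal(mean=y, precision=nu),
    where y is the weighted activation sum from the unit's parents."""
    u = rng.normal(loc=y, scale=1.0 / np.sqrt(nu), size=n_samples)
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(3)
# Low precision: outputs pile up near 0 and 1 (nearly binary unit).
# High precision: outputs concentrate at sigmoid(y) (nearly deterministic).
for nu in (0.1, 1.0, 100.0):
    x = nlgbn_unit(y=0.0, nu=nu, rng=rng)
    print(f"nu={nu:6.1f}: mean={x.mean():.2f}, std={x.std():.2f}")
```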
35. INFERENCE: JOINT DISTRIBUTION
[Equation: the joint distribution of the model, annotated on the slide with: layer number, weights matrix, bias weights, in-layer units, activations, Gaussian noise, NLGBN distribution, precision of input data, number of observations]
36. MARKOV CHAIN MONTE CARLO
* Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003. http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf
37. INFERENCE
• Task: find the posterior distribution over the structure and parameters of the network
• Conditioning is used to update the model part-by-part rather than modifying the whole model at each step
• The process is split into four parts:
- Edges: sample from the posterior distribution over each edge's weight
- Activations: sample from the posterior distributions over the Gaussian noise precisions
- Structure: sample the ancestors of the visible units
- Parameters: closely tied to the hyper-parameters
39. SAMPLING FROM THE STRUCTURE
[Figure: units in Layer 1 and Layer 2, with the counts 1, 2, 1, 1, 0 of non-zero entries written above the Layer 2 columns]
First phase:
• For each layer m
• For each unit k in the layer
• Check each connected unit in layer m+1, indexed by k'
• Count the non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row
• If the sum is zero, unit k' is a singleton parent
47. SAMPLING FROM THE STRUCTURE
Second phase:
• Considers only singleton parents
• Option a: add a new parent
• Option b: delete the connection to child k
• Decisions are made by a Metropolis-Hastings operator using a birth/death process
• In the end, units that are not ancestors of the visible units are discarded
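The first-phase singleton check reduces to column counts over the binary matrix; the example matrix below reproduces the counts 1, 2, 1, 1, 0 from the figure above (a sketch, with names of my choosing):

```python
import numpy as np

def singleton_parents(Z):
    """Return (child k, parent k') pairs where parent k' has no
    children other than k, i.e. column k' of Z sums to 1."""
    col_sums = Z.sum(axis=0)   # here: 1, 2, 1, 1, 0
    return [(int(k), int(k_prime))
            for k, k_prime in zip(*np.nonzero(Z))
            if col_sums[k_prime] == 1]   # excluding row k, the count is zero

Z = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]])
print(singleton_parents(Z))   # -> [(0, 0), (1, 2), (1, 3)]
```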
48. EXPERIMENTS
• Three image datasets were used for the experiments:
- Olivetti faces
- MNIST digit data
- Frey faces
• Performance test: image reconstruction
• The bottom halves of the images were removed and the model had to reconstruct the missing data after ‘seeing’ only the top half
• A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make left-right reconstruction too easy
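The masking protocol can be illustrated in a few lines (purely illustrative; the array size matches the 64x64 Olivetti images):

```python
import numpy as np

rng = np.random.default_rng(4)
img = rng.random((64, 64))   # stand-in for a 64x64 Olivetti face
observed = img.copy()
observed[32:, :] = np.nan    # remove the bottom half
# The model sees only the top half and must infer the NaN region;
# reconstruction quality is judged against the held-out bottom half.
```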
49. OLIVETTI FACES
350 training + 50 test images of 40 distinct subjects, 64x64
~3 hidden layers: around 70 units in each layer
51. MNIST DIGIT DATA
50 training + 10 test images of 10 digits, 28x28
~3 hidden layers: 120, 100, 70 units in the hidden layers
52. FREY FACES
1865 training + 100 test images of a single face with different expressions, 20x28
~3 hidden layers: 260, 120, 35 units in the hidden layers
53. DISCUSSION
• Addresses the structure-selection issues of deep belief networks
• Unites two areas of research: nonparametric Bayesian methods and deep belief networks
• Introduces the cascading Indian buffet process to allow an unbounded number of layers
• The CIBP always converges
• Result: the algorithm learns the effective model complexity
54. DISCUSSION
• A very processor-intensive algorithm
- finding reconstructions took a ‘few hours of CPU time’
• Is it much better than fixed-dimensionality DBNs?