Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf
Presented at ETH Zürich.
1. LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS
Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani
Presented by Justinas Mišeikis
Supervisor: Alexander Vezhnevets
2. DEEP BELIEF NETWORKS
• Deep belief networks consist of multiple layers
• They contain visible and hidden nodes
• Visible nodes appear only in the outermost layer and represent the output
• Nodes are linked by directed edges
• A DBN is a graphical model
4. DEEP BELIEF NETWORKS
Properties:
• Number of layers
• Number of nodes in each layer
• Network connectivity: connections are allowed only between consecutive layers
• Node types: binary or continuous?
9. THE PROBLEM - DBN STRUCTURE
• What is the best structure for a DBN?
- Number of hidden units in each layer
- Number of hidden layers
- Types of unit behaviour
- Connectivity
• The paper presents a nonparametric Bayesian approach for learning the structure of a layered DBN
10. FINITE SINGLE LAYER NETWORK
Network connectivity is represented using binary matrices:
• Columns and rows represent nodes
• Zero (non-filled) - no connection
• One (filled) - a connection
[Figure: binary connectivity matrix between the hidden and visible units]
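To make the matrix representation concrete, here is a small illustrative example in Python (the specific matrix is made up, not taken from the paper): rows index visible units, columns index hidden units, and a 1 marks an edge.

```python
import numpy as np

# Hypothetical 4-visible x 3-hidden connectivity matrix Z;
# Z[i, k] = 1 means hidden unit k is a parent of visible unit i.
Z = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

print("parents of visible unit 1:", np.flatnonzero(Z[1]))      # -> [0 1]
print("children of hidden unit 1:", np.flatnonzero(Z[:, 1]))   # -> [1 2]
```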
15. FINITE SINGLE LAYER NETWORK
• The dimensions of the network prior have to be defined in advance
• How many hidden units should there be?
- Not sure
• Can we have an infinite number of hidden units?
• Solution: the Indian buffet process
16. THE INDIAN BUFFET PROCESS
The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. *
• Rows - customers (visible layer), a finite number of units
• Columns - dishes (hidden layer), an unbounded, countable number of units
• The IBP creates sparse matrices whose posterior has a finite number of non-zero columns. However, during the learning process, column-wise growth of the matrix is unlimited.
* Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. JMLR, 2011. http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf
17. THE INDIAN BUFFET PROCESS
[Figure: binary matrix built up one customer (row) at a time, with dishes as columns]
• 1st customer tries 2 new dishes
• 2nd customer tries 1 old dish + 2 new
• 3rd customer tries 2 old dishes + 1 new
• 4th customer tries 2 old dishes + 2 new
• 5th customer tries 4 old dishes + 2 new
...
Parameters: α and β
ηk - number of previous customers that have tried dish k
The jth customer tries:
• a previously tasted dish k with probability ηk / (j + β - 1)
• a number of new dishes drawn from a Poisson distribution with parameter αβ / (j + β - 1)
22. THE INDIAN BUFFET PROCESS
If no more customers arrive, the resulting binary matrix defines the structure of the deep belief network.
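A minimal sketch of the generative process above, assuming exactly the two-parameter IBP stated on slide 17 (the function and variable names are mine):

```python
import numpy as np

def sample_ibp(num_customers, alpha, beta, rng):
    """Draw a binary matrix (customers x dishes) from the two-parameter IBP."""
    dish_counts = []   # eta_k: number of previous customers who tried dish k
    rows = []
    for j in range(1, num_customers + 1):
        # Old dish k is tried with probability eta_k / (j + beta - 1).
        row = [rng.random() < eta / (j + beta - 1) for eta in dish_counts]
        for k, tried in enumerate(row):
            dish_counts[k] += tried
        # Number of new dishes ~ Poisson(alpha * beta / (j + beta - 1)).
        n_new = rng.poisson(alpha * beta / (j + beta - 1))
        row.extend([True] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for j, row in enumerate(rows):
        Z[j, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
print(sample_ibp(5, alpha=2.0, beta=1.0, rng=rng))
```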
23. MULTI LAYER NETWORK
• Single-layer: hidden units are independent
• Multi-layer: hidden units can be dependent
• Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth
While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This “shallowness” is a strong assumption that weakens the model in practice and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure.
24. CASCADING IBP
• The cascading Indian buffet process (CIBP) builds a prior on belief networks that are unbounded in both width and depth
• The prior has the following properties:
- Each of the “dishes” in the restaurant of layer m is also a “customer” in the restaurant of layer m+1
- Columns in the layer-m binary matrix correspond to rows in the layer-(m+1) binary matrix
• The matrices in the CIBP are constructed in a sequence starting with m = 0, the visible layer
• The number of non-zero columns in matrix m+1 is determined entirely by the active non-zero columns in the previous matrix m
26. CASCADING IBP
• Layer 1 has 5 customers who tasted 5 dishes in total
• Layer 2 ‘inherits’ 5 customers <- 5 dishes in the previous layer
• These 5 customers in layer 2 taste 7 dishes in total
• Layer 3 ‘inherits’ 7 customers <- 7 dishes in the previous layer
• The cascade continues until, in some layer, the customers taste zero dishes
[Figure: binary matrices for Layer 1, Layer 2 and Layer 3]
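The cascade can be sketched by chaining draws of sample_ibp from above; the max_layers argument is only a safety cap I added, since (as slide 31 shows) the process terminates on its own:

```python
def sample_cibp(num_visible, alpha, beta, rng, max_layers=50):
    """Chain IBP draws: the dishes of layer m become the customers
    of layer m+1; stop when a layer introduces no dishes."""
    matrices = []
    customers = num_visible          # layer 0: the visible units
    for _ in range(max_layers):
        Z = sample_ibp(customers, alpha, beta, rng)
        if Z.shape[1] == 0:          # absorbing state: no dishes tasted
            break
        matrices.append(Z)
        customers = Z.shape[1]       # dishes here are customers one layer up
    return matrices

rng = np.random.default_rng(1)
for m, Z in enumerate(sample_cibp(5, alpha=1.0, beta=1.0, rng=rng), start=1):
    print(f"layer {m}: {Z.shape[0]} customers x {Z.shape[1]} dishes")
```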
30. CIBP PARAMETERS
• Two main parameters: α and β
• α - defines the expected in-degree of each unit, i.e. its number of parents
• β - controls the expected out-degree, i.e. the number of children, via the equation below
• K(m) is the number of columns in layer m
• α and β are layer-specific; they are not constant across the whole network and can be written as α(m) and β(m)
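The equation itself is an image on the original slide. A plausible form, consistent with the Poisson mean λ(K(m); α, β) named on the next slide and with the two-parameter IBP of Griffiths and Ghahramani, is the expected number of dishes generated by K(m) customers:

λ(K(m); α, β) = α · Σ_{j=1}^{K(m)} β / (β + j − 1)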
31. CIBP CONVERGENCE
• Does the CIBP eventually converge to create a finite-depth DBN?
- Yes!
• How?
- By applying the transition distribution of the Markov chain over layer widths: K(m+1) given K(m) is simply a Poisson distribution with mean λ(K(m); α, β)
• The absorbing state, where no ‘dishes’ are tasted, will always be reached
• The full mathematical proof of the convergence is given in the appendix of the paper
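An illustrative (not rigorous) check of the absorbing state, using the sample_cibp sketch from above; every draw terminates at a finite depth:

```python
# Illustrative only: each sampled cascade reaches the absorbing state
# (a layer in which zero dishes are tasted) at some finite depth.
rng = np.random.default_rng(2)
depths = [len(sample_cibp(5, alpha=1.0, beta=1.0, rng=rng))
          for _ in range(1000)]
print("max depth over 1000 draws:", max(depths))
print("mean depth over 1000 draws:", sum(depths) / len(depths))
```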
34. NODE TYPES
• The nonlinear Gaussian belief network (NLGBN) framework is used: a unit's pre-nonlinearity value u is Gaussian, with mean equal to the weighted activation sum y of its parents and precision ν
• The noisy sum is then transformed with the sigmoid function σ(∙)
• Black line shows the zero-mean output distribution
• Blue line shows a pre-sigmoid mean of -1
• Red line shows a pre-sigmoid mean of +1
[Figure: output densities; the limiting cases give Binary, Gaussian and Deterministic units]
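A small sketch of one NLGBN unit under the description above (names are mine), showing how the precision ν moves the unit between the nearly binary and nearly deterministic regimes in the figure:

```python
import numpy as np

def nlgbn_unit(y, nu, rng, n_samples=10000):
    """Sample x = sigmoid(u) with u ~ Normal(mean=y, precision=nu),
    where y is the weighted activation sum from the unit's parents."""
    u = rng.normal(loc=y, scale=1.0 / np.sqrt(nu), size=n_samples)
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(3)
# Low precision: outputs pile up near 0 and 1 (nearly binary unit).
# High precision: outputs concentrate at sigmoid(y) (nearly deterministic).
for nu in (0.1, 1.0, 100.0):
    x = nlgbn_unit(y=0.0, nu=nu, rng=rng)
    print(f"nu={nu:6.1f}: mean={x.mean():.2f}, std={x.std():.2f}")
```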
35. INFERENCE: JOINT DISTRIBUTION
[Equation: the joint distribution of the model, annotated on the slide with: layer number, weights matrix, bias weights, in-layer units, activations, Gaussian noise, NLGBN distribution, precision of input data, number of observations]
36. MARKOV CHAIN MONTE CARLO
* Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003. http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf
37. INFERENCE
• Task: find the posterior distribution over the structure and parameters of the network
• Conditioning is used to update the model part-by-part rather than modifying the whole model at each step
• The process is split into four parts:
- Edges: sample from the posterior distribution over each edge's weight
- Activations: sample from the posterior distributions over the Gaussian noise precisions
- Structure: sample the ancestors of the visible units
- Parameters: closely tied to the hyper-parameters
39. SAMPLING FROM THE STRUCTURE
[Figure: units in Layer 1 and Layer 2, with the counts 1, 2, 1, 1, 0 of non-zero entries written above the Layer 2 columns]
First phase:
• For each layer m
• For each unit k in the layer
• Check each connected unit in layer m+1, indexed by k'
• Count the non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row
• If the sum is zero, unit k' is a singleton parent
47. SAMPLING FROM THE STRUCTURE
Second phase:
• Considers only singleton parents
• Option a: add a new parent
• Option b: delete the connection to child k
• Decisions are made by a Metropolis-Hastings operator using a birth/death process
• In the end, units that are not ancestors of the visible units are discarded
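The first-phase singleton check reduces to column counts over the binary matrix; the example matrix below reproduces the counts 1, 2, 1, 1, 0 from the figure above (a sketch, with names of my choosing):

```python
import numpy as np

def singleton_parents(Z):
    """Return (child k, parent k') pairs where parent k' has no
    children other than k, i.e. column k' of Z sums to 1."""
    col_sums = Z.sum(axis=0)   # here: 1, 2, 1, 1, 0
    return [(int(k), int(k_prime))
            for k, k_prime in zip(*np.nonzero(Z))
            if col_sums[k_prime] == 1]   # excluding row k, the count is zero

Z = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]])
print(singleton_parents(Z))   # -> [(0, 0), (1, 2), (1, 3)]
```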
48. EXPERIMENTS
• Three image datasets were used for the experiments:
- Olivetti faces
- MNIST digit data
- Frey faces
• Performance test: image reconstruction
• The bottom halves of the images were removed and the model had to reconstruct the missing data after ‘seeing’ only the top half
• A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make left-right reconstruction too easy
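The masking protocol can be illustrated in a few lines (purely illustrative; the array size matches the 64x64 Olivetti images):

```python
import numpy as np

rng = np.random.default_rng(4)
img = rng.random((64, 64))   # stand-in for a 64x64 Olivetti face
observed = img.copy()
observed[32:, :] = np.nan    # remove the bottom half
# The model sees only the top half and must infer the NaN region;
# reconstruction quality is judged against the held-out bottom half.
```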
49. OLIVETTI FACES
350 training + 50 test images of 40 distinct subjects, 64x64
~3 hidden layers: around 70 units in each layer
51. MNIST DIGIT DATA
50 training + 10 test images of 10 digits, 28x28
~3 hidden layers: 120, 100, 70 units in the hidden layers
52. FREY FACES
1865 training + 100 test images of a single face with different expressions, 20x28
~3 hidden layers: 260, 120, 35 units in the hidden layers
53. DISCUSSION
• Addresses the structure-selection issues of deep belief networks
• Unites two areas of research: nonparametric Bayesian methods and deep belief networks
• Introduces the cascading Indian buffet process to allow an unbounded number of layers
• The CIBP always converges
• Result: the algorithm learns the effective model complexity
54. DISCUSSION
• A very processor-intensive algorithm
- finding reconstructions took a ‘few hours of CPU time’
• Is it much better than fixed-dimensionality DBNs?