Learning the Structure of Deep Sparse Graphical Models - Paper presentation

Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf

Presented at ETH Zürich.


    1. LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS. Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani. Presented by Justinas Mišeikis. Supervisor: Alexander Vezhnevets
    2. DEEP BELIEF NETWORKS • Deep belief networks consist of multiple layers • They contain visible and hidden nodes • Visible nodes appear only in the outermost layer and represent the output • Nodes are linked by directed edges • It is a graphical model
    3. DEEP BELIEF NETWORKS (figure: hidden layers and a visible layer)
    4-8. DEEP BELIEF NETWORKS (five identical build slides) Properties: • # of layers • # of nodes in each layer • Network connectivity: connections are allowed only between consecutive layers • Node types: binary or continuous?
    9. THE PROBLEM - DBN STRUCTURE • What is the best structure for a DBN? - Number of hidden units in each layer - Number of hidden layers - Types of unit behaviour - Connectivity • This paper presents a non-parametric Bayesian approach for learning the structure of a layered DBN
    10-14. FINITE SINGLE LAYER NETWORK (five build slides; figure: connectivity matrix between hidden and visible nodes) Network connectivity is represented using binary matrices: • Columns and rows represent nodes • Zero (non-filled) - no connection • One (filled) - a connection (a toy example follows below)
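For concreteness, here is a toy version of such a connectivity matrix. The orientation (rows for the lower layer's units, columns for the units above them) follows the customer/dish convention of the later IBP slides, and the entries are made up purely for illustration.

```python
import numpy as np

# Z[i, k] == 1 means unit k in the layer above is a parent of (connects to) unit i below.
Z = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
print(Z.sum(axis=1))  # in-degree of each lower-layer unit (number of parents)
print(Z.sum(axis=0))  # out-degree of each upper-layer unit (number of children)
```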
    15. FINITE SINGLE LAYER NETWORK • The network dimensions for the prior have to be defined in advance • How many hidden units should there be? - Not sure • Can we have an infinite number of hidden units? • Solution: the Indian Buffet Process
    16. THE INDIAN BUFFET PROCESS The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns.* • Rows - customers (visible layer), a finite number of units • Columns - dishes (hidden layer), an unbounded, countable number of units • The IBP creates sparse matrices whose posterior has a finite number of non-zero columns; during the learning process, however, the matrix can grow column-wise without limit. * Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. 2011. http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf
    17. THE INDIAN BUFFET PROCESS (figure: customers as rows, dishes as columns) 1st customer tries 2 new dishes. Parameters: α and β; η_k - number of previous customers that have tried dish k. The jth customer tries: • each previously tasted dish k with probability η_k / (j + β - 1) • a Poisson(αβ / (j + β - 1)) number of new dishes
    18. THE INDIAN BUFFET PROCESS 2nd customer tries 1 old dish + 2 new (same sampling rules as above)
    19. THE INDIAN BUFFET PROCESS 3rd customer tries 2 old dishes + 1 new
    20. THE INDIAN BUFFET PROCESS 4th customer tries 2 old dishes + 2 new
    21. THE INDIAN BUFFET PROCESS 5th customer tries 4 old dishes + 2 new
    22. THE INDIAN BUFFET PROCESS If no more customers come in, the resulting binary matrix defines the structure of the deep belief network (a generative sketch follows below)
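The slides above fully specify a generative procedure, so here is a minimal sketch of drawing one IBP matrix from it. The numpy-based implementation, names and default values are my own illustration, not code from the paper.

```python
import numpy as np

def sample_ibp(num_customers, alpha, beta, rng=None):
    """Draw one binary matrix from the two-parameter Indian Buffet Process.

    Rows are customers (units in the current layer), columns are dishes
    (units in the next layer), following the slides above.
    """
    rng = np.random.default_rng() if rng is None else rng
    dish_counts = []   # eta_k: how many previous customers have tried dish k
    rows = []          # one row of dish indicators per customer

    for j in range(1, num_customers + 1):
        # A previously tasted dish k is taken with probability eta_k / (j + beta - 1).
        row = [int(rng.random() < eta_k / (j + beta - 1.0)) for eta_k in dish_counts]
        # The j-th customer also tries Poisson(alpha * beta / (j + beta - 1)) new dishes.
        num_new = rng.poisson(alpha * beta / (j + beta - 1.0))
        row.extend([1] * num_new)
        dish_counts = [c + r for c, r in zip(dish_counts, row)] + [1] * num_new
        rows.append(row)

    # Pad earlier (shorter) rows with zeros so all rows share the final number of dishes.
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(num_customers=5, alpha=1.5, beta=1.0)
print(Z)  # a sparse binary matrix: 5 rows, a random (finite) number of columns
```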
    23. MULTI-LAYER NETWORK • Single layer: hidden units are independent • Multiple layers: hidden units can be dependent • Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth. While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This "shallowness" is a strong assumption that weakens the model in practice, and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure.
    24. CASCADING IBP • The cascading Indian Buffet Process builds a prior on belief networks that are unbounded in both width and depth • The prior has the following properties: - Each of the "dishes" in the restaurant of layer m is also a "customer" in the restaurant of layer m+1 - Columns in the layer-m binary matrix correspond to rows in the layer-(m+1) binary matrix • The matrices in the CIBP are constructed in a sequence starting with m = 0, the visible layer • The number of non-zero columns in matrix m+1 is determined entirely by the active (non-zero) columns in the previous matrix m
    25. CASCADING IBP • Layer 1 has 5 customers who tasted 5 dishes in total (figure: Layer 1)
    26. CASCADING IBP • Layer 2 'inherits' 5 customers <- the 5 dishes in the previous layer (figure: Layers 1-2)
    27. CASCADING IBP • These 5 customers in layer 2 taste 7 dishes in total
    28. CASCADING IBP • Layer 3 'inherits' 7 customers <- the 7 dishes in the previous layer (figure: Layers 1-3)
    29. CASCADING IBP • This continues until, in some layer, the customers taste zero dishes (a sketch of the cascade follows below)
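Reusing sample_ibp from the sketch above, here is a minimal illustration of the cascade: the dishes of one layer become the customers of the next, and sampling stops at the absorbing state where no dishes are tasted. A single global α and β are assumed for simplicity, although the slides note they can be layer-specific.

```python
import numpy as np

def sample_cibp(num_visible, alpha, beta, rng=None, max_layers=50):
    """Sample a sequence of IBP connectivity matrices, one per layer boundary,
    starting from the visible layer and stopping when a layer has no units."""
    rng = np.random.default_rng() if rng is None else rng
    matrices = []
    num_customers = num_visible          # layer 0: the visible units
    for _ in range(max_layers):          # safety cap; the process converges on its own
        Z = sample_ibp(num_customers, alpha, beta, rng)
        if Z.shape[1] == 0:              # absorbing state: no dishes tasted
            break
        matrices.append(Z)
        num_customers = Z.shape[1]       # this layer's dishes are the next layer's customers
    return matrices

layers = sample_cibp(num_visible=5, alpha=1.5, beta=1.0)
print([Z.shape for Z in layers])         # widths of the successive hidden layers
```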
    30. CIBP PARAMETERS • Two main parameters: α and β • α defines the expected in-degree of each unit, i.e. its number of parents • β controls the expected out-degree, i.e. the number of children (the slide gives the corresponding equation in terms of K(m)) • K(m) is the number of columns in layer m • α and β are layer-specific; they are not constant across the whole network and can be written as α(m) and β(m)
    31. CIBP CONVERGENCE • Does the CIBP eventually converge to a finite-depth DBN? - Yes! • How? - By applying the transition distribution shown on the slide to the Markov chain of layer widths • It is simply a Poisson distribution with mean λ(K(m); α, β) • The absorbing state, where no 'dishes' are tasted, will always be reached • A full mathematical proof of the convergence is given in the appendix of the paper
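Because each of the K(m) customers draws its new dishes from an independent Poisson with the rate given on the earlier IBP slides, the width of the next layer is itself Poisson. A hedged reading of the mean λ referenced above, assuming the layer-(m+1) restaurant starts empty as in the cascade just described, is:

```latex
% Sum of the per-customer new-dish rates alpha*beta / (j + beta - 1):
K^{(m+1)} \mid K^{(m)} \sim \mathrm{Poisson}\!\left(\lambda\!\left(K^{(m)};\alpha,\beta\right)\right),
\qquad
\lambda(K;\alpha,\beta) = \sum_{j=1}^{K} \frac{\alpha\beta}{j+\beta-1}.
```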
    32. CIBP CONVERGENCE (figure: α = 3, β = 1)
    33. CIBP-BASED PRIOR SAMPLES (figure)
    34. NODE TYPES • The nonlinear Gaussian belief network (NLGBN) framework is used: each unit's value is its activation sum y plus Gaussian noise with precision ν • The noisy sum is then transformed with a sigmoid function σ(·) (a small sketch of one such unit follows below) • (figure: the black line shows the zero-mean distribution, the blue line a pre-sigmoid mean of -1, the red line a pre-sigmoid mean of +1; panels labelled Binary, Gaussian, Deterministic)
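The slide above describes the NLGBN unit; below is a minimal sketch of sampling one such unit under that description. The function and variable names are my own illustration, not code from the paper.

```python
import numpy as np

def nlgbn_unit(parent_values, weights, bias, nu, rng=None):
    """Sample one NLGBN unit: a sigmoid of the noisy weighted sum of its parents."""
    rng = np.random.default_rng() if rng is None else rng
    pre_activation = np.dot(weights, parent_values) + bias        # activation sum y
    noisy = pre_activation + rng.normal(0.0, 1.0 / np.sqrt(nu))   # precision nu -> std 1/sqrt(nu)
    return 1.0 / (1.0 + np.exp(-noisy))                           # sigmoid squashing

# Low precision (heavy noise) pushes the output towards 0/1 (binary-like behaviour);
# high precision makes the unit close to a deterministic sigmoid.
print(nlgbn_unit(parent_values=[0.2, 0.8], weights=[1.5, -0.7], bias=0.1, nu=2.0))
```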
    35. INFERENCE: JOINT DISTRIBUTION (annotated equation; the labelled terms include the precision of the input data, the Gaussian noise, the activations, the bias weights, the in-layer units, the NLGBN distribution, the layer number, the weight matrices and the number of observations)
    36. MARKOV CHAIN MONTE CARLO * Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003. http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf
    37-38. INFERENCE (two identical slides) • Task: find the posterior distribution over the structure and the parameters of the network • Conditioning is used to update the model part-by-part rather than modifying the whole model at each step • The process is split into four parts: - Edges: sample from the posterior distribution over each edge's weight - Activations: sample from the posterior distributions over the Gaussian noise precisions - Structure: sample the ancestors of the visible units - Parameters: closely tied with the hyper-parameters
    39-46. SAMPLING FROM THE STRUCTURE (build slides; figure: Layers 1 and 2 with per-column counts) First phase: • For each layer • For each unit k in the layer • Check each connected unit in layer m+1, indexed by k' • Calculate the number of non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row • If that sum is zero, unit k' is a singleton parent
    47. SAMPLING FROM THE STRUCTURE Second phase: • Considers only singletons • Option a: add a new parent • Option b: delete the connection to child k • Decisions are made by a Metropolis-Hastings operator using a birth/death process • In the end, units that are not ancestors of the visible units are discarded (a small sketch of the singleton check follows below)
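The first-phase singleton check described above reduces to a column sum that excludes one row. A minimal sketch, where Z is the binary matrix between layer m and layer m+1 and the indexing and names are my own:

```python
import numpy as np

def is_singleton_parent(Z, k, k_prime):
    """True if unit k' in layer m+1 has no children other than unit k in layer m,
    i.e. the k'-th column of Z has no non-zero entries once row k is excluded."""
    column = Z[:, k_prime].copy()
    column[k] = 0                      # exclude the entry in the k-th row
    return column.sum() == 0

Z = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 1, 0]])
print(is_singleton_parent(Z, k=0, k_prime=2))  # True: unit 2 only connects to unit 0
```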
    48. EXPERIMENTS • Three image datasets were used for the experiments: - Olivetti faces - MNIST digit data - Frey faces • Performance test: image reconstruction • The bottom halves of images were removed and the model had to reconstruct the missing data from the top half alone • A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make the task easier
    49. OLIVETTI FACES: 350 + 50 images of 40 distinct subjects, 64x64; ~3 hidden layers with around 70 units in each layer
    50. OLIVETTI FACES: Raw predictive fantasies from the model (figure)
    51. MNIST DIGIT DATA: 50 + 10 images of 10 digits, 28x28; ~3 hidden layers with 120, 100, 70 units
    52. FREY FACES: 1865 + 100 images of a single face with different expressions, 20x28; ~3 hidden layers with 260, 120, 35 units
    53. DISCUSSION • Addresses the structural issues with deep belief networks • Unites two areas of research: nonparametric Bayesian methods and deep belief networks • Introduces the cascading Indian buffet process to allow an unbounded number of layers • The CIBP always converges • Result: the algorithm learns the effective model complexity
    54. DISCUSSION • A very processor-intensive algorithm: finding the reconstructions took 'a few hours of CPU time' • Is it much better than fixed-dimensionality DBNs?
    55. THANK YOU!
