Learning the Structure of Deep Sparse Graphical Models - Paper presentation

Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf

Presented at ETH Zürich.

Transcript

    • 1. LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS. Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani. Presented by Justinas Mišeikis. Supervisor: Alexander Vezhnevets.
    • 2. DEEP BELIEF NETWORKS • Deep belief networks consist of multiple layers • The network consists of visible and hidden nodes • Visible nodes appear only in the outermost layer and represent the output • Nodes are linked by directed edges • It is a graphical model
    • 3. DEEP BELIEF NETWORKS (figure: hidden layers stacked above the visible layer)
    • 4. DEEP BELIEF NETWORKS Properties: • Number of layers • Number of nodes in each layer • Network connectivity; connections are allowed only between consecutive layers • Node types: binary or continuous?
    • 5.-8. DEEP BELIEF NETWORKS (build slides repeating the properties list from slide 4)
    • 9. THE PROBLEM - DBN STRUCTURE • What is the best structure for a DBN? - Number of hidden units in each layer - Number of hidden layers - Types of unit behaviour - Connectivity • The paper presents a non-parametric Bayesian approach for learning the structure of a layered DBN
    • 10. FINITE SINGLE-LAYER NETWORK Network connectivity is represented using binary matrices. • Columns and rows represent nodes • Zero (non-filled) - no connection • One (filled) - a connection (figure: hidden and visible units with the corresponding binary matrix)
    • 11.-14. FINITE SINGLE-LAYER NETWORK (build slides repeating the binary-matrix description from slide 10)
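The binary-matrix view of a single layer is easy to make concrete. The following is a minimal sketch (not from the paper or the slides) in Python/NumPy, in which rows index visible units, columns index hidden units, and Z[v, h] = 1 means hidden unit h is a parent of visible unit v; all names are illustrative.

```python
import numpy as np

# Binary connectivity matrix for a finite single-layer belief network:
# rows = visible units ("customers"), columns = hidden units ("dishes").
Z = np.array([
    [1, 0, 1, 0],   # visible unit 0 has hidden parents 0 and 2
    [0, 1, 1, 0],   # visible unit 1 has hidden parents 1 and 2
    [1, 0, 0, 1],   # visible unit 2 has hidden parents 0 and 3
], dtype=int)

def parents_of(Z, v):
    """Indices of hidden units connected to visible unit v."""
    return np.flatnonzero(Z[v])

def children_of(Z, h):
    """Indices of visible units connected to hidden unit h."""
    return np.flatnonzero(Z[:, h])

print(parents_of(Z, 0))   # -> [0 2]
print(children_of(Z, 2))  # -> [0 1]
```

With a fixed number of columns this matrix must be sized in advance, which is exactly the limitation raised on the next slide.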
    • 15. FINITE SINGLE-LAYER NETWORK • The network dimensions for the prior have to be defined in advance • How many hidden units should there be? - Not clear in advance • Can we have an infinite number of hidden units? • Solution: the Indian Buffet Process
    • 16. THE INDIAN BUFFET PROCESS The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns.* • Rows - customers (visible layer), a finite number of units • Columns - dishes (hidden layer), an unbounded, countable number of units • The IBP yields sparse matrices in which only a finite number of columns are non-zero a posteriori; during learning, however, the matrix can grow column-wise without limit. * Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. 2011. http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf
    • 17.-20. THE INDIAN BUFFET PROCESS (build slides adding one customer at a time to the example that is completed on slide 21)
    • 21. THE INDIAN BUFFET PROCESS Customers and dishes example: • 1st customer tries 2 new dishes • 2nd customer tries 1 old dish + 2 new • 3rd customer tries 2 old dishes + 1 new • 4th customer tries 2 old dishes + 2 new • 5th customer tries 4 old dishes + 2 new • ... Parameters: α and β; ηk is the number of previous customers that have tried dish k. The jth customer tries: • a previously tasted dish k with probability ηk / (j + β - 1) • a Poisson(αβ / (j + β - 1)) number of new dishes
    • 22. THE INDIAN BUFFET PROCESS (same customer/dish example as slide 21) If no more customers come in, the marked binary matrix defines the structure of the deep belief network.
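A generative sampler for the two-parameter IBP follows directly from the probabilities quoted on slides 17-22. The sketch below is illustrative (not the paper's code; all names are ours) and draws one customer-by-dish binary matrix.

```python
import numpy as np

def sample_ibp(num_customers, alpha, beta, rng=None):
    """Draw a binary matrix from the two-parameter Indian buffet process.

    Rows are customers, columns are dishes. Customer j takes a previously
    tasted dish k with probability eta_k / (j + beta - 1), where eta_k is
    the number of earlier customers who took dish k, and then samples
    Poisson(alpha * beta / (j + beta - 1)) brand-new dishes.
    """
    rng = rng or np.random.default_rng()
    dish_counts = []                      # eta_k for each existing dish
    rows = []
    for j in range(1, num_customers + 1):
        row = [int(rng.random() < eta_k / (j + beta - 1.0))
               for eta_k in dish_counts]
        new_dishes = rng.poisson(alpha * beta / (j + beta - 1.0))
        row.extend([1] * new_dishes)
        # update dish popularity counts
        for k in range(len(dish_counts)):
            dish_counts[k] += row[k]
        dish_counts.extend([1] * new_dishes)
        rows.append(row)
    # pad earlier rows with zeros for dishes introduced later
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for j, row in enumerate(rows):
        Z[j, :len(row)] = row
    return Z

Z = sample_ibp(num_customers=5, alpha=2.0, beta=1.0)
print(Z)
```

The resulting Z is exactly the binary connectivity matrix of slide 10, but with the number of columns (hidden units) chosen by the process rather than fixed in advance.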
    • 23. MULTI-LAYER NETWORK • Single layer: hidden units are independent • Multiple layers: hidden units can be dependent • Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth. While a belief network with an infinitely wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This "shallowness" is a strong assumption that weakens the model in practice, and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure.
    • 24. CASCADING IBP • The cascading Indian buffet process (CIBP) builds a prior on belief networks that are unbounded in both width and depth • The prior has the following properties: - Each of the "dishes" in the restaurant of layer m is also a "customer" in the restaurant of layer m+1 - Columns of the layer-m binary matrix correspond to the rows of the layer-(m+1) binary matrix • The matrices in the CIBP are constructed in a sequence starting with m = 0, the visible layer • The number of customers (rows) in matrix m+1 is determined entirely by the non-zero columns of the previous matrix m
    • 25.-28. CASCADING IBP (build slides constructing, layer by layer, the example that is completed on slide 29)
    • 29. CASCADING IBP • Layer 1 has 5 customers who tasted 5 dishes in total • Layer 2 'inherits' 5 customers <- the 5 dishes of the previous layer • These 5 customers in layer 2 taste 7 dishes in total • Layer 3 'inherits' 7 customers <- the 7 dishes of the previous layer • This continues until, in some layer, the customers taste zero dishes (figure: layer 1, layer 2 and layer 3 matrices)
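Stacking the IBP sampler gives a direct, illustrative sketch of the cascading construction described on slides 24-29 (assembled from the slides, not taken from the paper's code): the dishes of layer m become the customers of layer m+1, and sampling stops as soon as a layer introduces no dishes. It reuses `sample_ibp` from the earlier sketch.

```python
import numpy as np  # sample_ibp from the earlier sketch is assumed in scope

def sample_cibp(num_visible, alpha, beta, rng=None, max_layers=100):
    """Draw a sequence of binary matrices from the cascading IBP.

    matrices[m] connects layer m (rows) to layer m+1 (columns). The rows of
    matrices[m+1] are the columns (dishes) of matrices[m]; the cascade stops
    when a layer tastes no dishes. alpha and beta could be made layer-specific
    (alpha^(m), beta^(m)), as noted on slide 30. max_layers is only a safety
    guard; the process terminates on its own with probability one.
    """
    rng = rng or np.random.default_rng()
    matrices = []
    num_customers = num_visible              # layer 0 = visible layer
    for _ in range(max_layers):
        Z = sample_ibp(num_customers, alpha, beta, rng=rng)
        if Z.shape[1] == 0:                  # no dishes tasted: absorbing state
            break
        matrices.append(Z)
        num_customers = Z.shape[1]           # dishes become next layer's customers
    return matrices

layers = sample_cibp(num_visible=5, alpha=1.0, beta=1.0)
print([Z.shape for Z in layers])             # e.g. [(5, 3), (3, 4), (4, 1)]
```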
    • 30. CIBP PARAMETERS • Two main parameters: α and β • α defines the expected in-degree of each unit, i.e. its number of parents • β controls the expected out-degree, i.e. its number of children, as a function of K(m), the number of columns in layer m • α and β are layer-specific rather than constant across the whole network; they can be written as α(m) and β(m)
    • 31. CIBP CONVERGENCE • Does the CIBP eventually converge to a finite-depth DBN? - Yes! • How? - By analysing the Markov chain over layer widths • Its transition distribution is simply a Poisson distribution with mean λ(K(m); α, β) • The absorbing state, in which no 'dishes' are tasted, is always reached • A full mathematical proof of convergence is given in the appendix of the paper
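The slide does not reproduce the mean λ(K(m); α, β). Assuming the per-customer rate quoted on slides 17-22 (each of the K(m) "customers" of layer m draws a Poisson(αβ/(j + β − 1)) number of new dishes), the layer-width transition is a sum of independent Poisson draws, giving the following hedged reconstruction:

```latex
% Number of units spawned in layer m+1, reconstructed from the per-customer
% Poisson rates quoted on slides 17-22 (not copied from the paper):
K^{(m+1)} \sim \mathrm{Poisson}\!\left(\lambda\bigl(K^{(m)};\alpha,\beta\bigr)\right),
\qquad
\lambda\bigl(K^{(m)};\alpha,\beta\bigr)
  = \sum_{j=1}^{K^{(m)}} \frac{\alpha\beta}{j+\beta-1}.
```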
    • 32. CIBP CONVERGENCE (figure, α = 3, β = 1)
    • 33. CIBP-BASED PRIOR SAMPLES (figure: samples drawn from the CIBP-based prior)
    • 34. NODE TYPES • The nonlinear Gaussian belief network (NLGBN) framework is used: a unit's state u is obtained by adding Gaussian noise with precision ν to its activation sum y • The noisy sum is then transformed with the sigmoid function σ(·) • In the figure, the black line shows the distribution for a pre-sigmoid mean of zero, the blue line a pre-sigmoid mean of -1, and the red line a pre-sigmoid mean of +1 (figure panels: binary, Gaussian, deterministic)
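A single NLGBN unit, as described on slide 34, can be sketched as follows. This is a minimal illustration assuming the standard formulation (weighted sum of parent states plus bias, additive Gaussian noise with precision ν, then a sigmoid); the function and variable names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nlgbn_unit(parent_states, weights, bias, precision, rng=None):
    """Sample the state of one nonlinear Gaussian belief network unit.

    y : activation = weighted sum of parent states plus a bias
    u : state = sigmoid of (y + Gaussian noise with the given precision)
    Large precision -> nearly deterministic sigmoid unit; small precision
    pushes the state towards 0 or 1 (approximately binary behaviour).
    """
    rng = rng or np.random.default_rng()
    y = np.dot(weights, parent_states) + bias
    noise = rng.normal(loc=0.0, scale=1.0 / np.sqrt(precision))
    return sigmoid(y + noise)

state = nlgbn_unit(parent_states=np.array([0.2, 0.9]),
                   weights=np.array([1.5, -0.7]),
                   bias=0.1,
                   precision=4.0)
print(state)
```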
    • 35. INFERENCE: JOINT DISTRIBUTION (annotated equation; labelled terms on the slide: precision of the input data, Gaussian noise, activations, biases, weights, in-layer units, NLGBN distribution, layer number, weight matrices, number of observations)
    • 36. MARKOV CHAIN MONTE CARLO * Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003. http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf
    • 37. INFERENCE • Task: find the posterior distribution over the structure and the parameters of the network • Conditioning is used so that the model is updated part by part rather than modifying the whole model at each step • The process is split into four parts: - Edges: sample the posterior distribution over each edge's weight - Activations: sample from the posterior distributions over the Gaussian noise precisions - Structure: sample the ancestors of the visible units - Parameters: closely tied with the hyper-parameters
    • 38. INFERENCE (repeats the four-part inference breakdown from slide 37; see the loop sketch below)
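The four conditional updates can be arranged into a single MCMC sweep. The skeleton below is a hypothetical sketch of the overall loop only, not the paper's implementation; the four update functions are placeholder stubs standing in for the conditional samplers derived in the paper.

```python
import numpy as np

# Placeholder conditional updates: each would sample one block of variables
# given all the others. The real conditionals are derived in the paper;
# these stubs only illustrate the structure of the sweep.
def sample_edge_weights(state, data, rng):    return state
def sample_activations(state, data, rng):     return state
def sample_structure(state, data, rng):       return state
def sample_hyperparameters(state, rng):       return state

def mcmc_sweep(state, data, rng):
    """One sweep of the part-by-part scheme from slide 37."""
    state = sample_edge_weights(state, data, rng)     # edge weights
    state = sample_activations(state, data, rng)      # activations / noise precisions
    state = sample_structure(state, data, rng)        # add/remove units and edges
    state = sample_hyperparameters(state, rng)        # alpha, beta, etc.
    return state

def run_inference(state, data, num_sweeps=1000, rng=None):
    rng = rng or np.random.default_rng()
    for _ in range(num_sweeps):
        state = mcmc_sweep(state, data, rng)
    return state
```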
    • 39.-45. SAMPLING FROM THE STRUCTURE (build slides developing, step by step, the first phase that is summarised on slide 46)
    • 46. SAMPLING FROM THE STRUCTURE First phase: • For each layer • For each unit k in the layer • Check each connected unit in layer m+1, indexed by k' • Count the non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row • If the sum is zero, unit k' is a singleton parent (figure: layer 1 / layer 2 example with running counts 1, 2, 1, 1, 0)
    • 47. SAMPLING FROM THE STRUCTURE Second phase: • Considers only singletons • Option a: add a new parent • Option b: delete the connection to child k • Decisions are made by a Metropolis-Hastings operator using a birth/death process • In the end, units that are not ancestors of the visible units are discarded
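The first-phase bookkeeping on slides 39-47 reduces to a column count in the binary connectivity matrix. The helper below is an illustrative sketch of that check (not the paper's code); `Z` connects layer m (rows) to layer m+1 (columns), as in the earlier sketches.

```python
import numpy as np

def singleton_parents(Z, k):
    """Return the layer m+1 units whose only child in layer m is unit k.

    For each parent k' of unit k, count the non-zero entries in column k'
    of Z excluding row k; a count of zero means k' is a singleton parent.
    """
    parents = np.flatnonzero(Z[k])                       # connected units k'
    counts = Z[:, parents].sum(axis=0) - Z[k, parents]   # exclude row k
    return parents[counts == 0]

Z = np.array([[1, 0, 1, 0, 1],
              [1, 1, 1, 0, 0]])
print(singleton_parents(Z, k=0))   # only column 4 has no other child -> [4]
```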
    • 48. EXPERIMENTS • Three image datasets were used for the experiments: - Olivetti faces - MNIST digit data - Frey faces • Performance test: image reconstruction • The bottom halves of the images were removed and the model had to reconstruct the missing data from the top half alone • A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make the task easier
    • 49. OLIVETTI FACES 350 + 50 images of 40 distinct subjects, 64x64 pixels. ~3 hidden layers, around 70 units in each layer.
    • 50. OLIVETTI FACES Raw predictive fantasies from the model.
    • 51. MNIST DIGIT DATA 50 + 10 images of 10 digits, 28x28 pixels. ~3 hidden layers with 120, 100 and 70 units.
    • 52. FREY FACES 1865 + 100 images of a single face with different expressions, 20x28 pixels. ~3 hidden layers with 260, 120 and 35 units.
    • 53. DISCUSSION • Addresses the structure-selection issues of deep belief networks • Unites two areas of research: nonparametric Bayesian methods and deep belief networks • Introduces the cascading Indian buffet process to allow an unbounded number of layers • The CIBP always converges • Result: the algorithm learns the effective model complexity
    • 54. DISCUSSION • A very processor-intensive algorithm: finding the reconstructions took 'a few hours of CPU time' • Is it much better than fixed-dimensionality DBNs?
    • 55. THANK YOU!
