Learning the Structure of Deep Sparse Graphical Models - Paper presentation
 

Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf

Presented at ETH Zürich.

Presentation Transcript

  • LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS
    Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani
    Presented by Justinas Mišeikis
    Supervisor: Alexander Vezhnevets
  • DEEP BELIEF NETWORKS
    • Deep belief networks consist of multiple layers
    • They contain visible and hidden nodes
    • Visible nodes appear only in the outermost layer and represent the output
    • Nodes are linked by directed edges
    • A graphical model
  • DEEP BELIEF NETWORKS
    [Figure: layered network diagram showing the hidden layers and the visible layer]
  • DEEP BELIEF NETWORKS
    Properties:
    • Number of layers
    • Number of nodes in each layer
    • Network connectivity: connections are allowed only between consecutive layers
    • Node types: binary or continuous?
  • THE PROBLEM - DBN STRUCTURE
    • What is the best structure for a DBN?
      - Number of hidden units in each layer
      - Number of hidden layers
      - Types of unit behaviour
      - Connectivity
    • The paper presents a nonparametric Bayesian approach for learning the structure of a layered DBN
  • FINITE SINGLE LAYER NETWORK
    Network connectivity is represented using binary matrices (a tiny illustrative example follows this slide).
    • Columns and rows represent nodes
    • Zero (non-filled) - no connection
    • One (filled) - a connection
    [Figure: bipartite graph of hidden and visible units with the corresponding binary matrix]
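To make the binary-matrix representation concrete, here is a tiny made-up example (not taken from the paper): a NumPy array in which rows index visible units and columns index hidden units, with a 1 marking an edge.

```python
import numpy as np

# Hypothetical single-layer network: 4 visible units (rows), 3 hidden units (columns).
# Z[i, k] = 1 means hidden unit k is connected to (is a parent of) visible unit i.
Z = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# The parents of visible unit 1 are the hidden units with a non-zero entry in its row.
print(np.nonzero(Z[1])[0])   # -> [0 1]
```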
  • FINITE SINGLE LAYER NETWORK
    • The network dimensions for a prior have to be defined in advance
    • How many hidden units should there be? - Not clear in advance
    • Can we allow an unbounded number of hidden units?
    • Solution: the Indian Buffet Process
  • THE INDIAN BUFFET PROCESS
    The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns.*
    • Rows - customers (visible layer), a finite number of units
    • Columns - dishes (hidden layer), an unbounded but countable number of units
    • The IBP yields sparse matrices whose posterior has a finite number of non-zero columns; during learning, however, the matrix may grow without bound column-wise.
    * Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. 2011. http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf
  • THE INDIAN BUFFET PROCESS
    Customers arrive one at a time and choose dishes, filling in a growing binary matrix (a short sampler sketch follows this slide):
    • 1st customer tries 2 new dishes
    • 2nd customer tries 1 old dish + 2 new
    • 3rd customer tries 2 old dishes + 1 new
    • 4th customer tries 2 old dishes + 2 new
    • 5th customer tries 4 old dishes + 2 new
    Parameters: α and β; ηk is the number of previous customers that have tried dish k.
    The jth customer tries:
    • a previously tasted dish k with probability ηk / (j + β - 1)
    • a Poisson-distributed number of new dishes with parameter αβ / (j + β - 1)
    If no more customers come in, the resulting binary matrix defines the structure of the deep belief network.
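The generative process above can be sketched in a few lines of NumPy. This is an illustrative sampler for the two-parameter IBP prior described on the slide, not the paper's implementation; the function name and default parameter values are ours.

```python
import numpy as np

def sample_ibp(num_customers, alpha=3.0, beta=1.0, rng=None):
    """Sketch of the two-parameter Indian Buffet Process.

    Returns a binary matrix Z with one row per customer (unit in the lower
    layer) and one column per dish (unit in the layer above).
    """
    rng = np.random.default_rng() if rng is None else rng
    dish_counts = []                  # eta_k: how many customers have tried dish k
    rows = []
    for j in range(1, num_customers + 1):
        row = []
        # A previously tasted dish k is chosen with probability eta_k / (j + beta - 1).
        for eta_k in dish_counts:
            row.append(rng.random() < eta_k / (j + beta - 1))
        # The number of brand-new dishes is Poisson(alpha * beta / (j + beta - 1)).
        new = rng.poisson(alpha * beta / (j + beta - 1))
        row.extend([True] * new)
        dish_counts = [c + t for c, t in zip(dish_counts, row)] + [1] * new
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(5))
```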
  • MULTI LAYER NETWORK
    • Single-layer: hidden units are independent
    • Multi-layer: hidden units can be dependent
    • Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth
    While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This "shallowness" is a strong assumption that weakens the model in practice, and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure.
  • CASCADING IBP
    • The cascading Indian buffet process (CIBP) builds a prior on belief networks that are unbounded in both width and depth
    • The prior has the following properties:
      - Each of the "dishes" in the restaurant of layer m is also a "customer" in the restaurant of layer m+1
      - Columns in the layer-m binary matrix correspond to rows in the layer-(m+1) binary matrix
    • The matrices in the CIBP are constructed in sequence, starting with m = 0, the visible layer
    • The number of non-zero columns in matrix m+1 is determined entirely by the active (non-zero) columns of the previous matrix m
  • CASCADING IBP
    Layer-by-layer construction (a sketch continuing the sampler above follows this slide):
    • Layer 1 has 5 customers, who taste 5 dishes in total
    • Layer 2 'inherits' 5 customers <- the 5 dishes of the previous layer
    • These 5 customers in layer 2 taste 7 dishes in total
    • Layer 3 'inherits' 7 customers <- the 7 dishes of the previous layer
    • The cascade continues until, in some layer, the customers taste zero dishes
    [Figure: the binary matrices for Layer 1, Layer 2 and Layer 3]
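Stacking the same sampler layer by layer gives a minimal sketch of the cascade: each layer's dishes become the next layer's customers, and the construction stops as soon as a layer produces no dishes. It reuses the hypothetical `sample_ibp` function (and the NumPy import) from the earlier sketch.

```python
def sample_cibp(num_visible, alpha=3.0, beta=1.0, max_layers=50, rng=None):
    """Sketch of the cascading IBP: sample layer matrices until a layer
    generates zero dishes, i.e. the network depth becomes finite."""
    rng = np.random.default_rng() if rng is None else rng
    matrices = []
    customers = num_visible            # layer 0: the visible units
    for _ in range(max_layers):        # safety cap, for the sketch only
        Z = sample_ibp(customers, alpha, beta, rng)
        if Z.shape[1] == 0:            # no dishes tasted -> cascade terminates
            break
        matrices.append(Z)
        customers = Z.shape[1]         # dishes of layer m are customers of layer m+1
    return matrices

layers = sample_cibp(num_visible=5, alpha=3.0, beta=1.0)
print([Z.shape for Z in layers])       # e.g. [(5, 7), (7, 4), (4, 1)] -- the depth is random
```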
  • CIBP PARAMETERS
    • Two main parameters: α and β
    • α defines the expected in-degree of each unit, i.e. its number of parents
    • β controls the expected out-degree, i.e. the number of children, through the equation shown on the slide
    • K(m) is the number of columns in layer m
    • α and β are layer-specific rather than constant across the whole network; they can be written as α(m) and β(m)
  • CIBP CONVERGENCE
    • Does the CIBP always yield a DBN of finite depth? - Yes!
    • How? By analysing the transition distribution of the Markov chain on layer widths
    • The transition is simply a Poisson distribution with mean λ(K(m); α, β)
    • The absorbing state, in which no 'dishes' are tasted, is always reached
    • A full mathematical proof of convergence is given in the appendix of the paper
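For reference, the transition the slide refers to can be written as below. The explicit form of λ is our reconstruction from the standard two-parameter IBP result (the expected number of dishes after K customers), not copied from the slide, so treat it as an assumption and consult the paper's appendix for the exact statement.

```latex
K^{(m+1)} \mid K^{(m)} \;\sim\; \mathrm{Poisson}\!\left(\lambda\bigl(K^{(m)};\,\alpha,\beta\bigr)\right),
\qquad
\lambda(K;\,\alpha,\beta) \;=\; \alpha \sum_{j=1}^{K} \frac{\beta}{\beta + j - 1}.
```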
  • CIBP CONVERGENCE
    [Figure: convergence illustration for α = 3, β = 1]
  • CIBP BASED PRIOR SAMPLES
    [Figure: network structures sampled from the CIBP prior]
  • NODE TYPES
    • The nonlinear Gaussian belief network (NLGBN) framework is used (a small sketch follows this slide)
    • A unit's pre-sigmoid value u is the weighted activation sum of its parents plus Gaussian noise with precision ν
    • The noisy sum is then transformed with the sigmoid function σ(·) to give the unit's output y
    [Figure: output distributions - black line: zero pre-sigmoid mean; blue line: pre-sigmoid mean of -1; red line: pre-sigmoid mean of +1; panels: Binary, Gaussian, Deterministic behaviour depending on the noise precision]
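A minimal sketch of one NLGBN unit, assuming the form described on the slide (a noisy pre-sigmoid activation with precision ν); the function below and the chosen weights and precisions are illustrative, not the paper's code.

```python
import numpy as np

def nlgbn_unit(parent_outputs, weights, bias, precision, rng=None):
    """Sample the output of one NLGBN unit: a sigmoid of a noisy weighted sum.

    A very large precision gives an (almost) deterministic sigmoid unit,
    a very small precision pushes the output towards {0, 1} (binary-like),
    and intermediate values give a continuous, roughly Gaussian spread.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = np.dot(weights, parent_outputs) + bias
    u = rng.normal(mean, 1.0 / np.sqrt(precision))   # noisy pre-sigmoid activation
    return 1.0 / (1.0 + np.exp(-u))                  # sigmoid squashing

parents = np.array([0.2, 0.9])
w, b = np.array([1.5, -0.7]), 0.1
for nu in (0.05, 1.0, 1000.0):        # binary-like, Gaussian-like, deterministic regimes
    samples = [nlgbn_unit(parents, w, b, nu) for _ in range(5)]
    print(nu, np.round(samples, 2))
```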
  • INFERENCE: JOINT DISTRIBUTION
    [Equation: the joint distribution, with terms annotated as: precision of the input data, Gaussian noise, activations, bias, weights, in-layer units, NLGBN distribution, layer number, weights matrix distribution, number of observations]
  • MARKOV CHAIN MONTE CARLO
    * Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003. http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf
  • INFERENCE
    • Task: find the posterior distribution over the structure and the parameters of the network
    • Conditioning is used to update the model part-by-part rather than modifying the whole model at each step
    • The process is split into four parts:
      - Edges: sample from the posterior distribution over each edge weight
      - Activations: sample from the posterior distributions over the Gaussian noise precisions
      - Structure: sample the ancestors of the visible units
      - Parameters: closely tied to the hyper-parameters
  • SAMPLING FROM THE STRUCTURE
    First phase:
    • For each layer
      • For each unit k in the layer
        • Check each connected unit in layer m+1, indexed by k'
        • Count the non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row
        • If the sum is zero, unit k' is a singleton parent (a small code sketch of this check follows after the second-phase slide)
    [Figure: two-layer example (Layer 1, Layer 2) with the relevant binary matrix entries highlighted]
  • SAMPLING FROM THE STRUCTURE
    Second phase:
    • Considers only the singletons
    • Option a: add a new parent
    • Option b: delete the connection to child k
    • Decisions are made by a Metropolis-Hastings operator using a birth/death process
    • In the end, units that are not ancestors of the visible units are discarded
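As a small illustration of the singleton-parent check from the first phase, the snippet below counts the non-zero entries of a column while excluding one row; the matrix and indices are made up for the example.

```python
import numpy as np

# Hypothetical binary matrix for one layer: rows index child units k,
# columns index parent units k' in the layer above.
Z = np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
])

def is_singleton_parent(Z, k, k_prime):
    """Parent k' is a singleton for child k if no other child uses it,
    i.e. the k'-th column has no non-zero entries outside row k."""
    column = Z[:, k_prime].copy()
    column[k] = 0                      # exclude the entry in the k-th row
    return column.sum() == 0

print(is_singleton_parent(Z, k=2, k_prime=3))  # True: only child 2 uses parent 3
print(is_singleton_parent(Z, k=0, k_prime=2))  # False: child 1 also uses parent 2
```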
  • EXPERIMENTS
    • Three image datasets were used for the experiments:
      - Olivetti faces
      - MNIST digit data
      - Frey faces
    • Performance test: image reconstruction
    • The bottom halves of the images were removed and the model had to reconstruct the missing data from the top half alone
    • A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make the task easier
  • OLIVETTI FACES
    350 + 50 images of 40 distinct subjects, 64x64
    ~3 hidden layers: around 70 units in each layer
  • OLIVETTI FACES
    [Figure: raw predictive fantasies from the model]
  • MNIST DIGIT DATA
    50 + 10 images of 10 digits, 28x28
    ~3 hidden layers: 120, 100, 70 units in the hidden layers
  • FREY FACES
    1865 + 100 images of a single face with different expressions, 20x28
    ~3 hidden layers: 260, 120, 35 units in the hidden layers
  • DISCUSSION
    • Addresses the structural issues with deep belief networks
    • Unites two areas of research: nonparametric Bayesian methods and deep belief networks
    • Introduces the cascading Indian buffet process to allow an unbounded number of layers
    • The CIBP always converges to a finite-depth network
    • Result: the algorithm learns the effective model complexity
  • DISCUSSION
    • A very processor-intensive algorithm - finding reconstructions took 'a few hours of CPU time'
    • Is it much better than fixed-dimensionality DBNs?
  • THANK YOU!