Bayesian inversion of deterministic dynamic causal models

  1. Bayesian inversion of deterministic dynamic causal models
     Kay H. Brodersen¹,² & Ajita Gupta²
     ¹ Department of Economics, University of Zurich, Switzerland
     ² Department of Computer Science, ETH Zurich, Switzerland
     (Mike West)
  2. Overview
     1. Problem setting: model; likelihood; prior; posterior.
     2. Variational Laplace: factorization of the posterior; why the free energy; energies; gradient ascent; adaptive step size; example.
     3. Sampling: transformation method; rejection method; Gibbs sampling; MCMC; Metropolis-Hastings; example.
     4. Model comparison: model evidence; Bayes factors; free energy; prior arithmetic mean; posterior harmonic mean; Savage-Dickey; example.
     With material from Will Penny, Klaas Enno Stephan, Chris Bishop, and Justin Chumbley.
  3. Section 1: Problem setting
  4. Model = likelihood + prior
  5. Bayesian inference is conceptually straightforward
     Question 1: What do the data tell us about the model parameters? Compute the posterior:
     $$ p(\theta \mid y, m) = \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{p(y \mid m)} $$
     Question 2: Which model is best? Compute the model evidence:
     $$ p(m \mid y) \propto p(y \mid m)\, p(m), \qquad p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, \mathrm{d}\theta $$
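As a concrete illustration of these two computations, here is a minimal numerical sketch on a 1-D parameter grid. The Gaussian prior, Gaussian likelihood, and data point are illustrative assumptions, not the models used later in the talk.

```python
import numpy as np

theta = np.linspace(-5, 5, 1001)          # 1-D parameter grid
dtheta = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)      # p(theta|m) = N(0,1)
y = 1.7                                                   # one observed data point
lik = np.exp(-0.5 * (y - theta)**2) / np.sqrt(2 * np.pi)  # p(y|theta,m) = N(theta,1)

evidence = np.sum(lik * prior) * dtheta   # p(y|m) = integral of p(y|theta,m) p(theta|m)
posterior = lik * prior / evidence        # Bayes' rule (Question 1)
print(np.log(evidence))                   # log model evidence (Question 2)
```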
  6. Section 2: Variational Laplace
  7. Variational Laplace in a nutshell
     Negative free energy as an approximation to the log model evidence:
     $$ \ln p(y \mid m) = F + KL\!\left[\, q(\theta,\lambda)\,\|\,p(\theta,\lambda \mid y) \,\right] $$
     $$ F = \left\langle \ln p(y \mid \theta,\lambda) \right\rangle_q - KL\!\left[\, q(\theta,\lambda)\,\|\,p(\theta,\lambda \mid m) \,\right] $$
     Mean-field approximation:
     $$ p(\theta,\lambda \mid y) \approx q(\theta,\lambda) = q(\theta)\, q(\lambda) $$
     Maximise the negative free energy with respect to q, i.e. minimise the divergence, by maximising the variational energies:
     $$ q(\theta) \propto \exp\!\left( I(\theta) \right) = \exp\!\left( \langle \ln p(y,\theta,\lambda) \rangle_{q(\lambda)} \right) $$
     $$ q(\lambda) \propto \exp\!\left( I(\lambda) \right) = \exp\!\left( \langle \ln p(y,\theta,\lambda) \rangle_{q(\theta)} \right) $$
     Iterative updating of the sufficient statistics of the approximate posteriors by gradient ascent. (K.E. Stephan)
  8. Variational Laplace: Assumptions
     - Mean-field approximation: the posterior factorises, q(θ, λ) = q(θ) q(λ).
     - Laplace approximation: each factor is assumed to be Gaussian.
     (W. Penny)
  9. Variational Laplace: Inversion strategy
     Recall the relationship between the log model evidence and the negative free energy F:
     $$ \ln p(y \mid m) = \underbrace{\left\langle \ln p(y \mid \theta) \right\rangle_q - KL\!\left[\, q(\theta)\,\|\,p(\theta \mid m) \,\right]}_{=:\,F} + \underbrace{KL\!\left[\, q(\theta)\,\|\,p(\theta \mid y, m) \,\right]}_{\geq\, 0} $$
     Maximizing F implies two things:
     (i) we obtain a good approximation to ln p(y|m);
     (ii) the KL divergence between q(θ) and p(θ|y,m) becomes minimal.
     Practically, we can maximize F by iteratively maximizing the variational energies, as in EM. (W. Penny)
  10. Variational Laplace: Implementation as a gradient-ascent scheme (Newton's method)
      Newton's method for finding a root (1-D):
      $$ x_{\text{new}} = x_{\text{old}} - \frac{f(x_{\text{old}})}{f'(x_{\text{old}})} $$
      Compute the gradient vector:
      $$ j_\theta(i) = \frac{\partial I(\theta)}{\partial \theta(i)} $$
      Compute the curvature matrix:
      $$ H_\theta(i,j) = \frac{\partial^2 I(\theta)}{\partial \theta(i)\, \partial \theta(j)} $$
  11. Variational Laplace: Implementation as a gradient-ascent scheme (Newton's method, continued)
      Compute the Newton update (change):
      $$ \Delta\theta = -H_\theta^{-1}\, j_\theta $$
      New estimate:
      $$ \theta_{\text{new}} = \theta_{\text{old}} + \Delta\theta $$
      Big curvature implies a small step; small curvature implies a big step. (W. Penny)
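To make the update rule concrete, here is a minimal sketch of the 1-D Newton update shown above; the target function f(x) = x² − 2 is an illustrative assumption. In VL the same idea is applied to the gradient of the variational energy, with the curvature matrix H playing the role of f′.

```python
def f(x):
    return x**2 - 2.0       # root at sqrt(2)

def fprime(x):
    return 2.0 * x

x = 3.0                     # starting point (it matters; see the next slide)
for _ in range(6):
    x = x - f(x) / fprime(x)    # x_new = x_old - f(x_old) / f'(x_old)
print(x)                    # ~ 1.41421356 = sqrt(2)
```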
  12. Newton's method – demonstration
      Newton's method is very efficient; however, its solution is sensitive to the starting point, as the demonstration shows.
      http://demonstrations.wolfram.com/LearningNewtonsMethod/
  13. Variational Laplace: Nonlinear regression (example)
      [Figure: simulated data y(t) plotted against t, generated from the nonlinear likelihood model using known ("ground truth") parameter values.]
      (W. Penny)
  14. Variational Laplace: Nonlinear regression (example)
      We begin by defining our prior.
      [Figure: prior density over θ, with the ground-truth parameter values marked.]
      (W. Penny)
  15. Variational Laplace: Nonlinear regression (example)
      [Figure: posterior density over θ and the VL optimization path (4 iterations).]
      Starting point: (2.9, 1.65). True value: (3.4012, 2.0794). VL estimate: (3.4, 2.1).
      (W. Penny)
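The slide's exact likelihood model is not recoverable from this transcript, but the following sketch shows the same recipe on a hypothetical nonlinear regression problem: Newton ascent to the mode of the log joint, followed by the Laplace (Gaussian) approximation. Every concrete choice below (the predictor g, noise level, prior, starting point) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.1, 10, 40)
theta_true = np.array([1.0, 0.5])

def g(theta):                        # hypothetical nonlinear predictor
    return theta[0] * (1 - np.exp(-theta[1] * t))

y = g(theta_true) + 0.05 * rng.standard_normal(t.size)
s2, p2 = 0.05**2, 1.0                # noise variance, prior variance (assumed known)

def logjoint(theta):                 # ln p(y|theta) + ln p(theta), up to constants
    return -0.5 * np.sum((y - g(theta))**2) / s2 - 0.5 * theta @ theta / p2

# crude numerical derivatives; a real implementation would use analytic ones
def grad(f, x, h=1e-5):
    return np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(len(x))])

def hess(f, x, h=1e-4):
    return np.array([(grad(f, x + h*e) - grad(f, x - h*e)) / (2*h) for e in np.eye(len(x))])

theta = np.array([0.9, 0.4])         # start near the mode: plain Newton is not
for _ in range(8):                   # globally robust (see the previous slide)
    theta = theta - np.linalg.solve(hess(logjoint, theta), grad(logjoint, theta))

Sigma = np.linalg.inv(-hess(logjoint, theta))  # Laplace posterior covariance
print("posterior mean ~", theta)
print("posterior covariance ~", Sigma)
```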
  16. Section 3: Sampling
  17. Sampling
      Deterministic approximations:
      + computationally efficient
      + efficient representation
      + learning rules may give additional insight
      - application initially involves hard work
      - systematic error
      Stochastic approximations:
      + asymptotically exact
      + easily applicable general-purpose algorithms
      - computationally expensive
      - storage intensive
  18. Strategy 1 – Transformation method
      We can obtain samples from some distribution p(z) by first sampling from the uniform distribution and then transforming these samples through the inverse cumulative distribution function of the target:
      $$ z^{(\tau)} = F^{-1}\!\left( u^{(\tau)} \right) $$
      [Figure: a sample u = 0.9 from p(u) is mapped through the inverse cdf to z = 1.15 under p(z).]
      + yields high-quality samples
      + easy to implement
      + computationally efficient
      - obtaining the inverse cdf can be difficult
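A minimal sketch of the transformation method, assuming an exponential target, whose cdf inverts in closed form (exactly the case where this method shines):

```python
import numpy as np

# Target: p(z) = lam * exp(-lam * z), with cdf F(z) = 1 - exp(-lam * z)
rng = np.random.default_rng(1)
lam = 1.0
u = rng.uniform(size=100_000)        # u ~ Uniform(0, 1)
z = -np.log(1 - u) / lam             # z = F^{-1}(u) ~ Exponential(lam)
print(z.mean())                      # ~ 1/lam
```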
  19. Strategy 2 – Rejection method
      When the transformation method cannot be applied, we can resort to a more general method called rejection sampling. Here, we draw random numbers from a simpler proposal distribution q(z) and keep only some of these samples. The proposal distribution q(z), scaled by a factor k, represents an envelope of p(z).
      + yields high-quality samples
      + easy to implement
      + can be computationally efficient
      - computationally inefficient if the proposal is a poor approximation
      Bishop (2007) PRML
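A minimal rejection-sampling sketch, assuming a standard-normal target and a Laplace proposal; k = 1.32 suffices to make k·q(z) an envelope of p(z) for this pair:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 1.32                                 # envelope constant: k*q(z) >= p(z) for all z

def p(z):                                # target density (standard normal)
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def q(z):                                # proposal density (Laplace)
    return 0.5 * np.exp(-np.abs(z))

n = 100_000
z = rng.laplace(size=n)                  # draw from the proposal
u = rng.uniform(size=n)
samples = z[u < p(z) / (k * q(z))]       # keep with probability p / (k q)
print(samples.mean(), samples.std())     # ~ 0, ~ 1
print(len(samples) / n)                  # acceptance rate ~ 1/k ~ 0.76
```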
  20. Strategy 3 – Gibbs sampling
      Often the joint distribution of several random variables is unavailable, whereas the full-conditional distributions are available. In this case, we can cycle over the full-conditionals to obtain samples from the joint distribution.
      + easy to implement
      - samples are serially correlated
      - the full-conditionals may not be available
      Bishop (2007) PRML
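A minimal Gibbs sketch, assuming a bivariate Gaussian target with correlation ρ, where both full-conditionals are 1-D Gaussians:

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.8, 50_000
z1, z2 = 0.0, 0.0
samples = np.empty((n, 2))
for t in range(n):
    # full-conditional p(z1 | z2) = N(rho * z2, 1 - rho^2), and vice versa
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    samples[t] = z1, z2
print(np.corrcoef(samples.T)[0, 1])   # ~ rho (note: samples are serially correlated)
```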
  21. Strategy 4 – Markov Chain Monte Carlo (MCMC)
      Idea: we can sample from a large class of distributions, and overcome the problems that previous methods face in high dimensions, using a framework called Markov Chain Monte Carlo.
      Increase in density, p(z*) ≥ p(z^(τ)): accept the candidate z* with probability 1.
      Decrease in density: accept with probability
      $$ \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}, $$
      where p̃ denotes the unnormalised target density.
      (M. Dümcke)
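A minimal Metropolis sketch with a symmetric Gaussian random-walk proposal, implementing exactly the accept/reject rule above; the standard-normal target is an illustrative assumption. Note that only an unnormalised density is needed.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_tilde(z):                          # unnormalised target (assumed: N(0,1))
    return np.exp(-0.5 * z**2)

n, width = 50_000, 1.0                   # the proposal width matters (next slide)
z = 0.0
chain = np.empty(n)
for t in range(n):
    z_star = z + width * rng.standard_normal()        # propose a candidate
    if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
        z = z_star                                     # accept (always, if uphill)
    chain[t] = z                                       # else keep the current point
print(chain.mean(), chain.std())         # ~ 0, ~ 1 after convergence
```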
  22. MCMC demonstration: finding a good proposal density
      When the proposal distribution is too narrow, we might miss a mode. When it is too wide, we obtain long constant stretches without an acceptance.
      http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/
  23. MCMC for DCM
      Metropolis-Hastings (MH) creates a series of random points θ⁽¹⁾, θ⁽²⁾, …, whose distribution converges to the target distribution of interest. For us, this is the posterior density p(θ|y).
      We could use the following proposal distribution:
      $$ q\!\left( \theta^{(\tau)} \mid \theta^{(\tau-1)} \right) = \mathcal{N}\!\left( \theta^{(\tau-1)},\ \sigma C_\theta \right), $$
      where C_θ is the prior covariance and σ is a scaling factor, adapted such that the acceptance rate lies between 20% and 40%.
      (W. Penny)
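A sketch of the scaling heuristic just described, grafted onto a toy Metropolis sampler. The target, the adaptation interval, and the 0.9/1.1 adjustment factors are assumptions; only the 20-40% acceptance band comes from the slide. In practice one would adapt σ only during burn-in, since ongoing adaptation breaks detailed balance.

```python
import numpy as np

rng = np.random.default_rng(5)
p_tilde = lambda z: np.exp(-0.5 * z**2)   # unnormalised target (assumption)

z, sigma = 0.0, 5.0                       # deliberately poor initial sigma
accepts_win = 0
for t in range(1, 20_001):
    z_star = z + sigma * rng.standard_normal()
    if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
        z = z_star
        accepts_win += 1
    if t % 100 == 0:                      # adapt every 100 steps (burn-in only)
        rate = accepts_win / 100
        if rate < 0.20:
            sigma *= 0.9                  # too few acceptances: smaller steps
        elif rate > 0.40:
            sigma *= 1.1                  # too many acceptances: larger steps
        accepts_win = 0
print(sigma)                              # settles where acceptance is ~20-40%
```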
  24. MCMC for DCM
      [Figure only.] (W. Penny)
  25. MCMC – example
      [Figure only.] (W. Penny)
  26. MCMC – example
      [Figure only.] (W. Penny)
  27. MCMC – example
      [Figure: posterior estimates from Metropolis-Hastings vs. Variational Laplace.] (W. Penny)
  28. Section 4: Model comparison
  29. Model evidence
      [Figure only.] (W. Penny)
  30. Prior arithmetic mean
      [Figure only.] (W. Penny)
  31. Posterior harmonic mean
      [Figure only.] (W. Penny)
  32. Savage-Dickey ratio
      In many situations we wish to compare models that are nested. For example:
      m_F: full model with parameters θ = (θ₁, θ₂)
      m_R: reduced model with θ = (θ₁, 0)
      In this case, we can use the Savage-Dickey ratio to obtain a Bayes factor without having to compute the two model evidences:
      $$ B_{RF} = \frac{p(\theta_2 = 0 \mid y, m_F)}{p(\theta_2 = 0 \mid m_F)} $$
      [Figure: prior and posterior densities of θ₂ evaluated at zero; in the example shown, B_RF = 0.9.]
      (W. Penny)
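Under the Laplace assumption the prior and (approximate) posterior over θ₂ are Gaussian, so the Savage-Dickey ratio reduces to two density evaluations at zero. The prior and posterior moments below are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

prior_m, prior_v = 0.0, 1.0    # p(theta2 | m_F), e.g. a shrinkage prior
post_m, post_v = 0.3, 0.2      # p(theta2 | y, m_F), e.g. from VL or MCMC

B_RF = normal_pdf(0.0, post_m, post_v) / normal_pdf(0.0, prior_m, prior_v)
print(B_RF)                    # > 1 favours the reduced model (theta2 = 0)
```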
  33. Comparison of methods
      [Figure: MCMC vs. VL.] Chumbley et al. (2007) NeuroImage
  34. Comparison of methods
      [Figure: estimated minus true log Bayes factor.] Chumbley et al. (2007) NeuroImage
