
# Bayesian inversion of deterministic dynamic causal models

1. Bayesian inversion of deterministic dynamic causal models. Kay H. Brodersen¹,² & Ajita Gupta². ¹Department of Economics, University of Zurich, Switzerland; ²Department of Computer Science, ETH Zurich, Switzerland. [Figure: Mike West]
2. Overview:
   1. Problem setting: model; likelihood; prior; posterior.
   2. Variational Laplace: factorization of the posterior; why the free energy; variational energies; gradient ascent; adaptive step size; example.
   3. Sampling: transformation method; rejection method; Gibbs sampling; MCMC; Metropolis-Hastings; example.
   4. Model comparison: model evidence; Bayes factors; free energy; prior arithmetic mean; posterior harmonic mean; Savage-Dickey; example.
   With material from Will Penny, Klaas Enno Stephan, Chris Bishop, and Justin Chumbley.
3. Part 1: Problem setting
4. Model = likelihood + prior
5. Bayesian inference is conceptually straightforward.
   Question 1: what do the data tell us about the model parameters? Compute the posterior:
   $p(\theta \mid y, m) = \dfrac{p(y \mid \theta, m)\, p(\theta \mid m)}{p(y \mid m)}$.
   Question 2: which model is best? Compute the model evidence:
   $p(m \mid y) \propto p(y \mid m)\, p(m)$, where $p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta$.
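As a concrete illustration of both computations (not part of the original slides), the posterior and the evidence of a one-parameter toy model can be evaluated on a grid; the model, prior, and data below are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Toy model (assumed for illustration): y_i ~ N(theta, 1), prior theta ~ N(0, 2^2).
y = np.array([1.2, 0.7, 1.9])
theta = np.linspace(-10, 10, 2001)          # grid over the single parameter
d_theta = theta[1] - theta[0]

prior = norm.pdf(theta, loc=0, scale=2)                           # p(theta | m)
lik = np.prod(norm.pdf(y[:, None], loc=theta, scale=1), axis=0)   # p(y | theta, m)

evidence = np.sum(lik * prior) * d_theta     # p(y | m) = integral of p(y|theta,m) p(theta|m) dtheta
posterior = lik * prior / evidence           # p(theta | y, m) by Bayes' rule

print("log evidence:", np.log(evidence))
print("posterior mean:", np.sum(theta * posterior) * d_theta)
```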
6. Part 2: Variational Laplace
7. Variational Laplace in a nutshell.
   Negative free energy as an approximation to the log model evidence:
   $\ln p(y \mid m) = F + KL\big[q(\theta, \lambda)\,\|\,p(\theta, \lambda \mid y)\big]$, with
   $F = \big\langle \ln p(y \mid \theta, \lambda) \big\rangle_q - KL\big[q(\theta, \lambda)\,\|\,p(\theta, \lambda \mid m)\big]$.
   Mean-field approximation: $p(\theta, \lambda \mid y) \approx q(\theta, \lambda) = q(\theta)\, q(\lambda)$.
   Maximising the negative free energy w.r.t. $q$ minimises the divergence, by maximising the variational energies:
   $q(\theta) \propto \exp\big(I(\theta)\big) = \exp\big(\langle \ln p(y, \theta, \lambda) \rangle_{q(\lambda)}\big)$ and
   $q(\lambda) \propto \exp\big(I(\lambda)\big) = \exp\big(\langle \ln p(y, \theta, \lambda) \rangle_{q(\theta)}\big)$.
   The sufficient statistics of the approximate posteriors are updated iteratively by gradient ascent. (K.E. Stephan)
8. Variational Laplace: assumptions. Mean-field approximation; Laplace approximation. (W. Penny)
9. Variational Laplace: inversion strategy.
   Recall the relationship between the log model evidence and the negative free energy $F$:
   $\ln p(y \mid m) = \underbrace{E_q\big[\ln p(y \mid \theta)\big] - KL\big[q(\theta)\,\|\,p(\theta \mid m)\big]}_{=:\,F} + \underbrace{KL\big[q(\theta)\,\|\,p(\theta \mid y, m)\big]}_{\ge\,0}$.
   Maximizing $F$ implies two things: (i) we obtain a good approximation to $\ln p(y \mid m)$; (ii) the KL divergence between $q(\theta)$ and $p(\theta \mid y, m)$ becomes minimal. Practically, we can maximize $F$ by iteratively (EM-style) maximizing the variational energies. (W. Penny)
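This decomposition is the standard variational identity; written out from Bayes' rule (not specific to DCM):

```latex
\ln p(y \mid m)
  = \underbrace{\int q(\theta)\,\ln\frac{p(y,\theta \mid m)}{q(\theta)}\,d\theta}_{=\,F}
  + \underbrace{\int q(\theta)\,\ln\frac{q(\theta)}{p(\theta \mid y,m)}\,d\theta}_{=\,KL\left[q(\theta)\,\|\,p(\theta \mid y,m)\right]\,\ge\,0},
\qquad
F = \mathbb{E}_q\!\left[\ln p(y \mid \theta, m)\right] - KL\left[q(\theta)\,\|\,p(\theta \mid m)\right].
```

Because the second term is non-negative, $F$ is a lower bound on $\ln p(y \mid m)$: raising $F$ simultaneously tightens the bound and pulls $q(\theta)$ towards the posterior.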
10. Variational Laplace: implementation as a gradient-ascent scheme (Newton's method).
    Newton's method for finding a root (1D): $x_{\text{new}} = x_{\text{old}} - \dfrac{f(x_{\text{old}})}{f'(x_{\text{old}})}$.
    Compute the gradient vector $j_\theta(i) = \dfrac{\partial I(\theta)}{\partial \theta(i)}$ and the curvature matrix $H_\theta(i, j) = \dfrac{\partial^2 I(\theta)}{\partial \theta(i)\, \partial \theta(j)}$.
11. Variational Laplace: implementation as a gradient-ascent scheme (Newton's method), continued.
    Compute the Newton update (change) $\Delta\theta = -H_\theta^{-1} j_\theta$ and the new estimate $\theta_{\text{new}} = \theta_{\text{old}} + \Delta\theta$ (cf. Newton's method for finding a root in 1D: $x_{\text{new}} = x_{\text{old}} - f(x_{\text{old}}) / f'(x_{\text{old}})$).
    Big curvature -> small step; small curvature -> big step. (W. Penny)
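A minimal sketch of one such Newton step on a generic variational energy $I(\theta)$, using finite-difference derivatives; the quadratic energy and starting point are placeholders, not the DCM energies used in the slides.

```python
import numpy as np

def newton_step(I, theta, eps=1e-5):
    """One Newton update for maximizing I(theta): dtheta = -H^{-1} j."""
    n = theta.size
    j = np.zeros(n)           # gradient vector
    H = np.zeros((n, n))      # curvature (Hessian) matrix
    for i in range(n):
        e_i = eps * (np.arange(n) == i)
        j[i] = (I(theta + e_i) - I(theta - e_i)) / (2 * eps)
        for k in range(n):
            e_k = eps * (np.arange(n) == k)
            H[i, k] = (I(theta + e_i + e_k) - I(theta + e_i - e_k)
                       - I(theta - e_i + e_k) + I(theta - e_i - e_k)) / (4 * eps**2)
    return theta - np.linalg.solve(H, j)   # big curvature -> small step

# Placeholder energy: a quadratic peaked at (3.4, 2.1), echoing the example values later in the deck
I = lambda th: -0.5 * np.sum((th - np.array([3.4, 2.1]))**2)
theta = np.array([2.9, 1.65])
for _ in range(4):
    theta = newton_step(I, theta)
print(theta)   # converges to the maximizer [3.4, 2.1]
```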
12. Newton's method: demonstration. Newton's method is very efficient; however, its solution is sensitive to the starting point, as the linked demonstration shows. http://demonstrations.wolfram.com/LearningNewtonsMethod/
13. Variational Laplace: nonlinear regression example. [Figure: the model (likelihood) with the ground-truth parameter values used to generate the data, and the resulting data y(t) plotted against t.] (W. Penny)
14. Variational Laplace: nonlinear regression example. We begin by defining our prior. [Figure: prior density, with the ground-truth value marked.] (W. Penny)
15. Variational Laplace: nonlinear regression example. [Figure: posterior density and the VL optimization path (4 iterations).] Starting point: (2.9, 1.65); true value: (3.4012, 2.0794); VL estimate: (3.4, 2.1). (W. Penny)
16. Part 3: Sampling
17. Sampling: deterministic vs. stochastic approximations.
    Deterministic approximations: computationally efficient; efficient representation; learning rules may give additional insight; but application initially involves hard work, and they incur a systematic error.
    Stochastic approximations: asymptotically exact; easily applicable general-purpose algorithms; but computationally expensive and storage intensive.
18. Strategy 1 – Transformation method. We can obtain samples from some distribution $p(z)$ by first sampling from the uniform distribution and then transforming these samples through the inverse cdf: $z^{(\tau)} = F^{-1}\big(u^{(\tau)}\big)$. [Figure: a uniform sample $u = 0.9$ is mapped to $z = 1.15$.] Pros: yields high-quality samples; easy to implement; computationally efficient. Con: obtaining the inverse cdf can be difficult.
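A minimal sketch of the transformation method for an exponential target, where the inverse cdf is available in closed form; the exponential target and rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: p(z) = lam * exp(-lam * z); its cdf F(z) = 1 - exp(-lam*z) inverts to F^{-1}(u) = -ln(1-u)/lam.
lam = 1.0
u = rng.uniform(size=10_000)     # samples from U(0, 1)
z = -np.log(1 - u) / lam         # transformed samples follow p(z)

print(z.mean())                  # approximately 1/lam
```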
19. Strategy 2 – Rejection method. When the transformation method cannot be applied, we can resort to a more general method called rejection sampling. Here, we draw random numbers from a simpler proposal distribution $q(z)$ and keep only some of these samples. The proposal distribution $q(z)$, scaled by a factor $k$, represents an envelope of $p(z)$. Pros: yields high-quality samples; easy to implement; can be computationally efficient. Con: computationally inefficient if the proposal is a poor approximation. (Bishop (2007) PRML)
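A minimal rejection-sampling sketch with a Gaussian-mixture target and a scaled Gaussian proposal as the envelope; both densities and the scaling factor are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def p(z):                                   # target density: a two-component Gaussian mixture
    return 0.6 * norm.pdf(z, -1, 0.5) + 0.4 * norm.pdf(z, 2, 0.8)

k = 4.0                                     # chosen large enough that k*q(z) >= p(z) everywhere

samples = []
while len(samples) < 5000:
    z = rng.normal(0, 3)                    # draw from the proposal q(z) = N(0, 3^2)
    if rng.uniform() < p(z) / (k * norm.pdf(z, 0, 3)):   # keep with probability p(z) / (k q(z))
        samples.append(z)

print(np.mean(samples))                     # approximately 0.6*(-1) + 0.4*2 = 0.2
```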
20. Strategy 3 – Gibbs sampling. Often the joint distribution of several random variables is unavailable, whereas the full-conditional distributions are available. In this case, we can cycle over the full-conditionals to obtain samples from the joint distribution. Pro: easy to implement. Cons: samples are serially correlated; the full-conditionals may not be available. (Bishop (2007) PRML)
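A minimal Gibbs sampler for a bivariate Gaussian, where both full-conditionals are one-dimensional Gaussians; the target correlation is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                                   # correlation of the zero-mean bivariate Gaussian target
n_samples, z1, z2 = 10_000, 0.0, 0.0
samples = np.empty((n_samples, 2))

for t in range(n_samples):
    # Full-conditionals of a bivariate Gaussian are themselves Gaussian:
    # z1 | z2 ~ N(rho*z2, 1 - rho^2), and symmetrically for z2 | z1.
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    samples[t] = (z1, z2)

print(np.corrcoef(samples.T)[0, 1])         # approximately rho (samples are serially correlated)
```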
21. Strategy 4 – Markov chain Monte Carlo (MCMC). Idea: we can sample from a large class of distributions, and overcome the problems that previous methods face in high dimensions, using a framework called Markov chain Monte Carlo. Starting from the current sample $z^\tau$, a proposal $z^*$ is treated as follows: if the (unnormalized) density increases, $\tilde p(z^*) \ge \tilde p(z^\tau)$, the proposal is accepted with probability 1; if it decreases, the proposal is accepted with probability $\tilde p(z^*) / \tilde p(z^\tau)$. (M. Dümcke)
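A minimal random-walk Metropolis sketch implementing exactly this accept/reject rule for an unnormalized 1-D target; the target density and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_tilde(z):                              # unnormalized target density
    return np.exp(-0.5 * (z - 1.0)**2 / 0.5**2)

z = 0.0
chain = []
for tau in range(20_000):
    z_star = z + rng.normal(0, 0.5)          # symmetric random-walk proposal
    # Accept with probability min(1, p_tilde(z*) / p_tilde(z)):
    # uphill moves are always accepted, downhill moves only sometimes.
    if rng.uniform() < p_tilde(z_star) / p_tilde(z):
        z = z_star
    chain.append(z)

print(np.mean(chain), np.std(chain))         # approximately 1.0 and 0.5
```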
22. MCMC demonstration: finding a good proposal density. When the proposal distribution is too narrow, we might miss a mode. When it is too wide, we obtain long constant stretches without an acceptance. http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/
23. MCMC for DCM. Metropolis-Hastings (MH) creates a series of random points $\theta^{(1)}, \theta^{(2)}, \ldots$ whose distribution converges to the target distribution of interest; for us, this is the posterior density $p(\theta \mid y)$. We could use a random-walk proposal whose increments are drawn from $\mathcal{N}(0, \sigma C_\theta)$, where $C_\theta$ is the prior covariance and $\sigma$ is a scaling factor adapted such that the acceptance rate lies between 20% and 40%. (W. Penny)
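A sketch of this tuning heuristic under placeholder assumptions: a generic 2-D log-posterior stands in for the DCM posterior, and the prior covariance and adaptation constants are made up for illustration (in a real run, adaptation of σ would be restricted to a burn-in period).

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(theta):                          # placeholder log-posterior, standing in for log p(theta | y)
    return -0.5 * np.sum(theta**2)

C_prior = np.diag([1.0, 4.0])                 # prior covariance C_theta (assumed values)
L = np.linalg.cholesky(C_prior)
sigma, theta, accepted = 1.0, np.zeros(2), []

for tau in range(5_000):
    theta_star = theta + np.sqrt(sigma) * (L @ rng.normal(size=2))   # increment ~ N(0, sigma * C_theta)
    if np.log(rng.uniform()) < log_post(theta_star) - log_post(theta):
        theta = theta_star
        accepted.append(1)
    else:
        accepted.append(0)
    if tau % 100 == 99:                       # adapt sigma every 100 steps
        rate = np.mean(accepted[-100:])
        if rate > 0.40:
            sigma *= 1.5                      # too many acceptances: take bigger steps
        elif rate < 0.20:
            sigma /= 1.5                      # too few acceptances: take smaller steps

print(sigma, np.mean(accepted[-1000:]))       # acceptance rate settles roughly between 20% and 40%
```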
24. MCMC for DCM (continued). (W. Penny)
25. MCMC – example. (W. Penny)
26. MCMC – example (continued). (W. Penny)
27. MCMC – example. [Figure: comparison of Metropolis-Hastings and Variational Laplace posteriors.] (W. Penny)
28. Part 4: Model comparison
29. Model evidence. (W. Penny)
30. Prior arithmetic mean estimator of the model evidence. (W. Penny)
31. Posterior harmonic mean estimator of the model evidence. (W. Penny; a Monte-Carlo sketch of both estimators follows below.)
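The two estimators named on these slides can be sketched on a conjugate Gaussian model, where the true evidence is available in closed form for comparison; the model, data, and sample sizes below are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(5)

# Conjugate toy model (assumed): y_i ~ N(theta, 1), prior theta ~ N(0, 1).
y = rng.normal(0.5, 1.0, size=20)
n, prior_var, lik_var = len(y), 1.0, 1.0

def log_lik(theta):
    """log p(y | theta) evaluated for a vector of theta samples."""
    return np.sum(norm.logpdf(y[:, None], loc=theta, scale=np.sqrt(lik_var)), axis=0)

# Exact log evidence: marginally, y ~ N(0, lik_var*I + prior_var*1 1^T).
cov = lik_var * np.eye(n) + prior_var * np.ones((n, n))
log_ev_exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Prior arithmetic mean: p(y|m) is approximated by the average of p(y|theta_i), theta_i ~ prior.
theta_prior = rng.normal(0.0, np.sqrt(prior_var), size=50_000)
log_ev_prior_mean = np.log(np.mean(np.exp(log_lik(theta_prior))))

# Posterior harmonic mean: 1/p(y|m) is approximated by the average of 1/p(y|theta_i), theta_i ~ posterior.
post_var = 1.0 / (1.0 / prior_var + n / lik_var)
post_mean = post_var * np.sum(y) / lik_var
theta_post = rng.normal(post_mean, np.sqrt(post_var), size=50_000)
log_ev_harm_mean = -np.log(np.mean(np.exp(-log_lik(theta_post))))

print(log_ev_exact, log_ev_prior_mean, log_ev_harm_mean)   # the harmonic-mean estimate is known to be the least stable
```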
32. Savage-Dickey ratio. In many situations we wish to compare models that are nested. For example: $m_F$, a full model with parameters $\theta = (\theta_1, \theta_2)$, and $m_R$, a reduced model with $\theta = (\theta_1, 0)$. In this case, we can use the Savage-Dickey ratio to obtain a Bayes factor without having to compute the two model evidences: $B_{RF} = \dfrac{p(\theta_2 = 0 \mid y, m_F)}{p(\theta_2 = 0 \mid m_F)}$. [Figure: example with $B_{RF} = 0.9$.] (W. Penny)
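For Gaussian priors and (approximately) Gaussian posteriors, the Savage-Dickey ratio reduces to evaluating two normal densities at $\theta_2 = 0$; a minimal sketch, where all numbers are illustrative assumptions rather than values from the slides.

```python
from scipy.stats import norm

# Assumed prior and (approximate) posterior over theta_2 under the full model m_F.
prior_mean, prior_sd = 0.0, 1.0      # p(theta_2 | m_F)
post_mean, post_sd = 0.35, 0.40      # p(theta_2 | y, m_F), e.g. from VL or MCMC

# Savage-Dickey: B_RF = p(theta_2 = 0 | y, m_F) / p(theta_2 = 0 | m_F)
B_RF = norm.pdf(0.0, post_mean, post_sd) / norm.pdf(0.0, prior_mean, prior_sd)
print(B_RF)   # values > 1 favour the reduced model, values < 1 favour the full model
```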
33. Comparison of methods. [Figure: MCMC vs. VL estimates.] (Chumbley et al. (2007) NeuroImage)
34. Comparison of methods: estimated minus true log Bayes factor. (Chumbley et al. (2007) NeuroImage)