
A Bayesian Search for the Needle in the Haystack


Master's project by Timothée Stumpf-Fétizon. Barcelona GSE Master's Degree in Data Science

Published in: Economy & Finance

  1. A Bayesian Search for the Needle in the Haystack. Timothée Stumpf-Fétizon, Barcelona GSE, July 23, 2015
  2. Introduction - Abstract
     Bayesian Model Averaging is a technique that systematically searches a model space (e.g. linear regression models) for promising models. It estimates the coefficients as weighted averages over all models, weighting each model by how likely it is given the data. The estimates will be close to the ones you would obtain from fitting the "true" nested model, and no knowledge of that model is required. Implementing the technique in high dimensions is computationally challenging, and I propose an improvement to the state-of-the-art algorithm.
  3. Motivation
  4. Motivation - Monte Carlo Methods
     • Many inference problems in statistics are analytically intractable (integration, massive discrete sample spaces, ...).
     • We can approximate many quantities by sampling from the distribution and computing sample statistics. This is the Monte Carlo approach.
     • The problem may be as fundamental as computing an expected value:
       E[X] = ∫ x f(x) dx
       Even if the integral has no analytical solution, we can draw from f(x) and compute the sample mean! The more we draw, the better, so we need efficient algorithms.
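As a concrete sketch of the Monte Carlo idea (a toy example of mine, not from the slides), we can estimate E[X] for a distribution whose mean we know, using only the standard library:

```python
import random

def mc_expectation(draw, n=100_000):
    """Approximate E[X] by averaging n independent draws from the distribution."""
    return sum(draw() for _ in range(n)) / n

random.seed(0)
# X ~ Exponential(rate = 2), whose true mean is 1/2.
estimate = mc_expectation(lambda: random.expovariate(2.0))
```

The approximation error shrinks like 1/√n, which is why efficient samplers matter: every extra effective draw buys accuracy.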
  5. Motivation - Approximating a Distribution
     • Consider the problem of choosing a hypothesis Hk from space H.
     • The Bayesian solution is, as always, seductively simple (but only on the face of it). Compute the posterior given the data X!
       π(Hk | X) = π(X | Hk) π(Hk) / π(X)
     • H may be too large to compute all probabilities. But if most of the probabilities are very close to zero, we do not need to! Instead, we draw from π(H | X) and compute empirical frequencies.
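To make this concrete, here is a toy hypothesis space of my own (two coins, small enough to enumerate), with the exact posterior from Bayes' rule next to the sampling approximation the slide describes:

```python
import random
from math import comb

# Toy space H = {"fair", "biased"}; data X = number of heads in 10 flips.
rates = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}
heads, n = 8, 10

def likelihood(h):
    """Binomial likelihood pi(X | H = h)."""
    p = rates[h]
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

# Exact posterior via Bayes' rule: pi(H | X) = pi(X | H) pi(H) / pi(X).
unnorm = {h: likelihood(h) * prior[h] for h in prior}
evidence = sum(unnorm.values())            # pi(X)
posterior = {h: w / evidence for h, w in unnorm.items()}

# When H is too large to enumerate, we instead draw from the posterior
# and estimate probabilities by empirical frequencies.
random.seed(1)
draws = random.choices(list(posterior), weights=list(posterior.values()), k=50_000)
freq_biased = draws.count("biased") / len(draws)
```

Here the sampler draws directly from the known posterior only to show that frequencies converge to probabilities; the point of MCMC is to produce such draws when the posterior cannot be enumerated.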
  6. Motivation - Selecting Models
     • Specific setting: selecting a linear regression model. As always, y = Xβ + ε.
     • Given a design matrix X with d predictors, there are 2^d possible models. Don't even try to look at all of them if d > 20.
     • But if d is large, we definitely need an algorithmic model selection method, like the one above. This is the problem we want to solve!
     • Monte Carlo is the way to go here.
  7. Approach
  8. Approach - Markov Chain Sampling
     H(0) → H(1) → H(2) → · · · , where each step H(k) is drawn from Q(H | H(k−1))
     • Markov Chain Monte Carlo is an umbrella term for algorithms that draw dependent samples from any distribution.
     • The samples are drawn from a Markov process: the current sample H(k) depends on the past only through the previous sample, via the distribution Q(H | H(k−1)).
     • Less dependence is better, because dependence slows down the convergence of sample statistics. We can reduce dependence by setting Q(·) appropriately.
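A minimal Metropolis sampler (the standard algorithm; the target and proposal below are illustrative choices of mine, not the project's) shows how a symmetric Q(·) generates dependent draws from a distribution:

```python
import math
import random

def metropolis(logp, propose, state, n_steps):
    """Metropolis sampler: propose from a symmetric Q, then accept the
    candidate with probability min(1, p(candidate) / p(state))."""
    chain = []
    for _ in range(n_steps):
        candidate = propose(state)
        if random.random() < math.exp(min(0.0, logp(candidate) - logp(state))):
            state = candidate
        chain.append(state)
    return chain

random.seed(0)
# Target on {0, ..., 9} with p(k) proportional to k + 1; symmetric
# random-walk proposal that wraps around the ends.
chain = metropolis(lambda k: math.log(k + 1),
                   lambda k: (k + random.choice([-1, 1])) % 10,
                   state=0, n_steps=200_000)
freq_9 = chain.count(9) / len(chain)   # true probability: 10/55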
  9. Approach - Drawing Models
     • In our specific setting, we draw linear regression models from the posterior π(H | X, y). A model H ∈ {0, 1}^d is defined by the subset of variables in X it includes: Hi is 1 if the i-th variable is included and 0 otherwise.
     • The standard version of Q(·) flips a random element of H. Hence, if a variable was previously included, it will be excluded, and vice versa. This rule is symmetric: a move from H(k) to H(k+1) is as likely as a move in the opposite direction.
     • It also privileges intermediate model sizes. If more variables are excluded than included, flipping a random bit is more likely to include a new variable.
     • This is inappropriate if, according to our prior expectation, the model size should be small or large.
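The bit-flip rule takes only a few lines (a sketch of mine, not the project's code), and it makes the size bias visible: with one of eight variables included, a random flip grows the model with probability 7/8.

```python
import random

def flip_proposal(model):
    """Standard symmetric proposal: flip one inclusion bit chosen at random."""
    i = random.randrange(len(model))
    proposal = list(model)
    proposal[i] = 1 - proposal[i]
    return proposal

random.seed(0)
model = [1, 0, 0, 0, 0, 0, 0, 0]        # 1 of d = 8 variables included
grows = sum(sum(flip_proposal(model)) > sum(model) for _ in range(10_000))
grow_rate = grows / 10_000              # close to 7/8
```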
  10. Approach - Model Size Priors
      Figure 1: The binomial prior on the model size. It peaks around its expected value, which is the prior probability p that any single variable is included times the total number of variables d. In this plot, d is 20 and p is 1/4 (green), 1/2 (red) and 3/4 (blue).
  11. Approach - Setting Q(·)
      • We should set Q(·) such that it is consistent with that (or any) prior.
      • One way of doing that is to set asymmetric probabilities of growing and shrinking the model. If the prior on the model size d_H is given by π(d_H), I propose to increase the model size with probability
        Pr(growth | d_H^(k)) = π(d_H^(k) + 1) / [π(d_H^(k) + 1) + π(d_H^(k) − 1)]
      • This discourages the proposal of models that are much larger than the prior expectation.
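Under the binomial size prior from Figure 1, this rule is straightforward to compute. A sketch of mine (the function names `size_prior` and `growth_prob` are my own, not the project's):

```python
from math import comb

def size_prior(k, d=20, p=0.25):
    """Binomial prior on the model size: Pr(d_H = k)."""
    if k < 0 or k > d:
        return 0.0
    return comb(d, k) * p**k * (1 - p) ** (d - k)

def growth_prob(k, d=20, p=0.25):
    """Asymmetric rule: Pr(grow the model | current size k)."""
    up, down = size_prior(k + 1, d, p), size_prior(k - 1, d, p)
    return up / (up + down)
```

For sizes well below the prior mean d·p the rule favors growing, and well above it, shrinking, which is the behavior plotted in Figure 2.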
  12. Approach - Prior-consistent Q(·)
      Figure 2: The curves show the probability of growing the model for different choices of Q(·). The gray line corresponds to the standard symmetric rule, the colored lines to my asymmetric rule. In this plot, d is 20 and p is 1/4 (green), 1/2 (red) and 3/4 (blue).
  13. Approach - Autocorrelation
      • Sample dependence is the performance criterion for an MCMC algorithm. Less dependence means we're getting more out of every draw.
      • We measure dependence by way of an inclusion indicator's autocorrelation function (ACF).
      • I test the rule with a 20-variable simulation in which a quarter of the variables are relevant. Thus, I set p = 1/4, which corresponds to the green prior.
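The ACF can be estimated directly from any chain of indicator draws. A generic estimator, sketched by me on synthetic series (not the project's simulation):

```python
import random

def acf(series, max_lag):
    """Empirical autocorrelation of a sequence at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return [sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag)) / (n * var)
            for lag in range(max_lag + 1)]

random.seed(0)
iid = [random.random() for _ in range(10_000)]       # independent draws
sticky = [x for x in iid[:1000] for _ in range(10)]  # each value repeated 10x
acf_iid = acf(iid, 5)        # near zero beyond lag 0
acf_sticky = acf(sticky, 5)  # decays slowly: a poorly mixing chain
```

A quickly decaying ACF, as in the red curves of Figure 3, is exactly what the `acf_iid` pattern looks like; slow decay, as in `acf_sticky`, signals wasted draws.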
  14. Approach - Simulations
      Figure 3: The curves show ACFs of the 20 variable inclusion indicators under the standard rule (blue) and the modified rule (red). Most curves decay much more quickly under the modified rule, which translates into more efficient sampling.
  15. Final Remarks
      • In the case above, the effective sample size increased by a factor of up to 4.
      • This is work in progress, and there is no telling whether the rule works better in all situations!
      • If you're interested in using BMA in practice, you can fork the software on my GitHub (working knowledge of Python required!):
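The effective-sample-size gain quoted above can be estimated from the ACF via the usual formula ESS = n / (1 + 2 Σₖ ρₖ). A sketch of mine (the helper `ess` and the truncation at the first non-positive autocorrelation are common heuristics, not the project's exact estimator):

```python
import random

def ess(series, max_lag=100):
    """Effective sample size: n / (1 + 2 * sum of positive autocorrelations),
    truncated at the first non-positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    tau = 1.0
    for lag in range(1, max_lag + 1):
        rho = sum((series[t] - mean) * (series[t + lag] - mean)
                  for t in range(n - lag)) / (n * var)
        if rho <= 0:
            break
        tau += 2 * rho
    return n / tau

random.seed(0)
iid = [random.random() for _ in range(5_000)]
sticky = [x for x in iid[:500] for _ in range(10)]   # strongly autocorrelated
ess_iid = ess(iid)        # close to the nominal sample size
ess_sticky = ess(sticky)  # a small fraction of the nominal size
```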