1. Priors in Bayesian Neural Networks
Tomasz Kuśmierczyk
2022-05-06
Based on:
Wenzel et al.: How Good is the Bayes Posterior in Deep Neural Networks Really?, 2020
Noci et al.: Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect, 2021
Fortuin et al.: Bayesian Neural Network Priors Revisited, 2021
Immer et al.: Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning, 2021
8. Bad prior hypothesis: trade off the relative influence of the prior term against that of the likelihood term, e.g. by varying the data size.
→ If the cold posterior effect (CPE) becomes stronger as the relative influence of the prior increases, this is an indication that the prior is poor.
Noci et al.: Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect, 2021
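For reference (not spelled out on the slide), the tempered-posterior formalization from Wenzel et al. (2020), which Noci et al. build on, makes this trade-off explicit:

```latex
% Cold/tempered posterior at temperature T; T = 1 recovers the Bayes posterior,
% T < 1 gives a "cold" (sharper) posterior. The CPE is the observation that
% T < 1 often predicts better than T = 1.
p_T(\theta \mid \mathcal{D}) \propto \exp\big(-U(\theta)/T\big),
\qquad
U(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) - \log p(\theta).
```

Since the likelihood term grows with the number of observations n while the prior term does not, subsampling the data is one way to increase the relative influence of the prior.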
9. Empirical distribution of the weights of a DNN trained with SGD (no prior)
Fortuin et al.: Bayesian Neural Network Priors Revisited, 2021
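A minimal sketch (not Fortuin et al.'s code) of how one might check such trained weights for heavy tails; `model` is assumed to be any trained PyTorch module:

```python
import torch

def weight_excess_kurtosis(model: torch.nn.Module) -> float:
    """Excess kurtosis of all weight entries: roughly 0 for Gaussian
    weights, clearly positive for heavier-than-Gaussian tails."""
    w = torch.cat([p.detach().flatten()
                   for name, p in model.named_parameters() if "weight" in name])
    w = (w - w.mean()) / w.std()
    return (w.pow(4).mean() - 3.0).item()
```

Fortuin et al. report that SGD-trained weight marginals tend to be heavier-tailed than a Gaussian, which motivates heavier-tailed (e.g. Student-t or Laplace) priors over the default isotropic Gaussian.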
14. Marginal likelihood optimization / type-II MLE / empirical Bayes
Model selection (e.g. choosing priors) by maximizing the log marginal likelihood (ML):
● For a fixed model, it can be estimated using a Laplace approximation with the generalized Gauss-Newton (GGN) approximation to the Hessian
→ Alternate between model (hyperparameter) updates and refitting the approximation
Will the approximation capture differences between priors?
Immer et al.: Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning, 2021
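A minimal sketch of the alternating scheme, assuming the laplace-torch library (from the same line of work as Immer et al.); the toy model, synthetic data, step counts, and learning rates are all placeholder assumptions:

```python
import torch
import torch.nn.functional as F
from laplace import Laplace  # pip install laplace-torch

# Toy classifier and synthetic data, just to make the sketch self-contained.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
n_data = X.shape[0]
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32)

log_prior_prec = torch.zeros(1, requires_grad=True)  # log prior precision (hyperparameter)
hyper_opt = torch.optim.Adam([log_prior_prec], lr=1e-1)
inner_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for outer_step in range(10):
    # (1) MAP-style updates of the weights under the current Gaussian prior;
    # the prior term is divided by n_data since cross_entropy averages per batch.
    prior_prec = log_prior_prec.exp().detach()
    for xb, yb in train_loader:
        inner_opt.zero_grad()
        loss = F.cross_entropy(model(xb), yb) \
            + 0.5 * prior_prec / n_data * sum((p ** 2).sum() for p in model.parameters())
        loss.backward()
        inner_opt.step()

    # (2) refit the Laplace approximation (Kronecker-factored GGN Hessian)
    la = Laplace(model, 'classification',
                 subset_of_weights='all', hessian_structure='kron')
    la.fit(train_loader)

    # (3) gradient step on the Laplace estimate of the log marginal likelihood,
    # which laplace-torch exposes as differentiable w.r.t. the prior precision
    hyper_opt.zero_grad()
    neg_marglik = -la.log_marginal_likelihood(log_prior_prec.exp())
    neg_marglik.backward()
    hyper_opt.step()
```

Whether this captures differences between priors is limited by the approximation itself: the Laplace-GGN marginal likelihood sees the prior only through the MAP estimate and the Gaussian curvature around it.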
15. Posterior learning and learning priors: VI
● Assume the posterior approximation comes from some (parametric) family of distributions
● Maximize the ELBO w.r.t. its parameters λ
Will the posterior capture differences between priors? How do we learn so that complex priors are accounted for?
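For reference, the ELBO for a variational family q_λ makes explicit where the prior enters:

```latex
\mathcal{L}(\lambda)
  = \mathbb{E}_{q_\lambda(\theta)}\big[\log p(\mathcal{D} \mid \theta)\big]
  - \mathrm{KL}\big(q_\lambda(\theta) \,\|\, p(\theta)\big)
  \le \log p(\mathcal{D}).
```

The prior p(θ) appears only in the KL term, so a complex (e.g. heavy-tailed or hierarchical) prior is accounted for only insofar as the KL against it can be computed or estimated for the chosen family.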
16. ● MCMC, e.g. SGHMC:
○ explores the parameter space and generates a set of samples {θ_s} from the posterior
○ assumes a fixed energy function, for example U(θ) = -∑_i log p(y_i | x_i, θ) - log p(θ)
Parametric priors cannot be learned this way, but we can consider hierarchical priors:
(Nalisnick et al.: Predictive Complexity Priors, 2020): optimize the KL divergence to the predictive distribution of a reference model, for a hierarchical prior
MCMC?
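A minimal sketch of one simplified SGHMC update (Chen et al., 2014) for the fixed energy U(θ) above; the identity mass matrix, constant step size, and the friction value are simplifying assumptions, and `grads_U` would be a minibatch estimate of ∇U(θ):

```python
import torch

def sghmc_step(params, momenta, grads_U, lr=1e-4, friction=0.05):
    # Per parameter tensor:
    #   v     <- (1 - friction) * v - lr * grad_U + N(0, 2 * friction * lr)
    #   theta <- theta + v
    with torch.no_grad():
        for theta, v, g in zip(params, momenta, grads_U):
            noise = torch.randn_like(theta) * (2.0 * friction * lr) ** 0.5
            v.mul_(1.0 - friction).add_(g, alpha=-lr).add_(noise)
            theta.add_(v)
```

Because U(θ) is fixed, the prior's hyperparameters cannot be optimized inside this loop; under a hierarchical prior they would instead be treated as additional coordinates of θ and sampled jointly.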