Spatially Informed Variable Selection Priors and
Applications to Large-Scale Data
Marina Vannucci
Department of Statistics
Rice University
Houston, TX
USA
April 2019 - SAMSI
Outline of the Talk
Variable selection via spike-and-slab priors
Accounting for spatial information
Nonparametric constructions
Edge selection in graphical models
Single graphs
Multiple graphs
Beyond Gaussian data - count data
Computational issues - Variational Inference
Motivating applications: High-throughput genomics;
neuroimaging.
Spike-and-slab Variable Selection Prior for Regression Models
Yn×1 = Xn×p βp×1 + ε,    ε ∼ N(0, σ² I)
Introduce latent variable γ = (γ1, . . . , γp) to select variables
γj = 1 if variable j included in model
γj = 0 otherwise
Specify priors for model parameters:
βj ∼ (1 − γj) δ0 + γj N(0, σ² τj²),    σ² ∼ IG(ν/2, λ/2)
p(γ) = ∏_{j=1}^{p} ω^γj (1 − ω)^(1−γj)
Combine data and prior information into a posterior distribution
⇒ interest in posterior distribution p(γ|X, Y).
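As a concrete illustration, here is a minimal Python sketch that draws (σ², γ, β) from the spike-and-slab specification above; the dimension and hyperparameter values (p, ω, τ, ν, λ) are hypothetical choices for illustration only.

import numpy as np

rng = np.random.default_rng(0)
p, omega, tau, nu, lam = 10, 0.2, 1.0, 5.0, 5.0          # hypothetical hyperparameters
sigma2 = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / lam)  # sigma^2 ~ IG(nu/2, lambda/2)
gamma = rng.binomial(1, omega, size=p)                   # gamma_j ~ Bernoulli(omega)
slab = rng.normal(0.0, np.sqrt(sigma2) * tau, size=p)    # slab component N(0, sigma^2 tau_j^2)
beta = np.where(gamma == 1, slab, 0.0)                   # spike delta_0 when gamma_j = 0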
Posterior Inference via MCMC
Use Metropolis as stochastic search. At each iteration
(i) Add or Delete one variable in γold
(ii) Swap a 0 and a 1 in γold
The proposed γnew is accepted with probability
min { p(γnew | X, Y) / p(γold | X, Y), 1 }.
George & McCulloch (1993,1997); Brown et al. (1998 & 2002)
Sample (β, γ) jointly if marginalization not possible
Gottardo & Raftery (2008 JCGS); Savitsky et al. (2011 Stat Science)
Select variables in the “best” models or from largest
marginal posterior probabilities of inclusion (PPIs) via
median model or Bayesian FDR thresholds.
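The add/delete/swap search can be sketched as below, assuming a user-supplied function log_post(gamma) returning log p(γ | X, Y) up to a constant (e.g., log marginal likelihood plus log prior); the function name and defaults are placeholders, not code from the talk. The returned inclusion frequencies approximate the marginal PPIs used for median-model or Bayesian-FDR selection.

import numpy as np

def add_delete_swap_search(log_post, p, n_iter=5000, seed=0):
    # Metropolis stochastic search over gamma in {0,1}^p with add/delete/swap moves.
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=int)
    lp = log_post(gamma)
    ppi = np.zeros(p)                                    # running estimate of the PPIs
    for _ in range(n_iter):
        prop = gamma.copy()
        ones, zeros = np.flatnonzero(gamma == 1), np.flatnonzero(gamma == 0)
        if len(ones) > 0 and len(zeros) > 0 and rng.random() < 0.5:
            prop[rng.choice(ones)], prop[rng.choice(zeros)] = 0, 1   # swap a 0 and a 1
        else:
            j = rng.integers(p)
            prop[j] = 1 - prop[j]                                    # add or delete one variable
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:                      # accept w.p. min(ratio, 1)
            gamma, lp = prop, lp_prop
        ppi += gamma
    return ppi / n_iter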
Functional Magnetic Resonance Imaging (fMRI)
[Figure sources: www.anc.ed.ac.uk, photos.uc.wisc.edu]
Indirect measure of brain activity as changes in blood flow.
Collected at rest or during a sensorimotor task.
Observed data - time series of the blood oxygenation level
dependent (BOLD) response, at each voxel in the brain.
Model for Single Subject
Yν = Xνβν + εν, ν = 1, . . . , V
fMRI = BOLD Signal + Noise
Yν = (Yν1, . . . , YνT )T time-series data for voxel ν.
Xν the convolved design matrix T × p (p stimuli).
Convolution of stimulus function with HRF (e.g., Poisson
with parameter λν) to account for time lapse between
stimulus and vascular response.
βν = (βν1, . . . , βνp)T regression coefficients vector
εν ∼ NT (0, Σν), noise caused by hardware and subjects
Spatial Modeling
Spike-and-slab prior on βν to select activated voxels (p = 1)
βν ∼ γνN(0, τ) + (1 − γν)δ0, ν = 1, . . . , V,
MRF prior on voxel-dependent parameter γν
(Zhang et al., 2014 NeuroImage)
P(γν | d, e, γk, k ∈ Nν) ∝ exp(γν (d + e Σ_{k∈Nν} γk))
with Nν the set of neighbors of voxel ν, d ∈ R, e > 0.
Probit priors
(Chiang et al., 2017 Human Brain Mapping)
p(γν = 1) = Φ (α0 + α1Nν)
with Nν a set of covariates (e.g., multi-modal image data).
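To make the MRF prior concrete, the sketch below evaluates the full conditional P(γν = 1 | γk, k ∈ Nν), which is logistic in d + e Σ_{k∈Nν} γk; the neighborhood structure and the values of d (sparsity) and e (smoothness) are hypothetical.

import numpy as np

def mrf_activation_probs(gamma, neighbors, d=-2.0, e=0.5):
    # gamma: 0/1 array over voxels; neighbors: list of index lists, one per voxel.
    probs = np.empty(len(gamma))
    for nu, nbrs in enumerate(neighbors):
        a = d + e * gamma[nbrs].sum()          # exponent of the MRF full conditional
        probs[nu] = 1.0 / (1.0 + np.exp(-a))   # P(gamma_nu = 1 | neighbors)
    return probs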
Spatial Modeling - Nonparametric Prior
Objective: capture correlation among time series data of
possibly distant voxels
Proposed prior: spike-and-slab nonparametric prior,
βν|γν, G0 ∼ γνG0 + (1 − γν)δ0
with G0 a Dirichlet process prior with N(0, τ) as centering
distribution.
Achieves clustering of voxels within subject according to
similar activation strengths.
Particularly useful for multi-subject modeling
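A small sketch of a draw from this spiked Dirichlet process prior, using a truncated stick-breaking representation of G0 with a N(0, τ) base measure; the concentration parameter, truncation level, and τ are hypothetical. Ties among the sampled atoms are what cluster voxels with similar activation strengths.

import numpy as np

def draw_spiked_dp(gamma, alpha=1.0, tau=1.0, trunc=50, seed=0):
    # gamma: 0/1 array over the V voxels; returns beta_1, ..., beta_V.
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=trunc)                      # stick-breaking fractions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.normal(0.0, np.sqrt(tau), size=trunc)         # atoms from the N(0, tau) base
    labels = rng.choice(trunc, size=len(gamma), p=w / w.sum())
    return np.where(gamma == 1, atoms[labels], 0.0)           # point mass at zero if gamma_nu = 0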
Regression Model for Multiple Subjects
Extend the single-subject regression model to multiple subjects
Yiν = Xiνβiν + εiν, εiν ∼ NT (0, Σiν)
Yiν = (Yiν1, . . . , YiνT )T , T × 1 BOLD response data for the
νth voxel in the ith subject
Xiν, T × p design matrix, with Poisson HRF
βiν, p × 1 vector of regression coefficients (spatial prior)
Hierarchical Dirichlet process as slab distribution on βiν
Spatial Nonparametric Prior
Objective: capture correlation within and across subjects.
Proposed prior: spike-and-slab nonparametric prior,
βiν|γiν, Gi ∼ γiνGi + (1 − γiν)δ0
Clustering of subjects based on similar activation patterns;
clustering of voxels within each subject according to similar
activation strengths.
[Figure: clustering of Subjects 1–5 by activation patterns]
(Zhang et al., 2016 Annals of Applied Statistics)
Posterior Inference
Standard approach: MCMC
MH updates to jointly update (β, γ) with a combination of
add-delete-swap steps and sampling-based methods for
DP models (based on the Chinese restaurant franchise)
Update λ with Metropolis-Hastings (MH) schemes
Update other model parameters
Faster strategy: Variational Inference (Matlab GUI)
Combine VI schemes with importance sampling steps on
other model parameters (Carbonetto and Stephens, 2012)
Posterior maps on γiν, βiν, λiν’s
Software: NPBayes-fMRI GUI (https://github.com/marinavannucci)
[Figure: RT to correct trials (ms) for Nonwords vs. Pseudohomophones in Cluster 1 (n = 12) and Cluster 2 (n = 16), and Pseudohomophone effect (ms) by cluster; * p < .05, ** p < .001]
Fischer-Baum et al. (2018). Individual differences in the neural and cognitive
mechanisms of single word reading. Frontiers in Human Neuroscience, 12:271.
(Kook et al., 2019 Statistics in Biosciences)
Edge Selection in Graphical Models
Undirected Graphical Models
Use a graph structure to
summarize conditional
independence between variables
Each node represents a variable
No edge exists between two nodes if and only if the
corresponding variables are conditionally independent
given all the others
Fundamental assumption
True conditional dependence structure is reasonably
sparse
Gaussian Graphical Model (GGM)
Data follow a joint multivariate normal distribution
xl ∼ N(µ, Ω⁻¹),    l = 1, . . . , n
ωi,j = 0 if no edge exists
between nodes i and j in
graph G
For simplicity, we
column-center the data to
assume µ = 0
Choices for Edge Selection Priors
G-Wishart priors enforce exact zeros but are limited in
scalability (Roverato 2002, Dobra et al. 2011)
Continuous shrinkage prior of Wang (BA 2015)
p(Ωk | G, v0, v1, λ) ∝ ∏_{i<j} N(ωi,j | 0, v²_{gij}) ∏_i Exp(ωi,i | λ/2)
defined by introducing binary latent variables of
edge inclusion, G ≡ (gi,j)_{i<j} ∈ {0, 1}^{p(p−1)/2}
Avoid use of priors over the space of positive definite
matrices with fixed zeroes
Improve computational efficiency
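As an illustration, the sketch below evaluates the (unnormalized) log density of this prior for a given precision matrix and edge-indicator matrix, with standard deviation v0 for excluded edges and v1 for included ones; the hyperparameter values are hypothetical and the normalizing constant restricting Ω to positive definite matrices is ignored.

import numpy as np

def log_shrinkage_prior(Omega, G, v0=0.02, v1=1.0, lam=1.0):
    # Omega: p x p precision matrix; G: p x p symmetric 0/1 edge-indicator matrix.
    p = Omega.shape[0]
    iu = np.triu_indices(p, k=1)
    v = np.where(G[iu] == 1, v1, v0)                          # slab sd if edge included, else spike sd
    off = -0.5 * (Omega[iu] / v) ** 2 - np.log(v) - 0.5 * np.log(2 * np.pi)
    diag = np.log(lam / 2.0) - (lam / 2.0) * np.diag(Omega)   # Exp(rate = lambda/2) on the diagonal
    return off.sum() + diag.sum()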
Bayesian Framework for Multiple Graphs Estimation
xk,l ∼ N(µk, Ωk⁻¹),    l = 1, . . . , nk,   k = 1, . . . , K
Graphical model for each group over common set of
variables
xk,l, l = 1, . . . , nk data for subject l in group k
Ωk = Σk⁻¹ = (ωi,j,k) is the precision matrix for group k
May share common features, or could differ substantially
Linking Networks with an MRF Prior
Encourages selection of common edges for similar groups
p(gij | νij, Θ) = exp(νij 1ᵀ gij + gijᵀ Θ gij) / C(νij, Θ)
with gij = (g1ij, . . . , gKij) the indicators of edge (i, j) across the K graphs
Spike-and-slab prior on
off-diagonal elements of
K × K matrix Θ
Nonzero entries indicate
network relatedness
Magnitude measures
pairwise graph similarity
Flexibility of model
Shares information when appropriate
Does not enforce similarity unless supported by data
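The sketch below evaluates the unnormalized MRF prior above for one edge, together with the resulting full conditional for a single group indicator g_kij, which is logistic in νij + 2 Σ_{m≠k} θkm g_mij when Θ is symmetric with zero diagonal; the function names are placeholders.

import numpy as np

def mrf_edge_logprob(g_ij, nu_ij, Theta):
    # Unnormalized log p(g_ij | nu_ij, Theta) = nu_ij * 1'g_ij + g_ij' Theta g_ij.
    g = np.asarray(g_ij, dtype=float)
    return nu_ij * g.sum() + g @ Theta @ g

def mrf_edge_conditional(k, g_ij, nu_ij, Theta):
    # P(g_kij = 1 | other groups), assuming Theta symmetric with zero diagonal.
    g = np.asarray(g_ij, dtype=float)
    a = nu_ij + 2.0 * (Theta[k] @ g - Theta[k, k] * g[k])
    return 1.0 / (1.0 + np.exp(-a))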
Selection Prior for Network Similarity
Spike and slab prior on off-diagonal entries of Θ
p(θkm | γkm) = (1 − γkm) δ0 + γkm (β^α / Γ(α)) θkm^(α−1) exp(−β θkm)
Spike-and-slab prior on parameters measuring network
relatedness, to allow shared structure to be favored only when
supported by the data
Fixed hyperparameters α and β
Latent indicator variable γkm of graph relatedness
Independent Bernoulli priors defined on latent indicators
with probability w ∈ [0, 1]
Completing the Model
Edge specific prior for edge inclusion probability
p(νij) = (1 / B(a, b)) · exp(a νij) / (1 + exp(νij))^(a+b)
Can encourage sparsity of the graphs G1, . . . , GK
Can be used to incorporate prior knowledge of connections
between nodes
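For intuition, p(νij) above is the density of the logit of a Beta(a, b) random variable, so expit(νij) is the prior edge-inclusion probability and choosing a < b favors sparse graphs; a minimal sketch with hypothetical (a, b):

import numpy as np
from scipy.special import betaln

def log_p_nu(nu, a=1.0, b=4.0):
    # log p(nu_ij) = a*nu - log B(a, b) - (a + b) * log(1 + exp(nu))
    return a * nu - betaln(a, b) - (a + b) * np.log1p(np.exp(nu))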
Diagram of the Joint Graph Model
Peterson et al. (2015, JASA); Shaddox et al. (2019, Biostatistics)
Posterior Inference
MCMC Sampling Scheme
At each iteration t of our algorithm, sample from posterior full
conditionals
(i) Update graph Gk^(t) and precision matrix Ωk^(t) for k = 1, . . . , K
Block Gibbs sampler
p-coupled stochastic search variable selection algorithm
(ii) Update network relatedness parameters θkm and γkm for
1 ≤ k < m ≤ K using Metropolis-Hastings steps
Incorporates both between-model and within-model moves
(iii) Update νij for 1 ≤ i < j ≤ p using a standard
Metropolis-Hastings step
Model Selection
Posterior marginal probabilities of edge inclusion (PPIs)
[Figure: estimated networks over 25 nodes for the Control, Mild, Moderate, and Severe groups]
p = 100        TPR (SE)         FPR (SE)         MCC (SE)         AUC (SE)
Fused GLasso   0.8658 (.0479)   0.2620 (.2632)   0.2775 (.0940)   0.9688 (.0016)
Group GLasso   0.8496 (.0095)   0.1739 (.0028)   0.3089 (.0042)   0.8982 (.0048)
Our Method     0.7135 (.2929)   0.0053 (.0037)   0.7453 (.2334)   0.9578 (.0434)
ΘM = [ ·  .95  .97 ;  ·  .95 ;  · ]   (upper triangle shown)
Pathway Platform Total Pairs 100 110 011 001 Total Disrupted
FcγR Metabolites 73 17 5 3 18 43
(Shaddox et al., 2019 Biostatistics)
Beyond Gaussian Data
Linear models
Earlier work in GLM/GLMM models (Raftery 1996; Chen &
Dunson, 2003; Kinney & Dunson, 2007)
Data augmentation techniques for logit and log links
(Polson et al. (2013) and Zhu et al. (2012))
Recent interest in (zero-inflated) negative binomial and
Dirichlet-multinomial regression models for large-scale
genomic data (e.g. microbiome)
(Wadsworth et al. 2017, Koslovsky and Vannucci 2019)
Dirichlet-Multinomial Regression
Multivariate counts modeled as Multinomial(y+, φ), with
y+ = Σ_{j=1}^{q} yj (e.g., q microbial taxa), and φ ∼ Dirichlet(γ)
Integrate p covariates as log(γj(xi)) = xiθj
Obtain hierarchical Dirichlet-multinomial log-linear
regression
Yi | φi ∼ Multinomial(y+, φi), independently
φi | α, β1, . . . , βp ∼ Dirichlet( {exp(αj + Σ_{k=1}^{p} βkj xik)}_{j=1,...,q} ), independently
with φi the taxa-specific proportions.
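A sketch of the resulting Dirichlet-multinomial log-likelihood under the log-linear link (up to the multinomial coefficient, which does not depend on the parameters); the array names and shapes are assumptions for illustration, with B[k, j] set to zero for unselected covariate/taxon pairs.

import numpy as np
from scipy.special import gammaln

def dm_loglik(Y, X, alpha, B):
    # Y: n x q count matrix; X: n x p covariates; alpha: length-q intercepts; B: p x q coefficients.
    G = np.exp(alpha[None, :] + X @ B)                 # Dirichlet parameters gamma_j(x_i)
    Gsum, ysum = G.sum(axis=1), Y.sum(axis=1)
    ll = (gammaln(Gsum) - gammaln(Gsum + ysum)
          + (gammaln(Y + G) - gammaln(G)).sum(axis=1))
    return ll.sum()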
Variable Selection Prior
Spike-and-slab selects relevant taxa/predictor associations
βkj | ξkj ∼ ξkj N(0, t²kj) + (1 − ξkj) δ0(βkj)
Stochastic search MCMC for posterior inference
1 Draw auxiliary variables for DM(γj (x))
2 Update α using Metropolis–Hastings
3 Update (ξ, β) jointly via SSVB MCMC (Savitsky et al., 2011)
[Figure: taxa/pathway associations selected at MPPI > 0.981. Positive associations link KEGG pathways (e.g., ko00920, ko00513, ko00260) with taxa including f.Prevotellaceae, g.Parabacteroides, g.Prevotella, o.Bacteroidales, g.Roseburia, g.Subdoligranulum, g.Bacteroides, g.Ruminococcus; negative associations link pathways (e.g., ko00680, ko00512, ko05200) with taxa including g.Sutterella, g.Prevotella, g.Dialister, g.Parabacteroides, g.Bacteroides, g.Ruminococcus, o.Bacteroidales.]
Wadsworth et al. (2017, BMC Bioinformatics); Koslovsky and Vannucci
(2019, submitted) for Dirichlet-tree multinomial distributions
Variational Inference
Variational inference turns inference into an optimization
problem. Faster and more scalable than MCMC.
Underlying idea: pick a family of distributions q(β, γ) ∈ Q,
with free variational parameters; use gradient descent to
minimize the KL divergence between q and the posterior f(β, γ)
q* = arg min_{q∈Q} KL(q ‖ f) = ∫ q(β, γ) log [ q(β, γ) / f(β, γ) ] dβ dγ
   = log p(y | X) − ELBO
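The identity log p(y | X) = ELBO + KL(q ‖ posterior) can be checked numerically on a toy conjugate model where the marginal likelihood is available in closed form; the model and the variational parameters below are invented purely for illustration.

import numpy as np

# Toy model: beta ~ N(0, 1), y | beta ~ N(beta, 1), single observation y.
rng = np.random.default_rng(1)
y, m, s = 0.7, 0.1, 0.8                               # data and free variational parameters of q = N(m, s^2)
draws = rng.normal(m, s, size=200_000)                # Monte Carlo over q
log_joint = -0.5 * (y - draws) ** 2 - 0.5 * draws ** 2 - np.log(2 * np.pi)
log_q = -0.5 * ((draws - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
elbo = np.mean(log_joint - log_q)                     # Monte Carlo estimate of the ELBO
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0   # y ~ N(0, 2) marginally
print(elbo, log_evidence, log_evidence - elbo)        # the gap is KL(q || posterior) >= 0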
Mean Field VI
Fully factorized approximation to reduce complexity
q(β, γ | θ) = ∏_{k=1}^{p} q(βk, γk ; θk)
Choose approximating distributions from same family as
prior distributions, to exploit conjugacy.
Combine with importance sampling schemes on other
model parameters (Zhang et al., 2016 AOAS)
Extend to GLM models with logit and log links by
incorporating data augmentation schemes of Polson et al.
(2013) and Zhu et al. (2012) for logistic and Negative
Binomial regression models.
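A minimal sketch of fully factorized coordinate-ascent updates for spike-and-slab linear regression, in the spirit of Carbonetto and Stephens (2012): q(βk, γk) has q(γk = 1) = αk and q(βk | γk = 1) = N(µk, s²k). Here σ², the slab scale sa, and the prior log-odds are held fixed (in practice they are handled with the importance sampling step mentioned above); the update formulas are a reconstruction and should be checked against the reference.

import numpy as np

def mean_field_spike_slab(X, y, sigma2=1.0, sa=1.0, logodds=0.0, n_iter=100):
    p = X.shape[1]
    xy, d = X.T @ y, (X ** 2).sum(axis=0)
    s2 = sigma2 / (d + 1.0 / sa)                       # q-variance of beta_k given gamma_k = 1
    alpha, mu = np.full(p, 0.5), np.zeros(p)
    Xr = X @ (alpha * mu)                              # current fit sum_j alpha_j mu_j x_j
    for _ in range(n_iter):
        for k in range(p):
            r_k = alpha[k] * mu[k]
            mu[k] = s2[k] / sigma2 * (xy[k] - X[:, k] @ Xr + d[k] * r_k)
            z = logodds + 0.5 * np.log(s2[k] / (sa * sigma2)) + mu[k] ** 2 / (2 * s2[k])
            alpha[k] = 1.0 / (1.0 + np.exp(-z))        # variational PPI of variable k
            Xr += X[:, k] * (alpha[k] * mu[k] - r_k)   # update the fit in place
    return alpha, mu, s2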
Summary and Conclusions
Spike-and-slab priors for variable selection are well suited
for applications.
Flexible structure for the incorporation of external
information via MRF or probit priors or spiked DP
constructions.
Priors for edge selection in Gaussian graphical models.
Methodologies can be extended beyond Gaussian data.
Variational inference approximations for scalability.
Neuroimaging and genomics as emerging areas for
applications.
Improved performance over competitive approaches.
THANKS!
Eric Kook, PhD student, Rice University.
Yinsen Miao, PhD student, Rice University.
Nathan Osborne, PhD student, Rice University.
Matthew Koslovsky, postdoc, Rice University.
Elin Shaddox, PhD 2019, postdoc at UC Denver
Qiwei Li, Assistant Professor, UT Dallas.
Duncan Wadsworth, Data Scientist, Microsoft, Seattle.
Terrance Savitsky, Bureau of Labor Statistics, D.C.
Alberto Cassese, Assistant Professor, Maastricht, The Netherlands.
Christine Peterson, Assistant Professor, MD Anderson.
Francesco Stingo, Assistant Professor, U. Florence, Italy.
Michele Guindani, Professor, UC Irvine.
HAMILL INNOVATION AWARD

More Related Content

What's hot

MNAR
MNARMNAR
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
Christian Robert
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
Christian Robert
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in Statistics
Christian Robert
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
tuxette
 
TS-IASSL2014
TS-IASSL2014TS-IASSL2014
TS-IASSL2014
ychaubey
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABC
Christian Robert
 
A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...
Konstantinos Giannakis
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
Christian Robert
 
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
The Statistical and Applied Mathematical Sciences Institute
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
tuxette
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...
tuxette
 
Boston talk
Boston talkBoston talk
Boston talk
Christian Robert
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Reliable ABC model choice via random forests
Reliable ABC model choice via random forestsReliable ABC model choice via random forests
Reliable ABC model choice via random forests
Christian Robert
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
Frank Nielsen
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach
Jae-kwang Kim
 
A Note on Confidence Bands for Linear Regression Means-07-24-2015
A Note on Confidence Bands for Linear Regression Means-07-24-2015A Note on Confidence Bands for Linear Regression Means-07-24-2015
A Note on Confidence Bands for Linear Regression Means-07-24-2015
Junfeng Liu
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
Christian Robert
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
NBER
 

What's hot (20)

MNAR
MNARMNAR
MNAR
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in Statistics
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
TS-IASSL2014
TS-IASSL2014TS-IASSL2014
TS-IASSL2014
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABC
 
A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
 
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...
 
Boston talk
Boston talkBoston talk
Boston talk
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Reliable ABC model choice via random forests
Reliable ABC model choice via random forestsReliable ABC model choice via random forests
Reliable ABC model choice via random forests
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach
 
A Note on Confidence Bands for Linear Regression Means-07-24-2015
A Note on Confidence Bands for Linear Regression Means-07-24-2015A Note on Confidence Bands for Linear Regression Means-07-24-2015
A Note on Confidence Bands for Linear Regression Means-07-24-2015
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 

Similar to MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Variable Selection Priors and Applications to Large-scale Data (revised), Marina Vannucci, April 30, 2019

Quantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural NetworksQuantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural Networks
Valentin De Bortoli
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
Frank Nielsen
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
Christian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
Christian Robert
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
The Statistical and Applied Mathematical Sciences Institute
 
Cluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & DissimilarityCluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & Dissimilarity
ShivarkarSandip
 
Combining Data in Species Distribution Models
Combining Data in Species Distribution ModelsCombining Data in Species Distribution Models
Combining Data in Species Distribution Models
Bob O'Hara
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
butest
 
Methods for High Dimensional Interactions
Methods for High Dimensional InteractionsMethods for High Dimensional Interactions
Methods for High Dimensional Interactions
sahirbhatnagar
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
Christian Robert
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminar
gvesom
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
Förderverein Technische Fakultät
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
Valentin De Bortoli
 
Tree net and_randomforests_2009
Tree net and_randomforests_2009Tree net and_randomforests_2009
Tree net and_randomforests_2009
Matthew Magistrado
 
Refining Bayesian Data Analysis Methods for Use with Longer Waveforms
Refining Bayesian Data Analysis Methods for Use with Longer WaveformsRefining Bayesian Data Analysis Methods for Use with Longer Waveforms
Refining Bayesian Data Analysis Methods for Use with Longer Waveforms
James Bell
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
The Statistical and Applied Mathematical Sciences Institute
 
Probability distributions for ml
Probability distributions for mlProbability distributions for ml
Probability distributions for ml
Sung Yub Kim
 
KAUST_talk_short.pdf
KAUST_talk_short.pdfKAUST_talk_short.pdf
KAUST_talk_short.pdf
Chiheb Ben Hammouda
 
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
The Statistical and Applied Mathematical Sciences Institute
 
Bayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear modelsBayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear models
Caleb (Shiqiang) Jin
 

Similar to MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Variable Selection Priors and Applications to Large-scale Data (revised), Marina Vannucci, April 30, 2019 (20)

Quantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural NetworksQuantitative Propagation of Chaos for SGD in Wide Neural Networks
Quantitative Propagation of Chaos for SGD in Wide Neural Networks
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Cluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & DissimilarityCluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & Dissimilarity
 
Combining Data in Species Distribution Models
Combining Data in Species Distribution ModelsCombining Data in Species Distribution Models
Combining Data in Species Distribution Models
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
Methods for High Dimensional Interactions
Methods for High Dimensional InteractionsMethods for High Dimensional Interactions
Methods for High Dimensional Interactions
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminar
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
Tree net and_randomforests_2009
Tree net and_randomforests_2009Tree net and_randomforests_2009
Tree net and_randomforests_2009
 
Refining Bayesian Data Analysis Methods for Use with Longer Waveforms
Refining Bayesian Data Analysis Methods for Use with Longer WaveformsRefining Bayesian Data Analysis Methods for Use with Longer Waveforms
Refining Bayesian Data Analysis Methods for Use with Longer Waveforms
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Probability distributions for ml
Probability distributions for mlProbability distributions for ml
Probability distributions for ml
 
KAUST_talk_short.pdf
KAUST_talk_short.pdfKAUST_talk_short.pdf
KAUST_talk_short.pdf
 
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
 
Bayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear modelsBayesian hybrid variable selection under generalized linear models
Bayesian hybrid variable selection under generalized linear models
 

More from The Statistical and Applied Mathematical Sciences Institute

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
The Statistical and Applied Mathematical Sciences Institute
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
The Statistical and Applied Mathematical Sciences Institute
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
The Statistical and Applied Mathematical Sciences Institute
 

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
simonomuemu
 
Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
NgcHiNguyn25
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
Celine George
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
eBook.com.bd (প্রয়োজনীয় বাংলা বই)
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
National Information Standards Organization (NISO)
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 

Recently uploaded (20)

Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.Types of Herbal Cosmetics its standardization.
Types of Herbal Cosmetics its standardization.
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 

MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Variable Selection Priors and Applications to Large-scale Data (revised), Marina Vannucci, April 30, 2019

  • 8. Spatial Modeling - Nonparametric Prior
  Objective: capture correlation among time series data of possibly distant voxels.
  Proposed prior: spike-and-slab nonparametric prior,
      βν | γν, G0 ∼ γν G0 + (1 − γν) δ0,
  with G0 a Dirichlet process prior with N(0, τ) as centering distribution.
  Achieves clustering of voxels within subject according to similar activation strengths.
  Particularly useful for multi-subject modeling. A sampling sketch of this prior follows below.
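A minimal sampling sketch of this spiked Dirichlet process prior, using a truncated stick-breaking approximation to the DP with an N(0, τ) base measure; the number of voxels, truncation level, and inclusion probability below are illustrative choices, not values from the talk.

```python
# Minimal sketch: draw voxel-level coefficients from the spike-and-slab DP prior,
# beta_v | gamma_v, G0 ~ gamma_v * G0 + (1 - gamma_v) * delta_0,
# using a truncated stick-breaking approximation to the DP with N(0, tau) base measure.
import numpy as np

rng = np.random.default_rng(0)

V, tau, conc, omega, H = 1000, 1.0, 1.0, 0.3, 50   # voxels, slab var, DP concentration, P(gamma=1), truncation

# Stick-breaking weights and atoms of the (truncated) DP slab G0
v = rng.beta(1.0, conc, size=H)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # pi_h = v_h * prod_{l<h} (1 - v_l)
atoms = rng.normal(0.0, np.sqrt(tau), size=H)                # atoms drawn from the N(0, tau) base

gamma = rng.binomial(1, omega, size=V)                       # voxel selected or not
labels = rng.choice(H, size=V, p=w / w.sum())                # cluster label for each voxel
beta = np.where(gamma == 1, atoms[labels], 0.0)              # spike at zero if gamma_v = 0

# Selected voxels share a small number of distinct activation strengths (clustering effect)
print("selected voxels:", gamma.sum(), "distinct nonzero values:", np.unique(beta[gamma == 1]).size)
```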
  • 9. Regression Model for Multiple Subjects
  Extend the single-subject regression model to multiple subjects:
      Yiν = Xiν βiν + εiν,   εiν ∼ NT(0, Σiν)
  Yiν = (Yiν1, . . . , YiνT)T, T × 1 BOLD response data for the νth voxel in the ith subject.
  Xiν, T × p design matrix, with Poisson HRF.
  βiν, p × 1 vector of regression coefficients (spatial prior).
  Hierarchical Dirichlet process as slab distribution on βiν. A simulation sketch of the design follows below.
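A small simulation sketch of this design for one voxel of one subject: a boxcar stimulus is convolved with a Poisson-shaped HRF to form the design column, then noise is added. Dimensions and parameter values are illustrative only.

```python
# Minimal simulation sketch of the single-voxel design: a boxcar stimulus is convolved
# with a Poisson-shaped HRF (lag parameter lambda) to form the design column, then
# Y = X beta + noise. Values are illustrative, not those used in the talk.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

T, lam, beta, sigma = 200, 6.0, 1.5, 0.5           # scans, HRF lag, activation strength, noise sd

stimulus = np.zeros(T)
stimulus[20:30] = stimulus[80:90] = stimulus[140:150] = 1.0   # task blocks

hrf = poisson.pmf(np.arange(0, 25), mu=lam)        # Poisson HRF kernel over discrete lags
X = np.convolve(stimulus, hrf)[:T]                 # convolved design column (T x 1, p = 1)

Y = X * beta + rng.normal(0.0, sigma, size=T)      # BOLD signal + noise for one voxel

print("peak of expected BOLD response:", round(float((X * beta).max()), 3))
```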
  • 10. Spatial Nonparametric Prior
  Objective: capture correlation within and across subjects.
  Proposed prior: spike-and-slab nonparametric prior,
      βiν | γiν, Gi ∼ γiν Gi + (1 − γiν) δ0
  Clustering of subjects based on similar activation patterns; clustering of voxels within each subject according to similar activation strengths.
  [Figure: schematic of Subjects 1 to 5 grouped by similar activation patterns.]
  (Zhang et al., 2016 Annals of Applied Statistics)
  • 11. Posterior Inference
  Standard approach: MCMC
  MH updates to jointly update (β, γ), with a combination of add-delete-swap steps and sampling-based methods for DP models (based on the Chinese restaurant franchise).
  Update λ with Metropolis-Hastings (MH) schemes.
  Update other model parameters.
  Faster strategy: Variational Inference (Matlab GUI)
  Combine VI schemes with importance sampling steps on other model parameters (Carbonetto and Stephens, 2012).
  Posterior maps of the γiν, βiν, λiν.
  • 12. Software: NPBayes-fMRI GUI (https://github.com/marinavannucci)
  [Figure: reaction time to correct trials (ms) for nonwords vs. pseudohomophones, and pseudohomophone effect (ms), for Cluster 1 (n = 12) and Cluster 2 (n = 16); * p < .05, ** p < .001.]
  Fischer-Baum et al. (2018). Individual differences in the neural and cognitive mechanisms of single word reading. Frontiers in Human Neuroscience, 12:271.
  (Kook et al., 2019 Statistics in Biosciences)
  • 13. Edge Selection in Graphical Models
  Undirected graphical models:
  Use a graph structure to summarize conditional independence between variables.
  Each node represents a variable.
  No edge exists between two nodes if and only if the corresponding variables are conditionally independent given all the others.
  Fundamental assumption: the true conditional dependence structure is reasonably sparse.
  • 14. Gaussian Graphical Model (GGM)
  Data follow a joint multivariate normal distribution
      xl ∼ N(µ, Ω−1),   l = 1, . . . , n
  ωi,j = 0 if no edge exists between nodes i and j in graph G.
  For simplicity, we column-center the data to assume µ = 0.
  A small numerical check of this correspondence follows below.
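A small numerical sketch of the GGM setup, with an illustrative 4-node graph: zero off-diagonal entries of the precision matrix Ω correspond to missing edges, and the empirical partial correlations estimated from simulated data are near zero exactly there.

```python
# Minimal sketch of the GGM: build a sparse precision matrix Omega whose zero
# off-diagonal entries encode missing edges, sample x_l ~ N(0, Omega^{-1}), and check
# that empirical partial correlations are near zero where Omega is zero.
import numpy as np

rng = np.random.default_rng(2)

Omega = np.array([[2.0, 0.6, 0.0, 0.0],      # edges: (1,2), (2,3), (3,4); no edges (1,3), (1,4), (2,4)
                  [0.6, 2.0, 0.6, 0.0],
                  [0.0, 0.6, 2.0, 0.6],
                  [0.0, 0.0, 0.6, 2.0]])

X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Omega), size=5000)
X -= X.mean(axis=0)                           # column-center so mu = 0 can be assumed

Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.diag(Omega_hat))
partial_corr = -Omega_hat / np.outer(d, d)    # partial correlations from the precision matrix
np.fill_diagonal(partial_corr, 1.0)

print(np.round(partial_corr, 2))              # entries for non-edges are close to 0
```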
  • 15. Choices for Edge Selection Priors
  G-Wishart priors enforce exact zeros but are limited in scalability (Roverato 2002, Dobra et al. 2011).
  Continuous shrinkage prior of Wang (BA 2015):
      p(Ωk | G, v0, v1, λ) ∝ ∏i<j N(ωi,j | 0, v²gij) ∏i Exp(ωi,i | λ/2)
  defined by introducing binary latent variables of edge inclusion, G ≡ (gi,j)i<j ∈ {0, 1}p(p−1)/2, with vgij = v0 (small) if gi,j = 0 and vgij = v1 (large) if gi,j = 1.
  Avoids the use of priors over the space of positive definite matrices with fixed zeros.
  Improves computational efficiency. An evaluation sketch of this prior follows below.
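A minimal sketch that evaluates the unnormalized log prior under this formulation; the hyperparameter values and the example Ω and G are illustrative, not taken from the talk.

```python
# Minimal sketch of the continuous shrinkage prior of Wang (2015): evaluate the
# unnormalized log prior of Omega given edge indicators G, with N(0, v0^2) on absent
# edges, N(0, v1^2) on present edges, and Exp(lambda/2) on the diagonal.
import numpy as np
from scipy.stats import norm, expon

def log_prior_omega(Omega, G, v0=0.02, v1=1.0, lam=1.0):
    p = Omega.shape[0]
    iu = np.triu_indices(p, k=1)
    sd = np.where(G[iu] == 1, v1, v0)                               # v_{g_ij}: slab sd or spike sd
    log_offdiag = norm.logpdf(Omega[iu], loc=0.0, scale=sd).sum()
    log_diag = expon.logpdf(np.diag(Omega), scale=2.0 / lam).sum()  # rate lambda/2 -> scale 2/lambda
    return log_offdiag + log_diag                                   # up to the normalizing constant

Omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
G = (np.abs(Omega) > 0.1).astype(int)                               # illustrative edge indicators
np.fill_diagonal(G, 0)

print("unnormalized log prior:", round(log_prior_omega(Omega, G), 3))
```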
  • 16. Bayesian Framework for Multiple Graphs Estimation
      xk,l ∼ N(µk, Ωk−1),   l = 1, . . . , nk,   k = 1, . . . , K
  Graphical model for each group over a common set of variables.
  xk,l, l = 1, . . . , nk, data for subject l in group k.
  Ωk = Σk−1 = (ωi,j,k) is the precision matrix for group k.
  Group graphs may share common features, or could differ substantially.
  • 17. Linking Networks with an MRF Prior
  Encourages selection of common edges for similar groups:
      p(gij | νij, Θ) = exp(νij 1T gij + gijT Θ gij) / C(νij, Θ)
  with gij = (g1ij, . . . , gKij)T the vector of indicators for edge (i, j) across the K groups.
  Spike-and-slab prior on the off-diagonal elements of the K × K matrix Θ:
  nonzero entries indicate network relatedness; their magnitude measures pairwise graph similarity.
  Flexibility of the model: shares information when appropriate, but does not enforce similarity unless supported by the data.
  An enumeration sketch of this prior for a single edge follows below.
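A small enumeration sketch of the MRF prior for one edge: the normalizing constant C(νij, Θ) is computed by brute force over the 2^K configurations, which is feasible only for small K. The values of νij and Θ below are illustrative.

```python
# Minimal sketch of the MRF prior linking graphs: for one edge (i, j), compute
# p(g_ij | nu_ij, Theta) = exp(nu_ij * 1'g + g'Theta g) / C(nu_ij, Theta) by enumerating
# all 2^K binary configurations of the edge across the K groups.
import itertools
import numpy as np

def mrf_edge_prior(nu_ij, Theta):
    K = Theta.shape[0]
    configs = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)
    energy = nu_ij * configs.sum(axis=1) + np.einsum("ck,kl,cl->c", configs, Theta, configs)
    w = np.exp(energy)
    return configs, w / w.sum()                      # normalized probabilities over {0,1}^K

K = 3
Theta = np.zeros((K, K))
Theta[0, 1] = Theta[1, 0] = 0.9                      # groups 1 and 2 are related; group 3 is not
nu_ij = -1.5                                         # negative nu encourages overall sparsity

configs, probs = mrf_edge_prior(nu_ij, Theta)
for g, pr in zip(configs.astype(int), probs):
    print(g, round(pr, 3))                           # shared inclusion in groups 1-2 gets a boost
```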
  • 18. Selection Prior for Network Similarity
  Spike-and-slab prior on the off-diagonal entries of Θ:
      p(θkm | γkm) = (1 − γkm) δ0 + γkm [βα / Γ(α)] θkmα−1 exp(−β θkm)
  i.e., a Gamma(α, β) slab on the parameters measuring network relatedness, to allow shared structure to be favored only when supported by the data.
  Fixed hyperparameters α and β.
  Latent indicator variable γkm of graph relatedness.
  Independent Bernoulli priors on the latent indicators, with probability w ∈ [0, 1].
  A tiny sampling sketch follows below.
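A tiny sampling sketch of this prior on a single θkm, with illustrative hyperparameters; it only confirms the spike mass and the slab mean, nothing more.

```python
# Minimal sketch of the network-similarity prior: theta_km is either exactly zero
# (gamma_km = 0) or drawn from a Gamma(alpha, rate=beta) slab (gamma_km = 1),
# with gamma_km ~ Bernoulli(w). Hyperparameter values are illustrative.
import numpy as np

rng = np.random.default_rng(9)

alpha_shape, beta_rate, w = 2.0, 2.0, 0.5
n_draws = 10_000

gamma_km = rng.binomial(1, w, size=n_draws)
theta_km = np.where(gamma_km == 1,
                    rng.gamma(alpha_shape, 1.0 / beta_rate, size=n_draws), 0.0)

print("P(theta_km = 0) ~", round(float((theta_km == 0).mean()), 3), "(prior spike mass 1 - w =", 1 - w, ")")
print("mean of nonzero theta_km ~", round(float(theta_km[theta_km > 0].mean()), 3),
      "(slab mean alpha/beta =", alpha_shape / beta_rate, ")")
```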
  • 19. Completing the Model
  Edge-specific prior for the edge inclusion parameter:
      p(νij) = [1 / B(a, b)] exp(a νij) / (1 + exp(νij))a+b
  Can encourage sparsity of the graphs G1, . . . , GK.
  Can be used to incorporate prior knowledge of connections between nodes.
  A numerical check of this prior follows below.
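One way to read this prior: it is exactly the density induced on νij = logit(qij) when qij ∼ Beta(a, b), so a and b control the implied prior edge inclusion probability. The sketch below checks this numerically, with illustrative values of a and b.

```python
# Minimal sketch of the edge-specific prior p(nu) = exp(a*nu) / [B(a,b) (1+exp(nu))^(a+b)]:
# it matches the distribution of nu = logit(q) with q ~ Beta(a, b), so a small a relative
# to b pulls the prior edge inclusion probability toward zero (sparser graphs).
import numpy as np
from scipy.special import betaln, expit

def log_prior_nu(nu, a, b):
    return a * nu - (a + b) * np.log1p(np.exp(nu)) - betaln(a, b)

rng = np.random.default_rng(3)
a, b = 1.0, 4.0                                    # favors small inclusion probabilities

q = rng.beta(a, b, size=100_000)                   # q_ij ~ Beta(a, b)
nu = np.log(q) - np.log1p(-q)                      # nu_ij = logit(q_ij) has density p(nu_ij)

# Monte Carlo check: histogram-style density estimates should match exp(log_prior_nu)
grid = np.array([-3.0, -1.0, 0.0, 1.0])
mc = np.array([np.mean(np.abs(nu - g) < 0.05) / 0.10 for g in grid])
print("analytic:", np.exp(log_prior_nu(grid, a, b)).round(3))
print("Monte Carlo:", mc.round(3))
print("implied prior edge probability E[q]:", round(a / (a + b), 3), "~", round(float(expit(nu).mean()), 3))
```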
  • 20. Diagram of the Joint Graph Model
  Peterson et al. (2015, JASA); Shaddox et al. (2019, Biostatistics)
  • 21. Posterior Inference
  MCMC sampling scheme. At each iteration t of the algorithm, sample from the posterior full conditionals:
  (i) Update graph Gk(t) and precision matrix Ωk(t) for k = 1, . . . , K: block Gibbs sampler coupled with a stochastic search variable selection algorithm.
  (ii) Update network relatedness parameters θkm and γkm for 1 ≤ k < m ≤ K using Metropolis-Hastings steps; incorporates both between-model and within-model moves.
  (iii) Update νij for 1 ≤ i < j ≤ p using a standard Metropolis-Hastings step.
  Model selection: posterior marginal probabilities of edge inclusion (PPIs); see the thresholding sketch below.
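The following sketch illustrates the model selection step only: given posterior draws of the edge indicators (simulated stand-ins here, not output from the actual sampler), it computes PPIs and applies the median model rule and a Bayesian FDR cutoff.

```python
# Minimal sketch of edge selection from MCMC output: compute marginal posterior
# probabilities of inclusion (PPIs) from sampled edge indicators, then select edges
# with the median model rule (PPI > 0.5) or a Bayesian FDR threshold.
import numpy as np

rng = np.random.default_rng(4)

n_iter, n_edges = 2000, 40
true_prob = np.where(np.arange(n_edges) < 10, 0.9, 0.05)       # 10 "real" edges, illustrative
samples = rng.binomial(1, true_prob, size=(n_iter, n_edges))   # stand-in g_ij draws across iterations

ppi = samples.mean(axis=0)                                     # PPI for each edge
median_model = ppi > 0.5                                       # median probability model

def bayesian_fdr_threshold(ppi, alpha=0.05):
    """Smallest PPI cutoff whose estimated Bayesian FDR is below alpha."""
    for c in np.sort(np.unique(ppi)):
        selected = ppi > c
        if selected.any() and (1.0 - ppi[selected]).mean() <= alpha:
            return c
    return 1.0

c_fdr = bayesian_fdr_threshold(ppi)
print("edges selected (median model):", median_model.sum())
print("edges selected (Bayesian FDR, alpha=0.05):", (ppi > c_fdr).sum(), "cutoff:", round(c_fdr, 3))
```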
  • 22. [Figure: estimated networks over 25 nodes for the Control, Mild, Moderate, and Severe groups.]
  Simulation results, p = 100:
      Method         TPR (SE)        FPR (SE)        MCC (SE)        AUC (SE)
      Fused GLasso   0.8658 (.0479)  0.2620 (.2632)  0.2775 (.0940)  0.9688 (.0016)
      Group GLasso   0.8496 (.0095)  0.1739 (.0028)  0.3089 (.0042)  0.8982 (.0048)
      Our Method     0.7135 (.2929)  0.0053 (.0037)  0.7453 (.2334)  0.9578 (.0434)
  Estimated network similarity matrix ΘM with off-diagonal entries .95, .97 and .95.
  Case study, pathway disruption counts:
      Pathway  Platform     Total Pairs  100  110  011  001  Total Disrupted
      FcγR     Metabolites  73           17   5    3    18   43
  (Shaddox et al., 2019 Biostatistics)
  • 23. Beyond Gaussian Data
  Linear models: earlier work on GLM/GLMM models (Raftery 1996; Chen & Dunson, 2003; Kinney & Dunson, 2007).
  Data augmentation techniques for logit and log links (Polson et al., 2013; Zhu et al., 2012).
  Recent interest in (zero-inflated) negative binomial and Dirichlet-multinomial regression models for large-scale genomic data, e.g. microbiome (Wadsworth et al. 2017; Koslovsky and Vannucci 2019).
  • 24. Dirichlet-Multinomial Regression
  Multivariate counts modeled as Multinomial(y+, φ), with y+ = Σj=1,...,q yj (e.g. q microbial taxa) and φ ∼ Dirichlet(γ).
  Integrate p covariates as log(γj(xi)) = xi θj.
  Obtain a hierarchical Dirichlet-multinomial log-linear regression:
      Yi | φi ∼ Multinomial(yi+, φi),  independently
      φi | α, β1, . . . , βp ∼ Dirichlet( {exp(αj + Σk=1,...,p βkj xik)}j=1,...,q ),  independently
  with φi the taxa-specific proportions. A forward simulation sketch follows below.
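A minimal forward simulation of this model, with illustrative dimensions, a couple of nonzero covariate/taxon associations, and a common total count per subject assumed for simplicity.

```python
# Minimal simulation sketch of Dirichlet-multinomial regression: covariates enter through
# a log-linear model for the Dirichlet parameters, gamma_ij = exp(alpha_j + x_i' beta_j),
# then phi_i ~ Dirichlet(gamma_i) and Y_i ~ Multinomial(y_i+, phi_i).
import numpy as np

rng = np.random.default_rng(5)

n, p, q, total_count = 100, 4, 8, 5000          # subjects, covariates, taxa, reads per subject

X = rng.normal(size=(n, p))
alpha = rng.normal(0.0, 0.5, size=q)
beta = np.zeros((p, q))
beta[0, 0], beta[1, 3] = 1.0, -1.0              # a couple of true covariate/taxon associations

gamma = np.exp(alpha + X @ beta)                # n x q Dirichlet parameters, gamma_j(x_i)
phi = np.array([rng.dirichlet(g) for g in gamma])
Y = np.array([rng.multinomial(total_count, ph) for ph in phi])

print("count table shape:", Y.shape)
print("mean count of taxon 1 by sign of covariate 1:",
      round(float(Y[X[:, 0] > 0, 0].mean()), 1), "vs", round(float(Y[X[:, 0] <= 0, 0].mean()), 1))
```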
  • 25. Variable Selection Prior
  Spike-and-slab prior selects relevant taxa/predictor associations:
      βkj | ξkj ∼ ξkj N(0, t²kj) + (1 − ξkj) δ0(βkj)
  Stochastic search MCMC for posterior inference:
  1. Draw auxiliary variables for DM(γj(x))
  2. Update α using Metropolis-Hastings
  3. Update (ξ, β) jointly via stochastic search MCMC (Savitsky et al., 2011); a sketch of the move types follows below.
  [Figure: networks of selected positive and negative KEGG pathway/taxon associations, at MPPI > 0.981.]
  Wadsworth et al. (2017, BMC Bioinformatics); Koslovsky and Vannucci (2019, submitted) for Dirichlet-tree multinomial distributions.
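The sketch below illustrates only the generic add/delete/swap move types used in stochastic search samplers of this kind: the proposal draws a slab value from an assumed N(0, slab_sd²) for a newly added coefficient and zeroes a deleted one. The Metropolis-Hastings acceptance step is omitted, so this is a schematic, not the exact sampler of Wadsworth et al.

```python
# Generic sketch of add/delete/swap moves for joint (xi, beta) updates in stochastic
# search variable selection. Only the proposal is shown; the acceptance ratio of
# posteriors is omitted. slab_sd is an assumed illustrative value.
import numpy as np

rng = np.random.default_rng(6)

def propose(xi, beta, slab_sd=1.0):
    xi_new, beta_new = xi.copy(), beta.copy()
    included, excluded = np.flatnonzero(xi == 1), np.flatnonzero(xi == 0)
    if included.size and excluded.size and rng.random() < 0.5:      # swap move
        j_in, j_out = rng.choice(excluded), rng.choice(included)
        xi_new[j_in], xi_new[j_out] = 1, 0
        beta_new[j_in], beta_new[j_out] = rng.normal(0.0, slab_sd), 0.0
    else:                                                           # add or delete move
        j = rng.integers(xi.size)
        if xi[j] == 0:
            xi_new[j], beta_new[j] = 1, rng.normal(0.0, slab_sd)
        else:
            xi_new[j], beta_new[j] = 0, 0.0
    return xi_new, beta_new

xi = np.array([1, 0, 0, 1, 0])
beta = np.array([0.8, 0.0, 0.0, -0.5, 0.0])
print(propose(xi, beta))
```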
  • 26. Variational Inference
  Variational inference turns inference into an optimization problem. Faster and more scalable than MCMC.
  Underlying idea: pick a family of distributions q(β, γ) ∈ Q, with free variational parameters, and use gradient descent to minimize the KL divergence between q and the posterior f(β, γ):
      q∗ = arg min q∈Q KL(q ‖ f) = ∫ q(β, γ) log [ q(β, γ) / f(β, γ) ] dβ dγ = log p(y | X) − ELBO
  A toy illustration of this optimization follows below.
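A toy illustration of the idea in a conjugate normal-mean model, where the posterior is Gaussian and known in closed form, so KL(q ‖ posterior) for a Gaussian q can be minimized by plain gradient descent. This is purely pedagogical and far simpler than the models in the talk.

```python
# Toy sketch of the VI idea: for y_i ~ N(beta, 1), beta ~ N(0, 1), the posterior is
# Gaussian, so KL(q || posterior) has a closed form when q(beta) = N(m, exp(rho)^2).
# Gradient descent on (m, rho) drives q toward the exact posterior.
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(1.0, 1.0, size=20)

n = y.size
post_mean, post_var = y.sum() / (n + 1.0), 1.0 / (n + 1.0)     # exact posterior N(post_mean, post_var)

m, rho, lr = 0.0, 0.0, 0.05                                    # variational parameters and step size
for _ in range(2000):
    grad_m = (m - post_mean) / post_var                        # d KL / d m
    grad_rho = -1.0 + np.exp(2.0 * rho) / post_var             # d KL / d rho
    m, rho = m - lr * grad_m, rho - lr * grad_rho

print("VI estimate:      ", round(m, 3), round(float(np.exp(rho)), 3))
print("exact posterior:  ", round(post_mean, 3), round(float(np.sqrt(post_var)), 3))
```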
  • 27. Mean Field VI
  Fully factorized approximation to reduce complexity:
      q(β, γ | θ) = ∏k=1,...,p q(βk, γk; θk)
  Choose approximating distributions from the same family as the prior distributions, to exploit conjugacy.
  Combine with importance sampling schemes on other model parameters (Zhang et al., 2016 AOAS).
  Extend to GLM models with logit and log links by incorporating the data augmentation schemes of Polson et al. (2013) and Zhu et al. (2012) for logistic and negative binomial regression models.
  A coordinate-ascent sketch for the linear regression case follows below.
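A sketch of mean-field coordinate-ascent updates for spike-and-slab linear regression, in the spirit of Carbonetto and Stephens (2012), with a fully factorized q(βk, γk). The simplifications here are assumptions made for brevity, not details from the talk: the slab is N(0, sa2) rather than scaled by σ2, and σ2, sa2 and the prior inclusion probability are held fixed instead of being updated.

```python
# Sketch of mean-field coordinate-ascent VI for spike-and-slab linear regression with
# q(beta_k, gamma_k) factorized over k: q(gamma_k = 1) = alpha_k and
# q(beta_k | gamma_k = 1) = N(mu_k, s2_k). Hyperparameters are fixed (assumed).
import numpy as np

rng = np.random.default_rng(8)

n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)

sigma2, sa2, pi = 1.0, 1.0, 0.1                     # fixed noise variance, slab variance, prior inclusion
xtx = (X ** 2).sum(axis=0)

alpha = np.full(p, pi)                               # q(gamma_k = 1)
mu = np.zeros(p)                                     # slab mean of q(beta_k | gamma_k = 1)
s2 = sigma2 / (xtx + sigma2 / sa2)                   # slab variance (fixed across sweeps)

for _ in range(100):                                 # coordinate-ascent sweeps
    for k in range(p):
        r_k = y - X @ (alpha * mu) + X[:, k] * (alpha[k] * mu[k])   # residual excluding predictor k
        mu[k] = s2[k] * (X[:, k] @ r_k) / sigma2
        log_odds = (np.log(pi / (1 - pi)) + 0.5 * np.log(s2[k] / sa2)
                    + mu[k] ** 2 / (2 * s2[k]))
        alpha[k] = 1.0 / (1.0 + np.exp(-log_odds))

print("PPIs of the 3 true predictors:", alpha[:3].round(2))
print("largest PPI among the rest:", alpha[3:].max().round(2))
```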
  • 28. Summary and Conclusions
  Spike-and-slab priors for variable selection are well suited for applications.
  Flexible structure for the incorporation of external information via MRF priors, probit priors, or spiked DP constructions.
  Priors for edge selection in Gaussian graphical models.
  Methodologies can be extended beyond Gaussian data.
  Variational inference approximations for scalability.
  Neuroimaging and genomics as emerging areas for applications.
  Improved performance over competing approaches.
  • 29. THANKS!
  Eric Kook, PhD student, Rice University.
  Yinsen Miao, PhD student, Rice University.
  Nathan Osborne, PhD student, Rice University.
  Matthew Koslovsky, postdoc, Rice University.
  Elin Shaddox, PhD 2019, postdoc at UC Denver.
  Qiwei Li, Assistant Professor, UT Dallas.
  Duncan Wadsworth, Data Scientist, Microsoft, Seattle.
  Terrance Savitsky, Bureau of Labor Statistics, D.C.
  Alberto Cassese, Assistant Professor, Maastricht, The Netherlands.
  Christine Peterson, Assistant Professor, MD Anderson.
  Francesco Stingo, Assistant Professor, U. Florence, Italy.
  Michele Guindani, Professor, UC Irvine.
  HAMILL INNOVATION AWARD