Statistical sparsity and Bayes factors
Peter McCullagh
Department of Statistics, University of Chicago
SAMSI, Durham NC, April 2019
Joint with N. Polson (Bka, 2018) and M. Tresoldi
Outline
Univariate sparsity
Signal plus noise model Y = X + ε
Sparseness: examples
Sparseness: definition as a limit
Sparseness: Cauchy-type exceedance measures
Marginal density of Y
Tail inflation and α-stable measures
Tweedie formulae
Exceedance probability and Bayes factors
Vector sparsity
Definition of rates and measures
Application to regression
Application to contingency tables
Sparse processes and subset selection
Signal plus noise model (J&S 2004)
n sites: i = 1, . . . , n; (n = 1 suffices here)
Signals Xi ∼ P (sparse but iid)
Gaussian errors εi ∼ N(0, 1) iid
Observations: Yi = Xi + εi (also iid)
Inferential targets:
m(y) = ∫ φ(y − x) P(dx)
P(X ∈ dx | Y = y) = φ(y − x)P(dx)/m(y)
E(X | Y = y) = ??? how much shrinkage?
P(Xi = 0 | Y = y) = local false positive rate
References:
Johnstone & Silverman (2004); Efron (2008; 2009; 2010; 2011);
Benjamini & Hochberg (1995)
Eight examples of statistical sparsity
F arbitrary symmetric; γ > 0 arbitrary const
(i) Atom and F-slab: (1 − ν)δ0 + νF   J&S (2004); Efron (2008)
(ii) G-Spike and F-slab: (1 − ν)N(0, γν²) + νF   R&G (2018)
(iii) L-Spike and F-slab: (1 − ν)L(γν) + νF   G&McC (1993)
(iv) C-Spike and F-slab: (1 − ν)C(γν) + νF
(v) Double gamma: ½ |x|^{ν−1} e^{−γ|x|} / Γ(ν)   G&B (2013)
(vi) Sparse Cauchy: C(ν); density ν/(π(ν² + x²))
(vii) Sparse horseshoe: log(1 + ν²/x²)/(2πν)   CPS (2010)
(viii) Sparse F: |x|^{ν−1} sin(πν/2)/(π(1 + x²))
Mixture fraction ν is small: limν→0 Pν = δ0
Sparsity definition I: sparse limit
Defn I: A family of symmetric distributions {Pν} on R has a
sparse limit as ν → 0 if there exists
(i) a rate parameter ρν → 0 as ν → 0;
(ii) an exceedance measure H(·) such that
lim_{ν→0} ρν⁻¹ Pν(|X| > ε) = H(ε⁺) < ∞ for every ε > 0;
(iii) H is a Lévy measure: ∫ (1 − e^{−x²/2}) H(dx) = 1
(a) Defn satisfied by all examples in literature
(b) What are the implications?
(i) Equivalent families have the same H: e.g., (i)–(iii)
(ii) Non-identifiability of certain functionals
(iii) Sparse approx: Pν(|X| > ε) = ρνH(ε⁺) + o(ρν)
(iv) Sparse-limit approximations for Pν(X | Y)
(v) No big-data implications: n = 1 suffices
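A minimal numerical sketch (Python with numpy/scipy; not part of the talk) of Defn I for the atom-and-F-slab family, taking F = N(0, 3²) as a hypothetical slab. With ρν = ν∫(1 − e^{−x²/2})F(dx) and H = F divided by that same constant, the exceedance ratio equals H(ε⁺) for every ν, so the sparse limit holds trivially here.

    # sketch only: atom-and-slab P_nu = (1 - nu) delta_0 + nu F, with F = N(0, 3^2) chosen for illustration
    import numpy as np
    from scipy import integrate, stats

    F = stats.norm(scale=3.0)                         # hypothetical symmetric slab
    c, _ = integrate.quad(lambda x: (1 - np.exp(-x**2 / 2)) * F.pdf(x), -np.inf, np.inf)

    def exceedance_ratio(nu, eps):
        """rho_nu^{-1} P_nu(|X| > eps), with rho_nu = nu * c."""
        return nu * 2 * F.sf(eps) / (nu * c)          # the atom at 0 contributes nothing for eps > 0

    def H_exceed(eps):
        """H(eps+) = F(|X| > eps) / c, so that int (1 - e^{-x^2/2}) H(dx) = 1."""
        return 2 * F.sf(eps) / c

    for nu in (0.1, 0.01, 0.001):
        print(nu, exceedance_ratio(nu, 1.0), H_exceed(1.0))   # equal for every nu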
Sparsity II: Formal integral definition
Defn II: {Pν} has a sparse limit with rate ρν if there exists a
measure H such that
lim_{ν→0} ρν⁻¹ ∫_R w(x) Pν(dx) = ∫_R w(x) H(dx) < ∞
for every w in the space W...
Lévy-integrable functions: bounded continuous functions w(x) such that x⁻²w(x) is also bounded and continuous;
e.g., min(x², 1); x²e^{−x²}; 1 − e^{−x²}; (cosh(tx) − 1)e^{−x²}
Implication: sparse approximations are restricted to functions in W!
Defn III: Unit measure: ∫ (1 − e^{−x²/2}) H(dx) = 1
zeta function: ζ(t) = ∫ (cosh(tx) − 1) e^{−x²/2} H(dx)
Why define sparsity as a limit?
(i) In practice, ρ is small, say ρ < 0.05; so also is ν
(ii) But ν is an arbitrary parameterization, whereas ρ is not
(iii) Two families having the same H are first-order equivalent
(iv) ∫ (1 − e^{−x²/2}) H(dx) = 1 implies H(|x| > 1) ≈ 1, so
Pν(|X| > 1) = ρH(1⁺) + o(ρ) ≈ ρ, and ∫ φ(x)ζ(x) dx = 1
(v) the limit allows us to develop approximations such as
Pν(|X| > 1 | Y) = ρζ(y)/(1 + ρζ(y)) + o(1)
that are based on sparsity rather than sample size.
Four roles of the zeta function
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx)
1. Marginal density of Y = X + ε
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
2. Tweedie MGF formula:
E(e^{tX} | Y) = (1 − ρ + ρζ(y + t)) / (1 − ρ + ρζ(y))
3. As a likelihood-ratio statistic: L(ρ̂)/L(0) = max{1, ζ(y)}
4. As a Bayes factor:
odds(‖β‖ > ε | Y) / odds(‖β‖ > ε) = ζ(‖P_X y‖) / H(ε⁺)
Sparse families of Cauchy type
Sparse Cauchy: X ∼ Cauchy(σ = ν)
σ⁻¹ Pσ(X ∈ dx) = dx/(π(σ² + x²)) → dx/(πx²)
lim_{σ→0} σ⁻¹ Pσ(|X| > ε) = ∫_ε^∞ 2 dx/(πx²) = 2/(πε)
Sparse horseshoe: Pσ(dx) = log(1 + σ²/x²) dx/(2πσ)
lim_{σ→0} σ⁻¹ Pσ(dx) → dx/(2πx²)
H(dx) = dx/(x²√(2π)): the inverse-square unit measure on R
Rates: Cauchy: ρν = σ√(2/π); Horseshoe: ρν = σ/√(2π)
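A small numerical check (Python with numpy/scipy; an added sketch, not from the talk) of the statements above: the Cauchy exceedance limit 2/(πε), and the unit normalization of the inverse-square measure.

    import numpy as np
    from scipy import integrate, stats

    eps = 1.0
    for sigma in (0.1, 0.01, 0.001):
        print(sigma, 2 * stats.cauchy(scale=sigma).sf(eps) / sigma)   # -> 2/(pi*eps) ~ 0.6366

    # unit normalization: int over R of (1 - e^{-x^2/2}) dx / (x^2 sqrt(2 pi)) = 1
    val, _ = integrate.quad(lambda x: -np.expm1(-x**2 / 2) / (x**2 * np.sqrt(2 * np.pi)), 0, np.inf)
    print(2 * val)                                                    # ~ 1.0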
Double gamma family
Double gamma density (Griffin & Brown 2013):
2pν(x) = |x|^{ν−1} exp(−|x|)/Γ(ν) ≈ ν |x|^{ν−1} exp(−|x|)
Unit exceedance density: h(x) = K⁻¹ |x|⁻¹ exp(−|x|)/2
Normalization const: K = ∫ (1 − e^{−x²/2}) |x|⁻¹ e^{−|x|} dx / 2
Sparsity rate: ρν = K⁻¹ν ≈ 3.75ν
The measure H is not finite, but its activity index is zero:
AI(H) = inf{α > 0 : ∫_{−1}^{1} |x|^α H(dx) < ∞} = 0
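A one-line numerical check (Python with numpy/scipy; an added sketch) of the normalization constant quoted above.

    import numpy as np
    from scipy import integrate

    # K = (1/2) int (1 - e^{-x^2/2}) |x|^{-1} e^{-|x|} dx = int_0^inf (1 - e^{-x^2/2}) e^{-x} / x dx
    K, _ = integrate.quad(lambda x: -np.expm1(-x**2 / 2) * np.exp(-x) / x, 0, np.inf)
    print(K, 1 / K)            # roughly 0.267 and 3.75, consistent with rho_nu ~ 3.75 nu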
Marginal density: sparse limit approximation
Sparse signal plus Gaussian noise model: Y = X + ε
mν(y) = ∫_R φ(y − x) Pν(dx)
... details in handout
...
= (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx); ζ(0) = 0
The product ψ(y) = φ(y)ζ(y) is a probability density!
...symmetric and bimodal
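A numerical sketch (Python with numpy/scipy; added for illustration) for the inverse-square unit measure H(dx) = dx/(x²√(2π)): ζ is evaluated by direct integration, and ψ = φζ comes out symmetric, bimodal, and integrating to nearly one, with the remainder in its heavy tails.

    import numpy as np
    from scipy import integrate

    SQRT2PI = np.sqrt(2 * np.pi)

    def zeta(y):
        """zeta(y) = int_R (cosh(yx) - 1) e^{-x^2/2} dx / (x^2 sqrt(2 pi))."""
        def f(x):
            if abs(y * x) < 1e-6:                 # series branch: avoids cancellation near x = 0
                return 0.5 * y * y * np.exp(-0.5 * x * x)
            coshm1 = (0.5 * (np.exp(y * x - 0.5 * x * x) + np.exp(-y * x - 0.5 * x * x))
                      - np.exp(-0.5 * x * x))     # (cosh(yx) - 1) e^{-x^2/2}, overflow-free form
            return coshm1 / (x * x)
        val, _ = integrate.quad(f, 0, np.inf)
        return 2 * val / SQRT2PI

    phi = lambda y: np.exp(-0.5 * y * y) / SQRT2PI
    ys = np.linspace(-30, 30, 1201)
    psi = np.array([phi(y) * zeta(y) for y in ys])
    print(zeta(0.0))                              # 0, as stated above
    print(np.trapz(psi, ys))                      # a little below 1; the rest sits in the heavy tails
    print(abs(ys[psi.argmax()]))                  # mode well away from 0: psi is bimodal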
Inverse-power α-stable exceedance measures
H(dx) ∝ dx/|x|^{α+1}; (0 < α < 2)
Prob density ψ(y) = φ(y)ζ(y) (inverse-power tail)
Figure: tail-inflated densities ψα(y) for inverse-power exceedance measures; curves labelled d = 2.0, 1.5, 1.0, 0.5, 0.1, plus the inverse-square exceedance.
Tail inflation: inverse-square versus Gaussian
Figure: tail-inflated densities φ(x)ζ(x) for the Gaussian and inverse-square exceedance measures.
Five implications for inference
(i) Asymptotic marginal density of X + ε is a mixture
(φ ⋆ Pν)(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + O(ρ²)
(ii) Two families having the same H are indistinguishable!
e.g., (1 − ν)δ0 + νF and (1 − ν)N(0, ν²) + νF
(iii) the rate parameter is identifiable: mν(0)/φ(0) = 1 − ρ
(iv) the null atom Pν({0}) is not identifiable
(v) If Pν = (1 − ν)δ0 + νF is a sparse spike-F mixture,
... the mixture fractions in Pν and mν are not equal!
ρ = ν ∫ (1 − e^{−x²/2}) F(dx) < ν
Tweedie’s formula for conditional moments
The conditional mgf is
E(e^{tX} | Y) = m(y + t)φ(y) / (φ(y + t)m(y)) = (1 − ρ + ρζ(y + t)) / (1 − ρ + ρζ(y)) + o(ρ)
Bayes estimate of the signal:
E(X | Y) = (d/dy) log{m(y)/φ(y)} = ρζ′(y) / (1 − ρ + ρζ(y)) + o(ρ)
Depends only on the exceedance measure (& rate)
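A hedged numerical sketch (Python with numpy/scipy; not from the talk) of the Bayes estimate for the inverse-square unit measure H(dx) = dx/(x²√(2π)). Differentiating the defining integral twice gives ζ″(y) = e^{y²/2} with ζ(0) = ζ′(0) = 0, so ζ′(y) = ∫₀^y e^{u²/2} du and ζ(y) = ∫₀^y (y − u) e^{u²/2} du, which makes the shrinkage curve easy to evaluate.

    import numpy as np
    from scipy import integrate

    def zeta(y):
        return integrate.quad(lambda u: (y - u) * np.exp(u * u / 2), 0, y)[0]

    def zeta_prime(y):
        return integrate.quad(lambda u: np.exp(u * u / 2), 0, y)[0]

    def posterior_mean(y, rho):
        """E(X | Y = y) = rho zeta'(y) / (1 - rho + rho zeta(y)), to o(rho)."""
        return rho * zeta_prime(y) / (1 - rho + rho * zeta(y))

    rho = 0.02
    for y in (0.5, 1.5, 2.5, 3.5, 5.0):
        print(y, round(posterior_mean(y, rho), 3))
    # strong shrinkage towards zero for moderate y; E(X | Y) approaches y for large y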
Bayes estimate of signal
H(dx) = dx/(x²√(2π));   E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
Figure: conditional expected value of the signal, E(signal | Y = y) versus y, at sparsity levels ν = 1/4^k (from ν = 0.25 down to ν ≈ 0.00024).
Conditional density of signal
Cauchy prior ρ = 0.02; y = 3.5; ζ(y) = 55.3
Figure: conditional density of the signal (Cauchy signal), with y = 3.5, ρ = 0.02, ζ(y) = 55.3.
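A quick numerical check (Python with numpy/scipy; an added sketch) of the value ζ(3.5) = 55.3 quoted above, using ζ(y) = ∫₀^y (y − u) e^{u²/2} du for the inverse-square unit measure (from ζ″(y) = e^{y²/2}), together with the activity probability from the next slide.

    import numpy as np
    from scipy import integrate

    def zeta(y):
        return integrate.quad(lambda u: (y - u) * np.exp(u * u / 2), 0, y)[0]

    rho, y = 0.02, 3.5
    z = zeta(y)
    print(z)                          # ~ 55.3, as quoted on this slide
    print(rho * z / (1 + rho * z))    # conditional activity probability, roughly 0.53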
Conditional activity probability
Double limit: ρ → 0, |y| → ∞ such that ρζ(y) = λ > 0
DL condition (DLC): lim_{y→∞} log ζ(y)/y² = 1/2
Under the DL condition
lim_{ρζ(y)=λ} Pν(|X| > ε | Y) = ρζ(y)/(1 + ρζ(y))
lim_{ρζ(y)=λ} odds(|X| > ε | Y) = ρζ(y) = λ
for every fixed threshold ε > 0
— reasonable thresholds: 0.4 ≤ ε ≤ 0.8 for ρ ≈ 0.01
— DLC fails if H has Gaussian or sub-Gaussian tails
e.g., Pν = (1 − ν)δ0 + νN(0, 5)
Sparse Bayes factor for signal activity
Fix a threshold ε > 0
Signal activity event: ε⁺ = {X : |X| > ε}
Pν(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε | Y) = ρζ(y) + o(1)
Bayes factor for ε⁺-activity:
BF(ε⁺; y) = odds(ε⁺ | Y) / odds(ε⁺) = ζ(y)/H(ε⁺)
— can choose ε ≈ 0.8 so that H(ε⁺) = 1.
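A short numerical note (Python with numpy; an added sketch) on the threshold: for the inverse-square unit measure, H(ε⁺) = 2/(ε√(2π)), so H(ε⁺) = 1 exactly at ε = √(2/π) ≈ 0.80, the value mentioned above; at that threshold the Bayes factor reduces to ζ(y).

    import numpy as np

    def H_exceed(eps):
        # H(eps+) = int over |x| > eps of dx / (x^2 sqrt(2 pi)) = 2 / (eps sqrt(2 pi))
        return 2.0 / (eps * np.sqrt(2 * np.pi))

    print(np.sqrt(2 / np.pi))                  # ~ 0.798: the threshold with H(eps+) = 1
    for eps in (0.4, 0.6, 0.8):                # the "reasonable" range from the previous slide
        print(eps, H_exceed(eps))              # BF(eps+; y) = zeta(y) / H(eps+) scales accordingly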
Vector sparsity in Rd
Essentially identical with Pν spherically symmetric
Standardization: ∫_{R^d} (1 − e^{−‖x‖²/2}) H(dx) = 1
coshd(y) = ∫_{S_d} e^{⟨y, u⟩} U(du)
ζd(t) = ∫_{R^d} (coshd(tx) − 1) e^{−‖x‖²/2} H(dx)
mν(y) = (1 − ρ)φd(y) + ρφd(y)ζd(‖y‖) + o(ρ)
Under the double limit condition...
Pν(‖X‖ > ε | Y) = ρζd(‖y‖)/(1 + ρζd(‖y‖)) + o(1)
BF(ε⁺; y) = odds(‖X‖ > ε | Y) / odds(‖X‖ > ε) = ζd(‖y‖)/H(ε⁺)
Application of vector sparsity to regression
The space Rn is Euclidean with standard inner product
Given an initial subspace: X0 ⊂ Rn
a subspace extension X = span{x1, . . . , xd} ⊂ X0⊥
and an observation Y ∼ N(µ0 + Xβ, In)
with coefficient vector β ∼ Pν vector-sparse in Rd
What is the Bayes factor for the event ‖β‖ > ε?
Mathematical assumptions:
Pν requires an inner product in Rd
so we make the parameter space Euclidean by assumption
β → Xβ is an isometry Rd → Rn (FI metric)
—not component-wise sparsity!
Conclusions under DLC:
odds(‖β‖ > ε | Y) = ρζd(‖P_X y‖)
BF(ε⁺; y) = ζd(‖P_X y‖)/H(ε⁺)
Ten remarks on vector sparsity in regression
(i) limit based on vector sparsity alone ν → 0
(ii) no need for large samples: n = d suffices
(iii) Choice of basis vectors is immaterial: (no orthogonality)
(iv) mν(y) = (1 − ρ)φn(y − µ0) + ρφn(y − µ0)ζd(‖P_X y‖) + o(ρ)
(v) BF( +; y) is a function of the regression SS (not of n)
(vi) Dependence on threshold: BF ∝ ε in the simplest case
(vii) Connection with BIC, if any, is unclear
(viii) assumes σ2 = 1
(ix) vector sparsity is not component-wise sparsity
(x) component-wise sparsity is basis-dependent
Vector sparsity in contingency tables
Setting: a contingency table with observations Yij ∼ Po(µij)
subspaces X0 = row + col; X = X0⊥
β ∈ X sparse in the natural Poisson-induced metric
‖P_X y‖² = Σ_cells (obs_ij − fit_ij)² / fit_ij
odds(‖β‖ > ε | Y) = ρζd(‖P_X y‖)
Table 1: Bayes factor ζd(‖P_X y‖) for chi-squared at three percentiles
Tail Degrees of freedom
prob 1 2 3 4 5 6 7 8 9 10 11 12
0.05 2.9 2.6 2.4 2.3 2.2 2.1 2.1 2.0 2.0 2.0 2.0 1.9
0.01 7.7 6.4 5.7 5.3 5.0 4.8 4.6 4.5 4.4 4.3 4.2 4.1
0.001 32.7 26.7 23.8 21.9 20.6 19.6 18.8 18.1 17.5 17.0 16.6 16.2
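A minimal sketch (Python with numpy; added, with a hypothetical table) of the statistic entering the Bayes factor: with X0 = row + col, ‖P_X y‖² is the Pearson chi-squared statistic of the independence fit, which can then be read against Table 1.

    import numpy as np

    obs = np.array([[18.0, 7.0, 5.0],
                    [12.0, 9.0, 14.0]])        # hypothetical 2 x 3 table of counts
    fit = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()   # independence (row + col) fit
    chi2 = ((obs - fit) ** 2 / fit).sum()      # ~ ||P_X y||^2 in the Poisson-induced metric
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    print(chi2, df)                            # compare zeta_d(sqrt(chi2)) with d = df against Table 1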
Subset selection (to a sparse probabilist)
Need a sparse signal process X = (X1, X2, . . .)
1. X[n] = (X1, . . . , Xn) ∼ Pn,ν
2. Sparsity rate ρν, exceedance measure H = (H1, H2, . . .)
lim_{ν→0} ρν⁻¹ ∫_{R^n} w(x) Pn,ν(dx) = ∫_{R^n} w(x) Hn(dx)
3. Consistency:
Pn,ν(A) = Pn+1,ν(A × R) =⇒ Hn(A) = Hn+1(A × R)
4. Consistency of inverse-power measures for n = 1, 2, . . .
Hn(dx) = (Γ(n/2 + α/2)/π^{n/2}) dx/‖x‖^{n+α},   Kn = ∫_{R^n} (1 − e^{−‖x‖²/2}) Hn(dx) = O(n^{α/2}).
5. Zeta functions
ζn(y) = Kn⁻¹ ∫_{R^n} (cosh(yx) − 1) e^{−‖x‖²/2} Hn(dx)
Sparse process and subset selection (contd)
If we want a conditional distribution on subsets b ⊂ [n]
...we need a process with masses on subspaces Vb
b ⊂ [n] → Vb ⊂ R^n;   H^Q_n(Vb) > 0;   Pν(X ∈ Vb | Y) = ???
1. Singular exceedance process...
H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Hb(dx[b]) δ0(dx[b̄])
K^Q_n = ∫_{R^n} (1 − e^{−‖x‖²/2}) H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Kb.
2. HQ is consistent if Qn(b) = Qn+1(b) + Qn+1(b ∪ {n + 1}).
e.g. Qn(b) = (λ choose #b) / (n + λ choose #b)
Sparse process and subset selection (contd)
1. Marginal distribution of Y = X + ε is a mixture
mν(y) = φn(y) [1 − ρK^Q_n + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ζb(‖y[b]‖)]
      = (1 − ρK^Q_n) φn(y) + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ψ(y[b]) φ(y[b̄])
2. Conditional distribution on subsets B = {i : |Xi| > ε}
Pn,ν(b | Y) ∝ ρ Qn(b) Kb ζb(‖y[b]‖) for b ≠ ∅, and ∝ 1 − ρK^Q_n for b = ∅.
3. Bayes factor for b ⊂ [n] is ζb(‖y[b]‖)
Numerical illustration
n = 3 y = (1.5, 0.5, 2.5); ρ = 0.01;
Conditional intensity ρζ1(y) = (0.105, 0.000, 0.553)
Activity prob: ρζ1(y)/(1 − ρ + ρζ1(y)) = (0.10, 0.000, 0.36)
Independence (λ → ∞):
P(sites 1,3 exclusively active) = 0.036
For λ = 10 (non-independence): Qn(b) = (λ choose #b) / (n + λ choose #b)
excl active sites ∅ 1 2 3 12 13 23 123
P(·) 0.48 0.051 0.000 0.269 0.003 0.173 0.011 0.014
P(sites 1,3 active) ∝ Qn(2) K2 ζ2(√(y1² + y3²)) = 0.187
Summary: Role of a definition
Sparsity implies a characteristic pair (ρ, H)
(i) Equivalent families have the same H
(ii) Zeta function: ζ(y) = ∫ (cosh(yx) − 1) e^{−x²/2} H(dx)
(iii) Marginal density of Y = X + ε is
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y)
(iv) Conditional expectation E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
(v) Conditional distribution of X
oddsν(‖X‖ > ε | Y) = ρζ(y)
(vi) Subset selection requires a sparse process...
... either iid (conventional)
... or exchangeable and with singular H
(vii) Sparsity is neutral on the BFF spectrum