Statistical sparsity and Bayes factors
Peter McCullagh
Department of Statistics, University of Chicago
SAMSI, Durham NC, April 2019
Joint with N. Polson (Bka, 2018) and M. Tresoldi
Outline
Univariate sparsity
Signal plus noise model Y = X + ε
Sparseness: examples
Sparseness: definition as a limit
Sparseness: Cauchy-type exceedance measures
Marginal density of Y
Tail inflation and α-stable measures
Tweedie formulae
Exceedance probability and Bayes factors
Vector sparsity
Definition of rates and measures
Application to regression
Application to contingency tables
Sparse processes and subset selection
Signal plus noise model (J&S 2004)
n sites: i = 1, . . . , n; (n = 1 suffices here)
Signals Xi ∼ P (sparse but iid)
Gaussian errors εi ∼ N(0, 1) iid
Observations: Yi = Xi + εi (also iid)
Inferential targets:
m(y) = ∫ φ(y − x) P(dx)
P(X ∈ dx | Y = y) = φ(y − x)P(dx)/m(y)
E(X | Y = y) = ??? how much shrinkage?
P(Xi = 0 | Y = y) = local false positive rate
References:
Johnstone & Silverman (2004); Efron (2008; 2009; 2010; 2011);
Benjamini & Hochberg (1995)
Eight examples of statistical sparsity
F arbitrary symmetric; γ > 0 arbitrary const
(i) Atom and F-slab: (1 − ν)δ0 + νF   J&S (2004); Efron (2008)
(ii) G-Spike and F-slab: (1 − ν)N(0, γν²) + νF   R&G (2018)
(iii) L-Spike and F-slab: (1 − ν)L(γν) + νF   G&McC (1993)
(iv) C-Spike and F-slab: (1 − ν)C(γν) + νF
(v) Double gamma: ½ |x|^{ν−1} e^{−γ|x|} / Γ(ν)   G&B (2013)
(vi) Sparse Cauchy: C(ν); density ν/(π(ν² + x²))
(vii) Sparse horseshoe: log(1 + ν²/x²)/(2πν)   CPS (2010)
(viii) Sparse F: |x|^{ν−1} sin(πν/2)/(π(1 + x²))
Mixture fraction ν is small: limν→0 Pν = δ0
Sparsity definition I: sparse limit
Defn I: A family of symmetric distributions {Pν} on R has a
sparse limit as ν → 0 if there exists
(i) a rate parameter ρν → 0 as ν → 0;
(ii) an exceedance measure H(·) such that
lim_{ν→0} ρν⁻¹ Pν(|X| > ε) = H(ε⁺) < ∞ for every ε > 0;
(iii) H is a Lévy measure: ∫ (1 − e^{−x²/2}) H(dx) = 1
(a) Defn satisfied by all examples in literature
(b) What are the implications?
(i) Equivalent families have the same H: e.g., (i)–(iii)
(ii) Non-identifiability of certain functionals
(iii) Sparse approx: Pν(|X| > ε) = ρνH(ε⁺) + o(ρν)
(iv) Sparse-limit approximations for Pν(X | Y)
(v) No big-data implications: n = 1 suffices
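A minimal numerical sketch (Python with numpy/scipy; not part of the talk) of Defn I for the atom-and-F-slab family, taking F = N(0, 3²) as a hypothetical slab. With ρν = ν∫(1 − e^{−x²/2})F(dx) and H = F divided by that same constant, the exceedance ratio equals H(ε⁺) for every ν, so the sparse limit holds trivially here.

    # sketch only: atom-and-slab P_nu = (1 - nu) delta_0 + nu F, with F = N(0, 3^2) chosen for illustration
    import numpy as np
    from scipy import integrate, stats

    F = stats.norm(scale=3.0)                         # hypothetical symmetric slab
    c, _ = integrate.quad(lambda x: (1 - np.exp(-x**2 / 2)) * F.pdf(x), -np.inf, np.inf)

    def exceedance_ratio(nu, eps):
        """rho_nu^{-1} P_nu(|X| > eps), with rho_nu = nu * c."""
        return nu * 2 * F.sf(eps) / (nu * c)          # the atom at 0 contributes nothing for eps > 0

    def H_exceed(eps):
        """H(eps+) = F(|X| > eps) / c, so that int (1 - e^{-x^2/2}) H(dx) = 1."""
        return 2 * F.sf(eps) / c

    for nu in (0.1, 0.01, 0.001):
        print(nu, exceedance_ratio(nu, 1.0), H_exceed(1.0))   # equal for every nu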
Sparsity II: Formal integral definition
Defn II: {Pν} has a sparse limit with rate ρν if there exists a
measure H such that
lim_{ν→0} ρν⁻¹ ∫_R w(x) Pν(dx) = ∫_R w(x) H(dx) < ∞
for every w in the space W...
Lévy-integrable functions: bounded continuous functions w(x) such that x⁻²w(x) is also bounded and continuous;
e.g., min(x², 1); x²e^{−x²}; 1 − e^{−x²}; (cosh(tx) − 1)e^{−x²}
Implication: sparse approximations are restricted to functions in W!
Defn III: Unit measure: ∫ (1 − e^{−x²/2}) H(dx) = 1
zeta function: ζ(t) = ∫ (cosh(tx) − 1) e^{−x²/2} H(dx)
Why define sparsity as a limit?
(i) In practice, ρ is small, say ρ < 0.05; so also is ν
(ii) But ν is an arbitrary parameterization, whereas ρ is not
(iii) Two families having the same H are first-order equivalent
(iv) ∫ (1 − e^{−x²/2}) H(dx) = 1 implies H(|x| > 1) ≈ 1, so
Pν(|X| > 1) = ρH(1⁺) + o(ρ) ≈ ρ, and ∫ φ(x)ζ(x) dx = 1
(v) the limit allows us to develop approximations such as
Pν(|X| > 1 | Y) = ρζ(y)/(1 + ρζ(y)) + o(1)
that are based on sparsity rather than sample size.
Four roles of the zeta function
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx)
1. Marginal density of Y = X + ε
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
2. Tweedie MGF formula:
E(e^{tX} | Y) = (1 − ρ + ρζ(y + t)) / (1 − ρ + ρζ(y))
3. As a likelihood-ratio statistic: L(ρ̂)/L(0) = max{1, ζ(y)}
4. As a Bayes factor:
odds(‖β‖ > ε | Y) / odds(‖β‖ > ε) = ζ(‖P_X y‖) / H(ε⁺)
Sparse families of Cauchy type
Sparse Cauchy: X ∼ Cauchy(σ = ν)
σ⁻¹ Pσ(X ∈ dx) = dx/(π(σ² + x²)) → dx/(πx²)
lim_{σ→0} σ⁻¹ Pσ(|X| > ε) = ∫_ε^∞ 2 dx/(πx²) = 2/(πε)
Sparse horseshoe: Pσ(dx) = log(1 + σ²/x²) dx/(2πσ)
lim_{σ→0} σ⁻¹ Pσ(dx) → dx/(2πx²)
H(dx) = dx/(x²√(2π)): the inverse-square unit measure on R
Rates: Cauchy: ρν = σ√(2/π); Horseshoe: ρν = σ/√(2π)
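A small numerical check (Python with numpy/scipy; an added sketch, not from the talk) of the statements above: the Cauchy exceedance limit 2/(πε), and the unit normalization of the inverse-square measure.

    import numpy as np
    from scipy import integrate, stats

    eps = 1.0
    for sigma in (0.1, 0.01, 0.001):
        print(sigma, 2 * stats.cauchy(scale=sigma).sf(eps) / sigma)   # -> 2/(pi*eps) ~ 0.6366

    # unit normalization: int over R of (1 - e^{-x^2/2}) dx / (x^2 sqrt(2 pi)) = 1
    val, _ = integrate.quad(lambda x: -np.expm1(-x**2 / 2) / (x**2 * np.sqrt(2 * np.pi)), 0, np.inf)
    print(2 * val)                                                    # ~ 1.0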
Double gamma family
Double gamma density (Griffin & Brown 2013):
2pν(x) = |x|^{ν−1} exp(−|x|)/Γ(ν) ≈ ν |x|^{ν−1} exp(−|x|)
Unit exceedance density: h(x) = K⁻¹ |x|⁻¹ exp(−|x|)/2
Normalization const: K = ∫ (1 − e^{−x²/2}) |x|⁻¹ e^{−|x|} dx / 2
Sparsity rate: ρν = K⁻¹ν ≈ 3.75ν
The measure H is not finite, but its activity index is zero:
AI(H) = inf{α > 0 : ∫_{−1}^{1} |x|^α H(dx) < ∞} = 0
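A one-line numerical check (Python with numpy/scipy; an added sketch) of the normalization constant quoted above.

    import numpy as np
    from scipy import integrate

    # K = (1/2) int (1 - e^{-x^2/2}) |x|^{-1} e^{-|x|} dx = int_0^inf (1 - e^{-x^2/2}) e^{-x} / x dx
    K, _ = integrate.quad(lambda x: -np.expm1(-x**2 / 2) * np.exp(-x) / x, 0, np.inf)
    print(K, 1 / K)            # roughly 0.267 and 3.75, consistent with rho_nu ~ 3.75 nu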
Marginal density: sparse limit approximation
Sparse signal plus Gaussian noise model: Y = X + ε
mν(y) = ∫_R φ(y − x) Pν(dx)
... details in handout
...
= (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx); ζ(0) = 0
The product ψ(y) = φ(y)ζ(y) is a probability density!
...symmetric and bimodal
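A numerical sketch (Python with numpy/scipy; added for illustration) for the inverse-square unit measure H(dx) = dx/(x²√(2π)): ζ is evaluated by direct integration, and ψ = φζ comes out symmetric, bimodal, and integrating to nearly one, with the remainder in its heavy tails.

    import numpy as np
    from scipy import integrate

    SQRT2PI = np.sqrt(2 * np.pi)

    def zeta(y):
        """zeta(y) = int_R (cosh(yx) - 1) e^{-x^2/2} dx / (x^2 sqrt(2 pi))."""
        def f(x):
            if abs(y * x) < 1e-6:                 # series branch: avoids cancellation near x = 0
                return 0.5 * y * y * np.exp(-0.5 * x * x)
            coshm1 = (0.5 * (np.exp(y * x - 0.5 * x * x) + np.exp(-y * x - 0.5 * x * x))
                      - np.exp(-0.5 * x * x))     # (cosh(yx) - 1) e^{-x^2/2}, overflow-free form
            return coshm1 / (x * x)
        val, _ = integrate.quad(f, 0, np.inf)
        return 2 * val / SQRT2PI

    phi = lambda y: np.exp(-0.5 * y * y) / SQRT2PI
    ys = np.linspace(-30, 30, 1201)
    psi = np.array([phi(y) * zeta(y) for y in ys])
    print(zeta(0.0))                              # 0, as stated above
    print(np.trapz(psi, ys))                      # a little below 1; the rest sits in the heavy tails
    print(abs(ys[psi.argmax()]))                  # mode well away from 0: psi is bimodal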
Inverse-power α-stable exceedance measures
H(dx) ∝ dx/|x|^{α+1}; (0 < α < 2)
Prob density ψ(y) = φ(y)ζ(y) (inverse-power tail)
Figure: tail-inflated densities ψα(y) for inverse-power exceedance measures; curves labelled d = 2.0, 1.5, 1.0, 0.5, 0.1, plus the inverse-square exceedance.
Tail inflation: inverse-square versus Gaussian
Figure: tail-inflated densities φ(x)ζ(x) for the Gaussian and inverse-square exceedance measures.
Five implications for inference
(i) Asymptotic marginal density of X + ε is a mixture
(φ ⋆ Pν)(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + O(ρ²)
(ii) Two families having the same H are indistinguishable!
e.g., (1 − ν)δ0 + νF and (1 − ν)N(0, ν²) + νF
(iii) the rate parameter is identifiable: mν(0)/φ(0) = 1 − ρ
(iv) the null atom Pν({0}) is not identifiable
(v) If Pν = (1 − ν)δ0 + νF is a sparse spike-F mixture,
... the mixture fractions in Pν and mν are not equal!
ρ = ν ∫ (1 − e^{−x²/2}) F(dx) < ν
Tweedie’s formula for conditional moments
The conditional mgf is
E(e^{tX} | Y) = m(y + t)φ(y) / (φ(y + t)m(y)) = (1 − ρ + ρζ(y + t)) / (1 − ρ + ρζ(y)) + o(ρ)
Bayes estimate of the signal:
E(X | Y) = (d/dy) log{m(y)/φ(y)} = ρζ′(y) / (1 − ρ + ρζ(y)) + o(ρ)
Depends only on the exceedance measure (& rate)
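A hedged numerical sketch (Python with numpy/scipy; not from the talk) of the Bayes estimate for the inverse-square unit measure H(dx) = dx/(x²√(2π)). Differentiating the defining integral twice gives ζ″(y) = e^{y²/2} with ζ(0) = ζ′(0) = 0, so ζ′(y) = ∫₀^y e^{u²/2} du and ζ(y) = ∫₀^y (y − u) e^{u²/2} du, which makes the shrinkage curve easy to evaluate.

    import numpy as np
    from scipy import integrate

    def zeta(y):
        return integrate.quad(lambda u: (y - u) * np.exp(u * u / 2), 0, y)[0]

    def zeta_prime(y):
        return integrate.quad(lambda u: np.exp(u * u / 2), 0, y)[0]

    def posterior_mean(y, rho):
        """E(X | Y = y) = rho zeta'(y) / (1 - rho + rho zeta(y)), to o(rho)."""
        return rho * zeta_prime(y) / (1 - rho + rho * zeta(y))

    rho = 0.02
    for y in (0.5, 1.5, 2.5, 3.5, 5.0):
        print(y, round(posterior_mean(y, rho), 3))
    # strong shrinkage towards zero for moderate y; E(X | Y) approaches y for large y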
Bayes estimate of signal
H(dx) = dx/(x²√(2π));   E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
Figure: conditional expected value of the signal, E(signal | Y = y) versus y, at sparsity levels ν = 1/4^k (from ν = 0.25 down to ν ≈ 0.00024).
Conditional density of signal
Cauchy prior ρ = 0.02; y = 3.5; ζ(y) = 55.3
Figure: conditional density of the signal (Cauchy signal), with y = 3.5, ρ = 0.02, ζ(y) = 55.3.
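A quick numerical check (Python with numpy/scipy; an added sketch) of the value ζ(3.5) = 55.3 quoted above, using ζ(y) = ∫₀^y (y − u) e^{u²/2} du for the inverse-square unit measure (from ζ″(y) = e^{y²/2}), together with the activity probability from the next slide.

    import numpy as np
    from scipy import integrate

    def zeta(y):
        return integrate.quad(lambda u: (y - u) * np.exp(u * u / 2), 0, y)[0]

    rho, y = 0.02, 3.5
    z = zeta(y)
    print(z)                          # ~ 55.3, as quoted on this slide
    print(rho * z / (1 + rho * z))    # conditional activity probability, roughly 0.53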
Conditional activity probability
Double limit: ρ → 0, |y| → ∞ such that ρζ(y) = λ > 0
DL condition (DLC): lim_{y→∞} log ζ(y)/y² = 1/2
Under the DL condition
lim_{ρζ(y)=λ} Pν(|X| > ε | Y) = ρζ(y)/(1 + ρζ(y))
lim_{ρζ(y)=λ} odds(|X| > ε | Y) = ρζ(y) = λ
for every fixed threshold ε > 0
— reasonable thresholds: 0.4 ≤ ε ≤ 0.8 for ρ ≈ 0.01
— DLC fails if H has Gaussian or sub-Gaussian tails
e.g., Pν = (1 − ν)δ0 + νN(0, 5)
Sparse Bayes factor for signal activity
Fix a threshold ε > 0
Signal activity event: ε⁺ = {X : |X| > ε}
Pν(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε | Y) = ρζ(y) + o(1)
Bayes factor for ε⁺-activity:
BF(ε⁺; y) = odds(ε⁺ | Y) / odds(ε⁺) = ζ(y)/H(ε⁺)
— can choose ε ≈ 0.8 so that H(ε⁺) = 1.
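A short numerical note (Python with numpy; an added sketch) on the threshold: for the inverse-square unit measure, H(ε⁺) = 2/(ε√(2π)), so H(ε⁺) = 1 exactly at ε = √(2/π) ≈ 0.80, the value mentioned above; at that threshold the Bayes factor reduces to ζ(y).

    import numpy as np

    def H_exceed(eps):
        # H(eps+) = int over |x| > eps of dx / (x^2 sqrt(2 pi)) = 2 / (eps sqrt(2 pi))
        return 2.0 / (eps * np.sqrt(2 * np.pi))

    print(np.sqrt(2 / np.pi))                  # ~ 0.798: the threshold with H(eps+) = 1
    for eps in (0.4, 0.6, 0.8):                # the "reasonable" range from the previous slide
        print(eps, H_exceed(eps))              # BF(eps+; y) = zeta(y) / H(eps+) scales accordingly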
Vector sparsity in Rd
Essentially identical with Pν spherically symmetric
Standardization: ∫_{R^d} (1 − e^{−‖x‖²/2}) H(dx) = 1
coshd(y) = ∫_{S_d} e^{⟨y, u⟩} U(du)
ζd(t) = ∫_{R^d} (coshd(tx) − 1) e^{−‖x‖²/2} H(dx)
mν(y) = (1 − ρ)φd(y) + ρφd(y)ζd(‖y‖) + o(ρ)
Under the double limit condition...
Pν(‖X‖ > ε | Y) = ρζd(‖y‖)/(1 + ρζd(‖y‖)) + o(1)
BF(ε⁺; y) = odds(‖X‖ > ε | Y) / odds(‖X‖ > ε) = ζd(‖y‖)/H(ε⁺)
Application of vector sparsity to regression
The space Rn is Euclidean with standard inner product
Given an initial subspace: X0 ⊂ Rn
a subspace extension X = span{x1, . . . , xd} ⊂ X0⊥
and an observation Y ∼ N(µ0 + Xβ, In)
with coefficient vector β ∼ Pν vector-sparse in Rd
What is the Bayes factor for the event ‖β‖ > ε?
Mathematical assumptions:
Pν requires an inner product in Rd
so we make the parameter space Euclidean by assumption
β → Xβ is an isometry Rd → Rn (FI metric)
—not component-wise sparsity!
Conclusions under DLC:
odds(‖β‖ > ε | Y) = ρζd(‖P_X y‖)
BF(ε⁺; y) = ζd(‖P_X y‖)/H(ε⁺)
Ten remarks on vector sparsity in regression
(i) limit based on vector sparsity alone ν → 0
(ii) no need for large samples: n = d suffices
(iii) Choice of basis vectors is immaterial: (no orthogonality)
(iv) mν(y) = (1 − ρ)φn(y − µ0) + ρφn(y − µ0)ζd(‖P_X y‖) + o(ρ)
(v) BF( +; y) is a function of the regression SS (not of n)
(vi) Dependence on threshold: BF ∝ ε in the simplest case
(vii) Connection with BIC, if any, is unclear
(viii) assumes σ2 = 1
(ix) vector sparsity is not component-wise sparsity
(x) component-wise sparsity is basis-dependent
Vector sparsity in contingency tables
Setting: a contingency table with observations Yij ∼ Po(µij)
subspaces X0 = row + col; X = X0⊥
β ∈ X sparse in the natural Poisson-induced metric
‖P_X y‖² = Σ_cells (obs_ij − fit_ij)² / fit_ij
odds(‖β‖ > ε | Y) = ρζd(‖P_X y‖)
Table 1: Bayes factor ζd(‖P_X y‖) for chi-squared at three percentiles
Tail Degrees of freedom
prob 1 2 3 4 5 6 7 8 9 10 11 12
0.05 2.9 2.6 2.4 2.3 2.2 2.1 2.1 2.0 2.0 2.0 2.0 1.9
0.01 7.7 6.4 5.7 5.3 5.0 4.8 4.6 4.5 4.4 4.3 4.2 4.1
0.001 32.7 26.7 23.8 21.9 20.6 19.6 18.8 18.1 17.5 17.0 16.6 16.2
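A minimal sketch (Python with numpy; added, with a hypothetical table) of the statistic entering the Bayes factor: with X0 = row + col, ‖P_X y‖² is the Pearson chi-squared statistic of the independence fit, which can then be read against Table 1.

    import numpy as np

    obs = np.array([[18.0, 7.0, 5.0],
                    [12.0, 9.0, 14.0]])        # hypothetical 2 x 3 table of counts
    fit = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()   # independence (row + col) fit
    chi2 = ((obs - fit) ** 2 / fit).sum()      # ~ ||P_X y||^2 in the Poisson-induced metric
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    print(chi2, df)                            # compare zeta_d(sqrt(chi2)) with d = df against Table 1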
Subset selection (to a sparse probabilist)
Need a sparse signal process X = (X1, X2, . . .)
1. X[n] = (X1, . . . , Xn) ∼ Pn,ν
2. Sparsity rate ρν, exceedance measure H = (H1, H2, . . .)
lim_{ν→0} ρν⁻¹ ∫_{R^n} w(x) Pn,ν(dx) = ∫_{R^n} w(x) Hn(dx)
3. Consistency:
Pn,ν(A) = Pn+1,ν(A × R) =⇒ Hn(A) = Hn+1(A × R)
4. Consistency of inverse-power measures for n = 1, 2, . . .
Hn(dx) = (Γ(n/2 + α/2)/π^{n/2}) dx/‖x‖^{n+α},   Kn = ∫_{R^n} (1 − e^{−‖x‖²/2}) Hn(dx) = O(n^{α/2}).
5. Zeta functions
ζn(y) = Kn⁻¹ ∫_{R^n} (cosh(yx) − 1) e^{−‖x‖²/2} Hn(dx)
Sparse process and subset selection (contd)
If we want a conditional distribution on subsets b ⊂ [n]
...we need a process with masses on subspaces Vb
b ⊂ [n] → Vb ⊂ R^n;   H^Q_n(Vb) > 0;   Pν(X ∈ Vb | Y) = ???
1. Singular exceedance process...
H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Hb(dx[b]) δ0(dx[b̄])
K^Q_n = ∫_{R^n} (1 − e^{−‖x‖²/2}) H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Kb.
2. HQ is consistent if Qn(b) = Qn+1(b) + Qn+1(b ∪ {n + 1}).
e.g. Qn(b) = (λ choose #b) / (n + λ choose #b)
Sparse process and subset selection (contd)
1. Marginal distribution of Y = X + ε is a mixture
mν(y) = φn(y) [1 − ρK^Q_n + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ζb(‖y[b]‖)]
      = (1 − ρK^Q_n) φn(y) + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ψ(y[b]) φ(y[b̄])
2. Conditional distribution on subsets B = {i : |Xi| > ε}
Pn,ν(b | Y) ∝ ρ Qn(b) Kb ζb(‖y[b]‖) for b ≠ ∅, and ∝ 1 − ρK^Q_n for b = ∅.
3. Bayes factor for b ⊂ [n] is ζb(‖y[b]‖)
Numerical illustration
n = 3 y = (1.5, 0.5, 2.5); ρ = 0.01;
Conditional intensity ρζ1(y) = (0.105, 0.000, 0.553)
Activity prob: ρζ1(y)/(1 − ρ + ρζ1(y)) = (0.10, 0.000, 0.36)
Independence (λ → ∞):
P(sites 1,3 exclusively active) = 0.036
For λ = 10 (non-independence): Qn(b) = (λ choose #b) / (n + λ choose #b)
excl active sites ∅ 1 2 3 12 13 23 123
P(·) 0.48 0.051 0.000 0.269 0.003 0.173 0.011 0.014
P(sites 1,3 active) ∝ Qn(2) K2 ζ2(√(y1² + y3²)) = 0.187
Summary: Role of a definition
Sparsity implies a characteristic pair (ρ, H)
(i) Equivalent families have the same H
(ii) Zeta function: ζ(y) = ∫ (cosh(yx) − 1) e^{−x²/2} H(dx)
(iii) Marginal density of Y = X + ε is
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y)
(iv) Conditional expectation E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
(v) Conditional distribution of X
oddsν(‖X‖ > ε | Y) = ρζ(y)
(vi) Subset selection requires a sparse process...
... either iid (conventional)
... or exchangeable and with singular H
(vii) Sparsity is neutral on the BFF spectrum