The talk is concerned with a particular definition of statistical sparsity as a stochastic limit. The limit definition is satisfied by every example that has been proposed in the literature on sparse signal detection, so, in that sense, it is uncontroversial. Nonetheless, the definition has implications for sparse signal detection. For example, it puts very specific limits on the types of inferential questions (integrals or conditional expectations) that we can hope to address. It also implies that certain pairs of sparse models are first-order equivalent, or effectively equivalent.
(Joint work with Nick Polson)
MUMS: Bayesian, Fiducial, and Frequentist Conference - Statistical Sparsity, Peter McCullagh, April 29, 2019
1. Statistical sparsity and Bayes factors
Peter McCullagh
Department of Statistics, University of Chicago
SAMSI, Durham NC, April 2019
Joint with N. Polson (Bka, 2018) and M. Tresoldi
2. Outline
Univariate sparsity
Signal plus noise model Y = X + ε
Sparseness: examples
Sparseness: definition as a limit
Sparseness: Cauchy-type exceedance measures
Marginal density of Y
Tail inflation and α-stable measures
Tweedie formulae
Exceedance probability and Bayes factors
Vector sparsity
Definition of rates and measures
Application to regression
Application to contingency tables
Sparse processes and subset selection
3. Signal plus noise model (J&S 2004)
n sites: i = 1, . . . , n; (n = 1 suffices here)
Signals Xi ∼ P (sparse but iid)
Gaussian errors εi ∼ N(0, 1) iid
Observations: Yi = Xi + εi (also iid)
Inferential targets:
m(y) = ∫ φ(y − x) P(dx)
P(X ∈ dx | Y = y) = φ(y − x) P(dx)/m(y)
E(X | Y = y) = ? (how much shrinkage?)
P(Xi = 0 | Y = y) = local false positive rate
References:
Johnstone & Silverman (2004); Efron (2008; 2009; 2010; 2011);
Benjamini & Hochberg (1995)
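The inferential targets above can be computed by direct numerical integration. A minimal sketch for example (i), the atom-and-slab prior, assuming a Gaussian N(0, τ²) slab (an illustrative choice; the slab F in the talk is an arbitrary symmetric distribution):

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)
STEP, LIM = 0.01, 30.0            # quadrature step and truncation

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / SQRT2PI

# Atom-and-slab prior P = (1 - nu) delta_0 + nu N(0, tau^2); nu and tau are
# assumed illustrative values, not quantities fixed by the talk.
nu, tau = 0.05, 3.0

def slab(x):
    return phi(x / tau) / tau

GRID = [k * STEP for k in range(-int(LIM / STEP), int(LIM / STEP) + 1)]

def marginal(y):
    """m(y) = int phi(y - x) P(dx); the atom contributes (1 - nu) phi(y)."""
    s = sum(phi(y - x) * slab(x) for x in GRID) * STEP
    return (1 - nu) * phi(y) + nu * s

def post_mean(y):
    """E(X | Y = y); the atom at zero contributes nothing to the numerator."""
    s = sum(x * phi(y - x) * slab(x) for x in GRID) * STEP
    return nu * s / marginal(y)

def null_prob(y):
    """Local false positive rate P(X = 0 | Y = y)."""
    return (1 - nu) * phi(y) / marginal(y)
```

Near the origin the posterior is dominated by the atom (heavy shrinkage); for large |y| the slab takes over and the false-positive rate collapses.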
4. Eight examples of statistical sparsity
F arbitrary symmetric; γ > 0 arbitrary constant
(i) Atom and F-slab: (1 − ν)δ0 + νF   J&S (2004); Efron (2008)
(ii) G-spike and F-slab: (1 − ν)N(0, γν²) + νF   R&G (2018)
(iii) L-spike and F-slab: (1 − ν)L(γν) + νF   G&McC (1993)
(iv) C-spike and F-slab: (1 − ν)C(γν) + νF
(v) Double gamma: |x|^{ν−1} e^{−|γx|}/(2Γ(ν))   G&B (2013)
(vi) Sparse Cauchy: C(ν); density νπ^{−1}/(ν² + x²)
(vii) Sparse horseshoe: log(1 + ν²/x²)/(2πν)   CPS (2010)
(viii) Sparse F: sin(πν/2) |x|^{ν−1} π^{−1}/(1 + x²)
Mixture fraction ν is small: lim_{ν→0} Pν = δ0
5. Sparsity definition I: sparse limit
Defn I: A family of symmetric distributions {Pν} on R has a
sparse limit as ν → 0 if there exists
(i) a rate parameter ρν → 0 as ν → 0;
(ii) an exceedance measure H(·) such that
lim_{ν→0} ρν^{−1} Pν(|X| > ε) = H(ε⁺) < ∞
for every ε > 0, where ε⁺ = {x : |x| > ε};
(iii) H is a Lévy measure: ∫(1 − e^{−x²/2}) H(dx) = 1
(a) Defn satisfied by all examples in literature
(b) What are the implications?
(i) Equivalent families have the same H: e.g., (i)–(iii)
(ii) Non-identifiability of certain functionals
(iii) Sparse approx: Pν(|X| > ε) = ρν H(ε⁺) + o(ρν)
(iv) Sparse-limit approximations for Pν(X | Y)
(v) No big-data implications: n = 1 suffices
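The limit in Defn I can be checked in closed form for the sparse Cauchy family (vi). A sketch, with the exceedance measure taken as the inverse-square measure H(dx) = dx/(x²√(2π)) used in later slides; the constant ρν = ν√(2/π) is assumed to follow from the unit-measure normalization:

```python
import math

def P_exceed(nu, eps):
    """P_nu(|X| > eps) = (2/pi) arctan(nu/eps) for the Cauchy C(nu)."""
    return (2 / math.pi) * math.atan(nu / eps)

def H_exceed(eps):
    """H(eps+) = 2/(eps sqrt(2 pi)) for H(dx) = dx/(x^2 sqrt(2 pi))."""
    return 2 / (eps * math.sqrt(2 * math.pi))

def rho(nu):
    """Sparsity rate rho_nu = nu * sqrt(2/pi) (unit-measure convention)."""
    return nu * math.sqrt(2 / math.pi)

# As nu -> 0, rho_nu^{-1} P_nu(|X| > eps) -> H(eps+) for every eps > 0.
```

The ratio converges for every fixed threshold, which is exactly condition (ii) of the definition.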
6. Sparsity II: formal integral definition
Defn II: {Pν} has a sparse limit with rate ρν if there exists a
measure H such that
lim_{ν→0} ρν^{−1} ∫_R w(x) Pν(dx) = ∫_R w(x) H(dx) < ∞
for every w in the space W of
Lévy-integrable functions: bounded continuous functions w(x)
such that x^{−2} w(x) is also bounded and continuous.
e.g., min(x², 1); x² e^{−x²}; 1 − e^{−x²}; (cosh(tx) − 1) e^{−x²}
Implication: sparse approximations are restricted to functions in W!
Defn III: Unit measure: ∫(1 − e^{−x²/2}) H(dx) = 1
Zeta function: ζ(t) = ∫(cosh(tx) − 1) e^{−x²/2} H(dx)
7. Why define sparsity as a limit?
(i) In practice, ρ is small, say ρ < 0.05; so also is ν
(ii) But ν is an arbitrary parameterization, whereas ρ is not
(iii) Two families having the same H are first-order equivalent
(iv) ∫(1 − e^{−x²/2}) H(dx) = 1 implies H(1⁺) ≈ 1, so
Pν(|X| > 1) = ρH(1⁺) + o(ρ) ≈ ρ; also ∫ φ(x)ζ(x) dx = 1
(v) the limit allows us to develop approximations such as
Pν(|X| > 1 | Y) = ρζ(y)/(1 + ρζ(y)) + o(1)
that are based on sparsity rather than sample size.
8. Four roles of the zeta function
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx)
1. Marginal density of Y = X + ε:
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
2. Tweedie MGF formula:
E(e^{tX} | Y) = (1 − ρ + ρζ(y + t))/(1 − ρ + ρζ(y))
3. As a L-R statistic: L(ρ̂)/L(0) = max(1, ζ(y))
4. As a Bayes factor:
odds(‖β‖ > ε | Y)/odds(‖β‖ > ε) = ζ(‖P_X y‖)/H(ε⁺)
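For the inverse-square measure H(dx) = dx/(x²√(2π)) used in later slides, ζ can be computed directly: differentiating twice under the integral sign gives ζ″(y) = e^{y²/2} with ζ(0) = ζ′(0) = 0, hence ζ(y) = ∫₀^y (y − t) e^{t²/2} dt. A numerical sketch checking both representations:

```python
import math

def zeta(y, n=100000):
    """Trapezoidal sum for zeta(y) = int_0^y (y - t) e^{t^2/2} dt."""
    h = y / n
    s = 0.5 * y                 # half-weight endpoint t = 0; t = y contributes 0
    for k in range(1, n):
        t = k * h
        s += (y - t) * math.exp(t * t / 2)
    return s * h

def zeta_direct(y, h=1e-3, lim=40.0):
    """Direct evaluation of int (cosh(yx) - 1) e^{-x^2/2} H(dx),
    H(dx) = dx/(x^2 sqrt(2 pi))."""
    s = 0.0
    for k in range(1, int(lim / h) + 1):
        x = k * h
        s += (math.cosh(y * x) - 1) * math.exp(-x * x / 2) / (x * x)
    return 2 * s * h / math.sqrt(2 * math.pi)
```

In particular zeta(3.5) reproduces the value ζ(y) = 55.3 quoted on the conditional-density slide for the Cauchy prior.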
13. Double gamma family
Double gamma density (Griffin & Brown 2013):
2pν(x) = |x|^{ν−1} exp(−|x|)/Γ(ν) ≈ ν|x|^{ν−1} exp(−|x|)
Unit exceedance density: h(x) = K^{−1}|x|^{−1} exp(−|x|)/2
Normalization const: K = ∫(1 − e^{−x²/2}) |x|^{−1} e^{−|x|} dx/2
Sparsity rate: ρν = K^{−1}ν ≈ 3.75ν
H is not finite, but the activity index is zero:
AI(H) = inf{α > 0 : ∫_{−1}^{1} |x|^α H(dx) < ∞} = 0
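The normalizing constant K can be evaluated by one-dimensional quadrature; by symmetry K = ∫₀^∞ (1 − e^{−x²/2}) e^{−x}/x dx, and 1/K reproduces the 3.75 quoted above. A sketch:

```python
import math

def K(h=1e-3, lim=50.0):
    """K = int_0^inf (1 - e^{-x^2/2}) e^{-x} / x dx (Riemann sum);
    the integrand behaves like x/2 near 0, so there is no singularity."""
    s = 0.0
    for k in range(1, int(lim / h) + 1):
        x = k * h
        s += (1 - math.exp(-x * x / 2)) * math.exp(-x) / x
    return s * h
```

Evaluating gives K ≈ 0.267, i.e. ρν = ν/K ≈ 3.75ν.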
14. Marginal density: sparse limit approximation
Sparse signal plus Gaussian noise model: Y = X + ε
mν(y) = ∫_R φ(y − x) Pν(dx)
... details in handout ...
= (1 − ρ)φ(y) + ρφ(y)ζ(y) + o(ρ)
ζ(y) = ∫_R (cosh(yx) − 1) e^{−x²/2} H(dx); ζ(0) = 0
The product ψ(y) = φ(y)ζ(y) is a probability density!
...symmetric and bimodal
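The symmetry and bimodality of ψ = φζ can be checked numerically; a sketch for the inverse-square measure H(dx) = dx/(x²√(2π)) (the running example of later slides):

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(y):
    return math.exp(-y * y / 2) / SQRT2PI

def zeta(y, h=1e-3, lim=30.0):
    """zeta for the inverse-square measure H(dx) = dx/(x^2 sqrt(2 pi))."""
    s = 0.0
    for k in range(1, int(lim / h) + 1):
        x = k * h
        s += (math.cosh(y * x) - 1) * math.exp(-x * x / 2) / (x * x)
    return 2 * s * h / SQRT2PI

def psi(y):
    """Tail-inflated density psi(y) = phi(y) zeta(y)."""
    return phi(y) * zeta(y)
```

Since ζ(0) = 0, ψ vanishes at the origin, which forces the two symmetric modes.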
16. Tail inflation: inverse-square versus Gaussian
[Figure: tail-inflated densities φ(x)ζ(x) for the Gaussian and inverse-square exceedance measures]
17. Five implications for inference
(i) Asymptotic marginal density of X + ε is a mixture:
(φ ⋆ Pν)(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y) + O(ρ²)
(ii) Two families having the same H are indistinguishable!
e.g., (1 − ν)δ0 + νF and (1 − ν)N(0, ν²) + νF
(iii) the rate parameter is identifiable: mν(0)/φ(0) = 1 − ρ
(iv) the null atom Pν({0}) is not identifiable
(v) If Pν = (1 − ν)δ0 + νF is a sparse spike-F mixture,
the mixture fractions in Pν and mν are not equal:
ρ = ν ∫(1 − e^{−x²/2}) F(dx) < ν
18. Tweedie’s formula for conditional moments
The conditional mgf is
E(e^{tX} | Y) = m(y + t)φ(y)/(φ(y + t)m(y))
= (1 − ρ + ρζ(y + t))/(1 − ρ + ρζ(y)) + o(ρ)
Bayes estimate of the signal:
E(X | Y) = (d/dy) log(m(y)/φ(y)) = ρζ′(y)/(1 − ρ + ρζ(y)) + o(ρ)
Depends only on the exceedance measure (and rate)
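The two expressions for the Bayes estimate can be reconciled numerically: with the first-order marginal m(y) = φ(y)(1 − ρ + ρζ(y)), the derivative of log(m/φ) should equal ρζ′(y)/(1 − ρ + ρζ(y)). A finite-difference sketch for the inverse-square measure, for which ζ″(y) = e^{y²/2} with ζ(0) = ζ′(0) = 0:

```python
import math

rho = 0.02     # illustrative sparsity rate (value borrowed from slide 20)

def zeta(y, n=20000):
    """zeta(y) = int_0^y (y - t) e^{t^2/2} dt (Riemann sum)."""
    h = y / n
    return sum((y - k * h) * math.exp((k * h) ** 2 / 2) for k in range(n)) * h

def zeta_prime(y, n=20000):
    """zeta'(y) = int_0^y e^{t^2/2} dt."""
    h = y / n
    return sum(math.exp((k * h) ** 2 / 2) for k in range(n)) * h

def log_m_over_phi(y):
    return math.log(1 - rho + rho * zeta(y))

def tweedie(y, d=1e-4):
    """Central difference of log(m/phi): the Bayes estimate of the signal."""
    return (log_m_over_phi(y + d) - log_m_over_phi(y - d)) / (2 * d)

def formula(y):
    return rho * zeta_prime(y) / (1 - rho + rho * zeta(y))
```

At y = 0 both give zero (total shrinkage); for large y the estimate approaches y.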
19. Bayes estimate of signal
H(dx) = dx/(x²√(2π));  E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
[Figure: E(signal | Y = y) versus y at sparsity levels ν = 1/4^k, from ν = 0.25 down to ν = 0.00024]
20. Conditional density of signal
Cauchy prior: ρ = 0.02; y = 3.5; ζ(y) = 55.3
[Figure: conditional density of the signal X given Y = 3.5 for the Cauchy signal]
21. Conditional activity probability
Double limit: ρ → 0, |y| → ∞ such that ρζ(y) = λ > 0
DL condition (DLC): lim_{y→∞} log ζ(y)/y² = 1/2
Under the DL condition,
lim_{ρζ(y)=λ} Pν(|X| > ε | Y) = ρζ(y)/(1 + ρζ(y))
lim_{ρζ(y)=λ} odds(|X| > ε | Y) = ρζ(y) = λ
for every fixed threshold ε > 0
— reasonable thresholds: 0.4 ≤ ε ≤ 0.8 for ρ ≈ 0.01
— DLC fails if H has Gaussian or sub-Gaussian tails
e.g., Pν = (1 − ν)δ0 + νN(0, 5)
22. Sparse Bayes factor for signal activity
Fix a threshold ε > 0
Signal activity event: ε⁺ = {X : |X| > ε}
Pν(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε) = ρH(ε⁺) + o(ρ)
odds(|X| > ε | Y) = ρζ(y) + o(1)
Bayes factor for ε⁺-activity:
BF(ε⁺; y) = odds(ε⁺ | Y)/odds(ε⁺) = ζ(y)/H(ε⁺)
— can choose ε ≈ 0.8 so that H(ε⁺) = 1.
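For the inverse-square measure H(dx) = dx/(x²√(2π)), the unit-exceedance threshold is available in closed form: H(ε⁺) = 2/(ε√(2π)), so H(ε⁺) = 1 at ε = 2/√(2π) ≈ 0.8, and the Bayes factor then reduces to ζ(y) itself. A sketch:

```python
import math

def H_exceed(eps):
    """H(eps+) = 2/(eps sqrt(2 pi)) for H(dx) = dx/(x^2 sqrt(2 pi))."""
    return 2 / (eps * math.sqrt(2 * math.pi))

eps_unit = 2 / math.sqrt(2 * math.pi)   # threshold with unit exceedance

def bayes_factor(zeta_y, eps):
    """BF(eps+; y) = zeta(y) / H(eps+)."""
    return zeta_y / H_exceed(eps)
```

With zeta_y = 55.3 (the value on slide 20) and ε = eps_unit, the Bayes factor is 55.3.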
23. Vector sparsity in R^d
Essentially identical, with Pν spherically symmetric
Standardization: ∫_{R^d} (1 − e^{−‖x‖²/2}) H(dx) = 1
cosh_d(y) = ∫_{S_d} e^{⟨y,u⟩} U(du)
ζ_d(t) = ∫_{R^d} (cosh_d(tx) − 1) e^{−‖x‖²/2} H(dx)
mν(y) = (1 − ρ)φ_d(y) + ρφ_d(y)ζ_d(‖y‖) + o(ρ)
Under the double limit condition,
Pν(‖X‖ > ε | Y) = ρζ_d(‖y‖)/(1 + ρζ_d(‖y‖)) + o(1)
BF(ε⁺; y) = odds(‖X‖ > ε | Y)/odds(‖X‖ > ε) = ζ_d(‖y‖)/H(ε⁺)
24. Application of vector sparsity to regression
The space R^n is Euclidean with the standard inner product
Given an initial subspace X0 ⊂ R^n,
a subspace extension X = span{x1, . . . , xd} ⊂ X0^⊥,
and an observation Y ∼ N(µ0 + Xβ, In)
with coefficient vector β ∼ Pν vector-sparse in R^d
What is the Bayes factor for the event ‖β‖ > ε?
Mathematical assumptions:
Pν requires an inner product on R^d,
so we make the parameter space Euclidean by assumption:
β → Xβ is an isometry R^d → R^n (FI metric)
— not component-wise sparsity!
Conclusions under DLC:
odds(‖β‖ > ε | Y) = ρζ_d(‖P_X y‖)
BF(ε⁺; y) = ζ_d(‖P_X y‖)/H(ε⁺)
25. Ten remarks on vector sparsity in regression
(i) limit based on vector sparsity alone ν → 0
(ii) no need for large samples: n = d suffices
(iii) Choice of basis vectors is immaterial: (no orthogonality)
(iv) mν(y) = (1 − ρ)φn(y − µ0) + ρφn(y − µ0)ζ_d(‖P_X y‖) + o(ρ)
(v) BF(ε⁺; y) is a function of the regression SS (not of n)
(vi) Dependence on threshold: BF ∝ ε in the simplest case
(vii) Connection with BIC, if any, is unclear
(viii) assumes σ2 = 1
(ix) vector sparsity is not component-wise sparsity
(x) component-wise sparsity is basis-dependent
26. Vector sparsity in contingency tables
Setting: a contingency table with observations Yij ∼ Po(µij)
Subspaces: X0 = row + col; X = X0^⊥
β ∈ X sparse in the natural Poisson-induced metric
‖P_X y‖² = Σ_cells (obs_ij − fit_ij)²/fit_ij
odds(‖β‖ > ε | Y) = ρζ_d(‖P_X y‖)
Table 1: Bayes factor ζd ( PX y ) for chi-squared at three percentiles
Tail Degrees of freedom
prob 1 2 3 4 5 6 7 8 9 10 11 12
0.05 2.9 2.6 2.4 2.3 2.2 2.1 2.1 2.0 2.0 2.0 2.0 1.9
0.01 7.7 6.4 5.7 5.3 5.0 4.8 4.6 4.5 4.4 4.3 4.2 4.1
0.001 32.7 26.7 23.8 21.9 20.6 19.6 18.8 18.1 17.5 17.0 16.6 16.2
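The squared projection ‖P_X y‖² is the Pearson chi-squared statistic computed from the fitted row + col (independence) model. A minimal sketch on a hypothetical 2×3 table of counts (d = 2 degrees of freedom here, so Table 1's d = 2 column would apply):

```python
def pearson(table):
    """Pearson chi-squared statistic sum (obs - fit)^2 / fit under the
    independence (row + col) fit for a rectangular table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            fit = rows[i] * cols[j] / total   # independence fit
            stat += (obs - fit) ** 2 / fit
    return stat

table = [[12, 7, 9], [5, 10, 11]]   # hypothetical counts, for illustration
```

The Bayes factor for activity of β is then ζ_d evaluated at the square root of this statistic.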
27. Subset selection (to a sparse probabilist)
Need a sparse signal process X = (X1, X2, . . .)
1. X[n] = (X1, . . . , Xn) ∼ Pn,ν
2. Sparsity rate ρν, exceedance measures H = (H1, H2, . . .):
lim_{ν→0} ρν^{−1} ∫_{R^n} w(x) Pn,ν(dx) = ∫_{R^n} w(x) Hn(dx)
3. Consistency:
Pn,ν(A) = Pn+1,ν(A × R) =⇒ Hn(A) = Hn+1(A × R)
4. Consistency of inverse-power measures for n = 1, 2, . . .:
Hn(dx) = Γ(n/2 + α/2) π^{−n/2} dx/‖x‖^{n+α},
Kn = ∫_{R^n} (1 − e^{−‖x‖²/2}) Hn(dx) = O(n^{α/2}).
5. Zeta functions:
ζn(y) = Kn^{−1} ∫_{R^n} (cosh_n(yx) − 1) e^{−‖x‖²/2} Hn(dx)
32. Sparse process and subset selection (contd)
If we want a conditional distribution on subsets b ⊂ [n],
we need a process with masses on the subspaces Vb:
b ⊂ [n] → Vb ⊂ R^n; H^Q_n(Vb) > 0; Pν(X ∈ Vb | Y) = ?
1. Singular exceedance process:
H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Hb(dx[b]) δ0(dx[b̄])
K^Q_n = ∫_{R^n} (1 − e^{−‖x‖²/2}) H^Q_n(dx) = Σ_{b⊂[n], b≠∅} Qn(b) Kb.
2. HQ is consistent if Qn(b) = Qn+1(b) + Qn+1(b ∪ {n + 1}).
e.g. Qn(b) = λ^{#b} C(n + λ, #b)^{−1}
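The consistency condition Qn(b) = Qn+1(b) + Qn+1(b ∪ {n + 1}) says that deleting site n + 1 marginalizes correctly. Any independent-inclusion specification satisfies it; a sketch using a fixed inclusion probability p (a generic illustration, not the λ-family quoted on the slide):

```python
from itertools import combinations

p = 0.3   # hypothetical inclusion probability

def Q(n, b):
    """Q_n(b) = p^{#b} (1-p)^{n-#b} for b a subset of {1,...,n}."""
    k = len(b)
    return p ** k * (1 - p) ** (n - k)

def subsets(n):
    for k in range(n + 1):
        for b in combinations(range(1, n + 1), k):
            yield frozenset(b)

# Check Q_n(b) = Q_{n+1}(b) + Q_{n+1}(b U {n+1}) for every b in [n]
n = 3
ok = all(
    abs(Q(n, b) - (Q(n + 1, b) + Q(n + 1, b | {n + 1}))) < 1e-12
    for b in subsets(n)
)
```

The same check applies verbatim to any mixture of such inclusion laws, which is one generic way of producing consistent Q.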
34. Sparse process and subset selection (contd)
1. Marginal distribution of Y = X + ε is a mixture:
mν(y) = φn(y) [1 − ρK^Q_n + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ζb(‖y[b]‖)]
= (1 − ρK^Q_n) φn(y) + ρ Σ_{b⊂[n], b≠∅} Qn(b) Kb ψ(y[b]) φ(y[b̄])
2. Conditional distribution on subsets B = {i : |Xi| > ε}:
Pn,ν(b | Y) ∝ ρ Qn(b) Kb ζb(‖y[b]‖) for b ≠ ∅,
Pn,ν(∅ | Y) ∝ 1 − ρK^Q_n.
3. Bayes factor for b ⊂ [n] is ζb(‖y[b]‖)
38. Summary: Role of a definition
Sparsity implies a characteristic pair (ρ, H)
(i) Equivalent families have the same H
(ii) Zeta function: ζ(y) = ∫(cosh(yx) − 1) e^{−x²/2} H(dx)
(iii) Marginal density of Y = X + ε is
mν(y) = (1 − ρ)φ(y) + ρφ(y)ζ(y)
(iv) Conditional expectation E(X | Y) = ρζ′(y)/(1 − ρ + ρζ(y))
(v) Conditional distribution of X:
oddsν(|X| > ε | Y) = ρζ(y)
(vi) Subset selection requires a sparse process...
... either iid (conventional)
... or exchangeable and with singular H
(vii) Sparsity is neutral on the BFF spectrum