This document summarizes Hill's method for numerically approximating the eigenvalues and eigenfunctions of differential operators. Hill's method has two main steps:
1. Perform a Floquet-Bloch decomposition to reduce the problem from the real line to the interval [0,L] with periodic boundary conditions, parameterized by the Floquet exponent μ. This gives an operator with a compact resolvent.
2. Approximate the solutions by Fourier series, reducing the problem to a matrix eigenvalue problem that can be solved numerically.
The method is straightforward to implement and effective for various problems involving differential operators on the real line or with periodic boundary conditions. Convergence rates and error bounds for Hill's method are also presented.
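To make step 2 concrete, here is a small numpy sketch (an illustration, not taken from the document) that assembles the Fourier-truncated matrix of the Hill operator $L = -d^2/dx^2 + q(x)$ for the Mathieu-type potential $q(x) = 2\cos(2x)$ and Floquet exponent $\mu = 0$; the potential and truncation size are assumptions.

```python
import numpy as np

N = 64                                   # Fourier truncation: modes k = -N..N (assumption)
k = np.arange(-N, N + 1)

# In the basis e^{i k x}, -d^2/dx^2 contributes k^2 on the diagonal, and the
# e^{±2ix} terms of q(x) = 2*cos(2x) couple modes k and k ± 2 with weight 1.
A = np.diag(k.astype(float) ** 2)
A[np.abs(k[:, None] - k[None, :]) == 2] += 1.0

evals = np.sort(np.linalg.eigvalsh(A))  # the reduced matrix eigenvalue problem
print(evals[:5])                         # lowest periodic eigenvalues (approx.)
```

For a nonzero Floquet exponent $\mu$, the diagonal becomes $(k + \mu)^2$, giving one matrix problem per value of $\mu$.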
Clustering in Hilbert geometry for machine learning, by Frank Nielsen
- The document discusses different geometric approaches for clustering multinomial distributions, including total variation distance, Fisher-Rao distance, Kullback-Leibler divergence, and Hilbert cross-ratio metric.
- It benchmarks k-means clustering using these four geometries on the probability simplex, finding that Hilbert geometry clustering yields good performance with theoretical guarantees.
- The Hilbert cross-ratio metric defines a non-Riemannian Hilbert geometry on the simplex with polytopal balls, and satisfies information monotonicity properties desirable for clustering distributions.
This document discusses prior selection for mixture estimation. It begins by introducing mixture models and their common parameterization. It then discusses several types of weakly informative priors that can be used for mixture models, including empirical Bayes priors, hierarchical priors, and reparameterizations. It notes challenges with using improper priors for mixture models. The document also discusses saturated priors when the number of components is not known beforehand. It covers Jeffreys priors for mixtures and issues around propriety. It proposes some reparameterizations of mixtures, like using moments or a spherical reparameterization, that allow proper Jeffreys-like priors to be defined.
This document discusses priors for mixture models. It introduces weakly informative priors like symmetric empirical Bayes priors and dependent priors. Improper independent priors are problematic for mixtures. Reparameterization techniques are discussed to define proper Jeffreys priors, including expressing components as local perturbations, using moments, and spherical reparameterization. Specific examples for Gaussian and Poisson mixtures show valid reparameterizations that lead to proper posteriors.
Bayesian hybrid variable selection under generalized linear models, by Caleb (Shiqiang) Jin
This document presents a method for Bayesian variable selection under generalized linear models. It begins by introducing the model setting and Bayesian model selection framework. It then discusses three algorithms for model search: deterministic search, stochastic search, and a hybrid search method. The key contribution is a method to simultaneously evaluate the marginal likelihoods of all neighbor models, without parallel computing. This is achieved by decomposing the coefficient vectors and estimating additional coefficients conditioned on the current model's coefficients. Newton-Raphson iterations are used to solve the system of equations and obtain the maximum a posteriori estimates for all neighbor models simultaneously in a single computation. This allows for a fast, inexpensive search of the model space.
11. Generalized and subset integrated autoregressive moving average bilinear t..., by Alexander Decker
This document proposes generalized integrated autoregressive moving average bilinear (GBL) time series models and subset generalized integrated autoregressive moving average bilinear (GSBL) models to achieve stationarity for all nonlinear time series. It presents the models' formulations and discusses their properties, including stationarity, convergence, and parameter estimation. An algorithm is provided to fit the one-dimensional models. The generalized models are applied to Wolfer sunspot numbers and the GBL model is found to perform better than the GSBL model.
Approximate Bayesian Computation with Quasi-Likelihoods, by Stefano Cabras
This document describes ABC-MCMC algorithms that use quasi-likelihoods as proposals. It introduces quasi-likelihoods as approximations to true likelihoods that can be estimated from pilot runs. The ABCql algorithm uses a quasi-likelihood estimated from a pilot run as the proposal in an ABC-MCMC algorithm. Examples applying ABCql to mixture of normals, coalescent, and gamma models are provided to demonstrate its effectiveness compared to standard ABC-MCMC.
On the solvability of a system of forward-backward linear equations with unbo..., by Nikita V. Artamonov
The document discusses a system of forward-backward linear evolution equations (FBEE) with unbounded operator coefficients. It introduces the necessary mathematical framework including a triple of Banach spaces and associated operators. It then defines the system of FBEE, discusses mild solutions, and relates it to a differential operator Riccati equation. The main result is a theorem stating that under certain assumptions on the operators, including accretivity of A, the Riccati equation has a unique mild solution.
New Mathematical Tools for the Financial Sector, by SSA KPI
AACIMP 2010 Summer School lecture by Gerhard Wilhelm Weber. "Applied Mathematics" stream. "Modern Operational Research and Its Mathematical Methods with a Focus on Financial Mathematics" course. Part 5.
More info at http://summerschool.ssa.org.ua
Approximate Bayesian Computation (ABC) methods allow approximating intractable likelihoods in Bayesian inference. ABC rejection sampling simulates parameters from the prior and keeps those where the simulated data is close to the observed data. ABC Markov chain Monte Carlo creates a Markov chain over the parameters where proposed moves are accepted if simulated data is similar to observed. Population Monte Carlo and ABC-MCMC improve on rejection sampling by using sequential importance sampling and MCMC moves to propose parameters in high density regions.
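A toy sketch of the ABC rejection step described above, assuming a normal model with known unit variance, the sample mean as summary statistic, and illustrative tolerance and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
y_obs = rng.normal(2.0, 1.0, size=50)          # "observed" data (toy example)
s_obs = y_obs.mean()                           # summary statistic

eps, n_sim = 0.05, 100_000                     # tolerance, number of simulations
theta = rng.normal(0.0, 5.0, size=n_sim)       # 1. simulate parameters from the prior
s_sim = rng.normal(theta, 1.0 / np.sqrt(50))   # 2. simulate the summary (mean of n=50 draws)

accepted = theta[np.abs(s_sim - s_obs) < eps]  # 3. keep draws whose data are close
print(accepted.mean(), accepted.size)          # approximate posterior mean, sample size
```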
Information geometry: Dualistic manifold structures and their uses, by Frank Nielsen
Information geometry: Dualistic manifold structures and their uses
by Frank Nielsen
Talk given at ICML GIMLI2018
http://gimli.cc/2018/
See tutorial at:
https://arxiv.org/abs/1808.08271
"An elementary introduction to information geometry"
An elementary introduction to information geometry, by Frank Nielsen
This document provides an elementary introduction to information geometry. It discusses how information geometry generalizes concepts from Riemannian geometry to study the geometry of decision making and model fitting. Specifically, it introduces:
1. Dually coupled connections (∇, ∇*) that are compatible with a metric tensor g and define dual parallel transport on a manifold.
2. The fundamental theorem of information geometry, which states that if a connection ∇ has constant curvature κ, then its dual connection ∇* has the same constant curvature κ.
3. Examples of statistical manifolds with dually flat geometry that arise from Bregman divergences and f-divergences, making them useful for modeling relationships between probability distributions.
Maximum likelihood estimation of regularisation parameters in inverse problem..., by Valentin De Bortoli
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
The document discusses exponential decay of solutions to a second-order linear differential equation involving a self-adjoint positive operator A and an accretive damping operator D. Several theorems establish conditions under which the associated operator semigroup or pencil generates exponential decay. If D is accretive and satisfies certain positivity conditions, the semigroup will decay exponentially. Explicit bounds on the rate of decay and estimates of the spectrum are provided depending on properties of A and D.
This document discusses approximate Bayesian computation (ABC). ABC allows Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. It introduces ABC, describes how it originated from population genetics models, and outlines some of its limitations and advances, including various related computational methods like ABC with empirical likelihoods. The document also examines how ABC relates to other simulation-based statistical methods and considers in what sense ABC can be regarded as genuinely Bayesian.
This document discusses Approximate Bayesian Computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces an approximation of the posterior distribution by simulating data under different parameter values and accepting simulations that match the observed data. The document provides background on how ABC originated from population genetics models and outlines some of the advances in ABC, including how it can be used as an inference machine to estimate parameters from simulated data.
The document summarizes a talk given by Mark Girolami on manifold Monte Carlo methods. It discusses using stochastic diffusions and geometric concepts to improve MCMC methods. Specifically, it proposes using discretized Langevin and Hamiltonian diffusions across a Riemann manifold as an adaptive proposal mechanism. This is founded on deterministic geodesic flows on the manifold. Examples presented include a warped bivariate Gaussian, Gaussian mixture model, and log-Gaussian Cox process.
Hamilton-Jacobi approach for second order traffic flow models, by Guillaume Costeseque
This document summarizes a presentation on using a Hamilton-Jacobi approach for second order traffic flow models. It begins with an introduction to traffic modeling, discussing both Eulerian and Lagrangian representations of traffic. It then discusses using a variational principle to apply to generic second order traffic flow models (GSOM), which account for additional driver attributes beyond just density. Specifically, it discusses formulating GSOM models in Lagrangian coordinates using a Hamilton-Jacobi framework. The document outlines solving the HJ PDE using characteristics, and decomposing problems into elementary blocks defined by piecewise affine initial, upstream and internal boundary conditions.
Omiros' talk on the Bernoulli factory problem, by BigMC
This document summarizes previous work on simulating events of unknown probability using reverse time martingales. It discusses von Neumann's solution to the Bernoulli factory problem where f(p)=1/2. It also summarizes the Keane-O'Brien existence result, the Nacu-Peres Bernstein polynomial approach, and issues with implementing the Nacu-Peres algorithm at large n due to the large number of strings involved. It proposes developing a reverse time martingale approach to address these issues.
The document discusses Approximate Bayesian Computation (ABC). ABC allows inference for statistical models where the likelihood function is not available in closed form. ABC works by simulating data under different parameter values and comparing simulated to observed data. ABC has been used for model choice by comparing evidence for different models. Consistency of ABC for model choice depends on the criterion used and asymptotic identifiability of the parameters.
Signal Processing Course: Sparse Regularization of Inverse Problems, by Gabriel Peyré
The document discusses sparse regularization for inverse problems. It describes how sparse regularization can be used for tasks like denoising, inpainting, and image separation by posing them as optimization problems that minimize data fidelity and an L1 sparsity prior on the coefficients. Iterative soft thresholding is presented as an algorithm for solving the noisy sparse regularization problem. Examples are given of how sparse wavelet regularization can outperform other regularizers like Sobolev for tasks like image deblurring.
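The iterative soft thresholding scheme mentioned above is short enough to sketch; the following minimal version for $\min_x \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$ uses a random matrix and a synthetic sparse signal as stand-ins for a real measurement operator (all sizes and the value of $\lambda$ are assumptions):

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """Iterative soft thresholding for 0.5*||y - Ax||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L          # gradient step on the data fidelity
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100); x_true[:5] = 3.0       # sparse ground truth
y = A @ x_true + 0.01 * rng.normal(size=40)
print(np.nonzero(ista(A, y, lam=1.0))[0])      # approximately recovers the support
```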
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms, by Christian Robert
Aggregate of three different papers on Rao-Blackwellisation, from Casella & Robert (1996), to Douc & Robert (2010), to Banterle et al. (2015), presented during an OxWaSP workshop on MCMC methods, Warwick, Nov 20, 2015
Dependent processes in Bayesian Nonparametrics, by Julyan Arbel
This document summarizes dependent processes in Bayesian nonparametrics. It motivates the need for dependent random probability measures to accommodate temporal dependence structures beyond the exchangeability assumption. It describes modeling collections of random probability measures indexed by time as either discrete-time or continuous-time processes. The diffusive Dirichlet process is introduced as a dependent Dirichlet process with Dirichlet marginal distributions at each time point and continuous sample paths. Simulation and estimation methods are discussed for this model.
This document provides a probability cheatsheet compiled by William Chen and Joe Blitzstein with contributions from others. It is licensed under CC BY-NC-SA 4.0 and contains information on topics like counting rules, probability definitions, random variables, expectations, independence, and more. The cheatsheet is designed to summarize essential concepts in probability.
This document provides a probability cheatsheet compiled by William Chen and Joe Blitzstein with contributions from others. It is licensed under CC BY-NC-SA 4.0 and contains information on topics like counting rules, probability definitions, random variables, moments, and more. The cheatsheet is regularly updated with comments and suggestions submitted through a GitHub repository.
Talk at CIRM on Poisson equation and debiasing techniques, by Pierre Jacob
- The document discusses debiasing techniques for Markov chain Monte Carlo (MCMC) algorithms.
- It introduces the concept of "fishy functions" which are solutions to Poisson's equation and can be used for control variates to reduce bias and variance in MCMC estimators.
- The document outlines different sections including revisiting unbiased estimation through Poisson's equation, asymptotic variance estimation using a novel "fishy function" estimator, and experiments on different examples.
Asymptotics for discrete random measures, by Julyan Arbel
This document provides an introduction to asymptotics for discrete random measures, specifically the Dirichlet process and the two-parameter Poisson-Dirichlet process. It covers several key aspects:
1) It outlines the stick-breaking construction of the two-parameter Poisson-Dirichlet process and defines related notation.
2) It introduces the truncation error Rn and discusses how its asymptotic behavior differs between the Dirichlet and two-parameter Poisson-Dirichlet cases.
3) It briefly describes applications of these processes in mixture modeling and summarizes sampling approaches, such as blocked Gibbs and slice sampling, that rely on truncation of the infinite-dimensional distributions.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
An introduction to the information theory basis of image/video coding: entropy, rate-distortion theory, and entropy coding methods such as Huffman coding and arithmetic coding.
This document presents Joe Suzuki's work on Bayes independence tests. It discusses both discrete and continuous cases. For the discrete case, it estimates mutual information using maximum likelihood and proposes a Bayesian estimation using Lempel-Ziv compression. This Bayesian estimation is shown to be consistent. For the continuous case, it constructs a generalized Bayesian estimation that is also consistent. It also discusses the Hilbert-Schmidt independence criterion (HSIC) and its limitations. Experiments show the proposed method performs well on both synthetic and real data, while HSIC performs poorly in some cases. The proposed method also has significantly better execution time than HSIC.
The document presents the cooperative-Lasso, a regularization method for variable selection in regression that assumes a sign-coherent group structure. It begins by introducing generalized linear models and the group Lasso estimator. It then notes two limitations of the group Lasso: it does not allow for single zeros within groups, and it does not enforce sign coherence within groups. The cooperative-Lasso is introduced as a penalty that assumes the parameters within each group are either all non-positive, all non-negative, or all null. Examples of applications that could benefit from sign coherence between variables within groups are given.
Equational axioms for probability calculus and modelling of Likelihood ratio ..., by Advanced-Concepts-Team
Based on the theory of meadows, an equational axiomatisation is given for probability functions on finite event spaces. Completeness of the axioms is stated, with some pointers to how that is shown. Then a simplified model of courtroom subjective probabilistic reasoning is provided in terms of a protocol with two proponents: the trier of fact (TOF, the judge) and the moderator of evidence (MOE, the scientific witness). The idea is then outlined of performing a step of Bayesian reasoning by applying a transformation of the subjective probability function of TOF on the basis of different pieces of information obtained from MOE. The central role of the so-called Adams transformation is outlined. A simple protocol is considered where MOE transfers to TOF first a likelihood ratio for a hypothesis H and a potential piece of evidence E, and thereupon the additional assertion that E holds true. As an alternative, a second protocol is considered where MOE transfers two successive likelihoods (the quotient of both being the mentioned ratio), followed by the factuality of E. It is outlined how the Adams transformation allows one to describe information processing on the TOF side in both protocols, and that the resulting probability distribution is the same in both cases. Finally, it is indicated how the Adams transformation also allows the required update of subjective probability on the MOE side, so that both sides in the protocol may be assumed to comply with the demands of subjective probability.
Individualized treatment rules (ITR) assign treatments according to different patients' characteristics. Despite recent advances on the estimation of ITRs, much less attention has been given to uncertainty assessments for the estimated rules. We propose a hypothesis testing procedure for the estimated ITRs from a general framework that directly optimizes overall treatment benefit equipped with sparse penalties. Specifically, we construct a local test for testing low-dimensional components of high-dimensional linear decision rules. The procedure can apply to observational studies by taking into account the additional variability from the estimation of the propensity score. Theoretically, our test extends the decorrelated score test proposed in Ning and Liu (2017, Ann. Stat.) and is valid whether or not model selection consistency for the true parameters holds. The proposed methodology is illustrated with numerical studies and a real data example on electronic health records of patients with Type-II Diabetes.
Hypothesis testing on individualized treatment rules, by Young-Geun Choi
Invited talk in Joint Statistical Meetings 2017, Baltimore, Maryland.
Individualized treatment rules (ITR) assign treatments according to different patients' characteristics. Despite recent advances on the estimation of ITRs, much less attention has been given to uncertainty assessments for the estimated rules. We propose a hypothesis testing procedure for the estimated ITRs from a general framework that directly optimizes overall treatment benefit. Specifically, we construct a local test for testing low-dimensional components of high-dimensional linear decision rules. Our test extends the decorrelated score test proposed in Ning and Liu (2017) and is valid whether or not model selection consistency for the true parameters holds. The proposed methodology is illustrated with a numerical study and data examples.
We propose a regularized method for multivariate linear regression when the number of predictors may exceed the sample size. This method is designed to strengthen the estimation and the selection of the relevant input features with three ingredients: it takes advantage of the dependency pattern between the responses by estimating the residual covariance; it performs selection on direct links between predictors and responses; and selection is driven by prior structural information. To this end, we build on a recent reformulation of the multivariate linear regression model to a conditional Gaussian graphical model and propose a new regularization scheme accompanied with an efficient optimization procedure. On top of showing very competitive performance on artificial and real data sets, our method demonstrates capabilities for fine interpretation of its parameters, as illustrated in applications to genetics, genomics and spectroscopy.
This document discusses nested sampling, a technique for Bayesian computation and evidence evaluation. It begins by introducing Bayesian inference and the evidence integral. It then shows that nested sampling transforms the multidimensional evidence integral into a one-dimensional integral over the prior mass constrained to have likelihood above a given value. The document outlines the nested sampling algorithm and shows that it provides samples from the posterior distribution. It also discusses termination criteria and choices of sample size for the algorithm. Finally, it provides a numerical example of nested sampling applied to a Gaussian model.
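A bare-bones nested sampling loop matching this description, for a toy one-dimensional Gaussian likelihood with a uniform prior on [0, 1]; the constrained-prior draw uses naive rejection, which is only viable for such toy problems (all settings are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def loglike(th):                       # toy Gaussian likelihood centered at 0.5
    return -0.5 * ((th - 0.5) / 0.1) ** 2 - np.log(0.1 * np.sqrt(2 * np.pi))

N, steps = 100, 1000                   # live points, iterations
live = rng.uniform(0.0, 1.0, N)        # draws from the uniform prior
logL = loglike(live)
logZ, logX = -np.inf, 0.0              # evidence accumulator, log prior mass

for i in range(steps):
    worst = np.argmin(logL)            # lowest-likelihood live point
    logX_new = -(i + 1) / N            # expected shrinkage of the prior mass
    logw = np.log(np.exp(logX) - np.exp(logX_new))   # weight w_i = X_{i-1} - X_i
    logZ = np.logaddexp(logZ, logw + logL[worst])    # Z = sum_i L_i * w_i
    logX = logX_new
    while True:                        # replace it by a prior draw with L > L_min
        cand = rng.uniform(0.0, 1.0)
        if loglike(cand) > logL[worst]:
            break
    live[worst], logL[worst] = cand, loglike(cand)

logZ = np.logaddexp(logZ, logX + np.log(np.mean(np.exp(logL))))  # leftover mass
print(logZ)                            # close to log(1) = 0 for this toy problem
```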
This document provides a concise probability cheatsheet compiled by William Chen and others. It covers key probability concepts like counting rules, sampling tables, definitions of probability, independence, unions and intersections, joint/marginal/conditional probabilities, Bayes' rule, random variables and their distributions, expected value, variance, indicators, moment generating functions, and independence of random variables. The cheatsheet is licensed under CC BY-NC-SA 4.0 and the last updated date is March 20, 2015.
This is a progress report presented to the Phylogenomics Group at UVigo in May 2013 about the current status of the software guenomu and the Bayesian model implemented.
At that time I was experimenting with a mixture model, which has since been abandoned, and the Hdist, which is still experimental. The presentation also describes the exchange algorithm for doubly-intractable distributions, the generalized Multiple-Try Metropolis, and the parallel PRNG used to minimize communication between jobs.
This document presents a new model of decision making under risk and uncertainty called the Harmonic Probability Weighting Function (HPWF) model. The HPWF model incorporates mental states using a weak harmonic transitivity axiom and an abstract harmonic representation of noise. It explains phenomena like the conjunction fallacy and preference reversal. The HPWF uses a harmonic component controlled by a phase function to characterize how a decision maker's mental states influence probability weighting. Maximum entropy methods can be used to derive a coherent harmonic probability weighting function from the HPWF model.
Similar to Logit stick-breaking priors for partially exchangeable count data
The Ipsos - AI - Monitor 2024 Report.pdf, by Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
The Building Blocks of QuestDB, a Time Series Database, by javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while remaining performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, and faster batch ingestion.
Learn SQL from basic queries to advanced queries, by manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
End-to-end pipeline agility - Berlin Buzzwords 2024, by Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag..., by sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Logit stick-breaking priors for partially exchangeable count data
1. Logit stick-breaking priors for partially exchangeable count data
Tommaso Rigon
http://tommasorigon.github.io
Bocconi University
SIS 2018, Palermo, 22-06-2018
2. Introduction
Partial exchangeability
A bivariate sequence $(X_i, Y_j)_{i,j \geq 1}$ is partially exchangeable if
$$(X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2}) \overset{d}{=} (X_{\sigma(1)}, \ldots, X_{\sigma(n_1)}, Y_{\sigma'(1)}, \ldots, Y_{\sigma'(n_2)}),$$
for any $n_1, n_2 \geq 1$ and any permutations $\sigma$ and $\sigma'$.
de Finetti's representation theorem
The sequence $(X_i, Y_j)_{i,j \geq 1}$ is partially exchangeable if and only if
$$P(X_1 \in A_1, \ldots, X_{n_1} \in A_{n_1}, Y_1 \in B_1, \ldots, Y_{n_2} \in B_{n_2}) = \int_{\mathcal{P}^2} \prod_{i=1}^{n_1} p_1(A_i) \prod_{j=1}^{n_2} p_2(B_j) \, Q_2(dp_1, dp_2).$$
3. Introduction
Partial exchangeability
Thus, a draw from $(X_i, Y_j)_{i,j \geq 1}$ can be expressed hierarchically:
$$(X_i \mid p_1) \overset{iid}{\sim} p_1, \qquad (Y_j \mid p_2) \overset{iid}{\sim} p_2, \qquad (p_1, p_2) \sim Q_2,$$
where each $(X_i \mid p_1)$ is independent of each $(Y_j \mid p_2)$.
The quantity $(p_1, p_2)$ is a vector of random probability measures and $Q_2$ can be interpreted as their prior law.
If $p_1 \perp\!\!\!\perp p_2$, then the observations $(X_1, \ldots, X_{n_1})$ and $(Y_1, \ldots, Y_{n_2})$ can be modeled separately and independently.
Dependence between $p_1$ and $p_2$ allows for borrowing of information across the sequences.
4. Introduction
Partial exchangeability with count data
Let $Y_1, \ldots, Y_n \in \mathbb{N}$ be a collection of count response variables, each corresponding to a qualitative covariate $x_i \in \{1, \ldots, J\}$.
Each data point $y_i$ is a conditionally independent draw from
$$(Y_i \mid x_i = j) \overset{ind}{\sim} p_j, \qquad i = 1, \ldots, n,$$
where $p_j$ denotes the probability mass function of $(Y_i \mid x_i = j)$.
This is an instance of partial exchangeability with count data.
Model elicitation is completed by specifying a prior law $Q_J$ for the vector of random probability distributions $(p_1, \ldots, p_J) \sim Q_J$.
5. Introduction
Desiderata
We seek a Bayesian inferential procedure which:
- provides a flexible, i.e. nonparametric, estimate for each law $p_j$;
- allows for borrowing of information across the $J$ groups;
- is scalable, in the sense that it is computationally feasible for large $n$ or large $p$;
- has a reasonable interpretation, thus facilitating the incorporation of prior information.
7. Introduction
Bayesian nonparametric mixture models
A flexible Bayesian model for density estimation assumes
$$p(y) = \int_\Theta K(y; \theta) \, dP(\theta),$$
where $K(y; \theta)$ is a known parametric kernel (e.g. Poisson, negative binomial), and $P(\theta)$ is a prior mixing measure.
If the mixing measure is a Dirichlet process (Lo 1984), then exploiting the stick-breaking construction:
$$p(y) = \int_\Theta K(y; \theta) \, dP(\theta) = \sum_{h=1}^{\infty} \pi_h K(y; \theta_h), \qquad \pi_h = \nu_h \prod_{l=1}^{h-1} (1 - \nu_l),$$
with $\theta_h \overset{iid}{\sim} P_0$ and $\nu_h \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, for $h = 1, \ldots, \infty$.
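As a concrete illustration of the stick-breaking construction above (my own sketch, not from the slides), the following numpy code draws from a truncated Dirichlet-process Poisson mixture; the truncation level and all parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, H, n = 1.0, 50, 1000              # DP concentration, truncation, sample size
a0, b0 = 2.0, 0.5                        # base measure P0 = Gamma(a0, b0) for Poisson rates

# Stick-breaking: pi_h = nu_h * prod_{l<h} (1 - nu_l), with nu_h ~ Beta(1, alpha)
nu = rng.beta(1.0, alpha, size=H)
nu[-1] = 1.0                             # truncate so that the weights sum to one
pi = nu * np.concatenate(([1.0], np.cumprod(1.0 - nu[:-1])))

theta = rng.gamma(a0, 1.0 / b0, size=H)  # atoms theta_h ~ P0

G = rng.choice(H, size=n, p=pi)          # mixture component of each observation
y = rng.poisson(theta[G])                # draws from p(y) = sum_h pi_h Pois(y; theta_h)
print(pi[:5].round(3), y[:10])
```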
8. Introduction
The hierarchical Dirichlet process
A popular extension of the Lo model for partially exchangeable data is the hierarchical Dirichlet process (Teh et al. 2006).
In the hierarchical Dirichlet process, for $j = 1, \ldots, J$,
$$p_j(y) = \int_\Theta K(y; \theta) \, dP_j(\theta) = \sum_{h=1}^{\infty} \pi_{hj} K(y; \theta_h), \qquad (P_j \mid P_0) \overset{iid}{\sim} \mathrm{DP}(\alpha P_0), \qquad P_0 \sim \mathrm{DP}(\alpha_0 P_{00}).$$
Under this specification, different groups share the same atoms, while having different mixture weights $\Longrightarrow$ borrowing of information.
Alternative models? Simple conditional algorithms?
9. Introduction
Main contributions
We explored computational, interpretational and theoretical aspects of the logit stick-breaking process (LSBP) of Ren et al. (2011) in the partially exchangeable setting, using count data.
The LSBP can be constructed via sequential logistic regressions, allowing a clearer interpretation of the parameters involved.
For the LSBP we derived an efficient Gibbs sampler based on a Pólya-gamma data augmentation.
We also provide further theoretical support.
10. Logit stick-breaking process
The LSBP model
Our proposal has the same structure as the HDP:
$$p_j(y) = \int_\Theta \mathrm{Pois}(y; \theta) \, dP_j(\theta) = \sum_{h=1}^{\infty} \pi_{hj} \mathrm{Pois}(y; \theta_h), \qquad j = 1, \ldots, J,$$
with conditionally conjugate prior for the atoms $\theta_h \overset{iid}{\sim} \mathrm{Gamma}(a_\theta, b_\theta)$.
The $p_j(y)$'s share the atoms $\theta_h$ and are characterized by group-specific mixing weights.
The mixing weights $\pi_{hj}$ have a stick-breaking representation. Moreover, the prior of the LSBP is different from that of the HDP.
11. Logit stick-breaking process
Hierarchical representation
Samples from a LSBP model can be obtained hierarchically.
For each data point $y_i$, sample the group indicator $G_i$ denoting the mixture component:
$$\mathrm{pr}(G_i = h \mid x_i = j) = \pi_{hj} = \nu_{hj} \prod_{l=1}^{h-1} (1 - \nu_{lj}).$$
Then, conditionally on $G_i$, sample the count response variable from
$$(Y_i \mid G_i = h) \sim \mathrm{Pois}(\theta_h).$$
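A minimal sketch of this two-stage sampling scheme, assuming a truncated process with H components and illustrative group-specific weights (parameter values are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
H, J, n = 20, 2, 500                          # truncation, groups, sample size

# Group-specific weights pi_{hj} = nu_{hj} * prod_{l<h} (1 - nu_{lj})
nu = rng.beta(1.0, 1.0, size=(H, J))
nu[-1, :] = 1.0                               # truncation: last stick takes the rest
pi = nu * np.vstack([np.ones(J), np.cumprod(1.0 - nu[:-1, :], axis=0)])

theta = rng.gamma(2.0, 2.0, size=H)           # shared Poisson atoms theta_h
x = rng.integers(J, size=n)                   # group label x_i of each unit

# Stage 1: component indicator G_i given group x_i; stage 2: Poisson draw
G = np.array([rng.choice(H, p=pi[:, j]) for j in x])
y = rng.poisson(theta[G])
```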
12. Logit stick-breaking process
Sequential interpretation
Can we interpret the stick-breaking weights $\nu_{hj}$?
Yes, indeed they can be rearranged as:
$$\nu_{hj} = \frac{\pi_{hj}}{1 - \sum_{l=1}^{h-1} \pi_{lj}} = \frac{\mathrm{pr}(G_i = h \mid x_i = j)}{\mathrm{pr}(G_i > h - 1 \mid x_i = j)} = \mathrm{pr}(G_i = h \mid G_i > h - 1, x_i = j).$$
Each $\nu_{hj}$ is the probability of being allocated to component $h$, conditionally on the event of having survived the previous components.
Each $\mathbb{I}(G_i = h) = \zeta_{ih}$ is the assignment indicator of each unit to the $h$-th component:
$$\zeta_{ih} = z_{ih} \prod_{l=1}^{h-1} (1 - z_{il}), \qquad (z_{ih} \mid x_i = j) \sim \mathrm{Bern}(\nu_{hj}).$$
13. Logit stick-breaking process
Continuation-ratio logistic regressions
We need some prior specification for the stick-breaking weights $\nu_{hj}$.
Consistently with classical generalized linear models, a natural choice is to define
$$\mathrm{logit}(\nu_{hj}) = \alpha_{hj}, \qquad \text{with } \alpha_h = (\alpha_{h1}, \ldots, \alpha_{hJ})^\intercal \overset{iid}{\sim} N_J(\mu_\alpha, \Sigma_\alpha),$$
independently for every $h = 1, \ldots, \infty$.
If the matrix $\Sigma_\alpha$ is diagonal, then the mixture weights $\pi_{hj}$ are a priori independent across groups.
Stronger borrowing of information, i.e. dependence across the mixing weights, can be induced by non-diagonal choices of $\Sigma_\alpha$.
14. Logit stick-breaking process
Prior quantities
Prior moments
Let $(P_1, \ldots, P_J)$ be a vector of random probability measures induced by the LSBP. Then, for any measurable set $B$ and for any $j$ and $j'$,
$$E\{P_j(B)\} = P_0(B), \qquad \mathrm{cov}\{P_j(B), P_{j'}(B)\} = P_0(B)(1 - P_0(B)) \, \frac{E(\nu_{1j} \nu_{1j'})}{E(\nu_{1j}) + E(\nu_{1j'}) - E(\nu_{1j} \nu_{1j'})}.$$
These expectations do not have a closed-form solution, but they can easily be obtained numerically.
The correlation $\mathrm{corr}\{P_j(B), P_{j'}(B)\}$ does not depend on $B$, and is therefore often interpreted as a global measure of dependence.
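For instance, a Monte Carlo sketch of the correlation under an illustrative correlated bivariate normal prior on $(\alpha_{11}, \alpha_{12})$, using the slide's formula (with $j' = j$ for the variance factors); all parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                  # non-diagonal => dependence
alpha = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
nu = 1.0 / (1.0 + np.exp(-alpha))               # nu_{1j} = logit^{-1}(alpha_{1j})

E1, E2 = nu.mean(axis=0)                        # E(nu_1j), E(nu_1j')
E12 = (nu[:, 0] * nu[:, 1]).mean()              # E(nu_1j nu_1j')
E11, E22 = (nu ** 2).mean(axis=0)               # second moments

cov_f = E12 / (E1 + E2 - E12)                   # covariance factor from the slide
var1 = E11 / (2 * E1 - E11)                     # variance factor (set j' = j)
var2 = E22 / (2 * E2 - E22)
print(round(cov_f / np.sqrt(var1 * var2), 3))   # corr{P_j(B), P_j'(B)}
```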
15. Logit stick-breaking process
Deterministic truncation of the infinite process
The LSBP is an infinite-dimensional process $\Longrightarrow$ computational challenges.
We propose a truncated version of the vector of random probability measures $(P_1, \ldots, P_J)$, which can be regarded as an approximation of the infinite process.
We induce the truncation by letting $\nu_{Hj} = 1$ for some integer $H > 1$, which guarantees that $\sum_{h=1}^{H} \pi_{hj} = 1$ almost surely.
According to Theorem 1 in Rigon and Durante (2018), the "discrepancy" between the two processes is exponentially decreasing in $H$.
18. Posterior inference
The Pólya-gamma data augmentation
The Gibbs sampler is based on the Pólya-gamma data augmentation (Polson et al. 2013), which relies on the integral identity
$$\frac{e^{z_{ih} \psi(x_i)^\intercal \alpha_h}}{1 + e^{\psi(x_i)^\intercal \alpha_h}} = \frac{1}{2} \int_{\mathbb{R}^+} p(\omega_{ih}) \exp\left\{ (z_{ih} - 0.5) \, \psi(x_i)^\intercal \alpha_h - \omega_{ih} (\psi(x_i)^\intercal \alpha_h)^2 / 2 \right\} d\omega_{ih},$$
where $p(\omega_{ih})$ is the density of a Pólya-gamma $\mathrm{PG}(1, 0)$ random variable and $\psi(x_i) = \{\mathbb{I}(x_i = 1), \ldots, \mathbb{I}(x_i = J)\}^\intercal$.
The augmented log-likelihood has a quadratic form $\Longrightarrow$ simple computations and conjugacy with Gaussian priors.
The conditional distribution of $(\omega_{ih} \mid -)$ is still in the class of Pólya-gamma distributions $\Longrightarrow$ conjugacy.
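As a quick sanity check (not in the slides), the identity can be verified numerically using the known Laplace transform of a PG(1, 0) variable, $E[e^{-\omega t}] = 1/\cosh(\sqrt{t/2})$, which with $t = \psi^2/2$ puts the right-hand side in closed form:

```python
import numpy as np

def lhs(z, psi):
    # e^{z psi} / (1 + e^{psi})
    return np.exp(z * psi) / (1.0 + np.exp(psi))

def rhs(z, psi):
    # (1/2) e^{(z - 0.5) psi} E[exp(-omega psi^2 / 2)] with omega ~ PG(1, 0);
    # the PG(1, 0) Laplace transform gives E[exp(-omega t)] = 1/cosh(sqrt(t/2))
    return 0.5 * np.exp((z - 0.5) * psi) / np.cosh(psi / 2.0)

for z in (0.0, 1.0):
    for psi in (-2.0, 0.3, 1.7):
        assert np.isclose(lhs(z, psi), rhs(z, psi))
print("identity verified")
```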
19. Posterior inference
Posterior inference via Gibbs sampling
For $i$ from 1 to $n$: update $G_i$ from the discrete variable with probabilities
$$\mathrm{pr}(G_i = h \mid -) = \frac{\pi_{h x_i} \, \mathrm{Pois}(y_i; \theta_h)}{\sum_{q=1}^{H} \pi_{q x_i} \, \mathrm{Pois}(y_i; \theta_q)},$$
for every $h = 1, \ldots, H$. From $G_i$ derive the associated $z_{ih}$ indicators.
For $h$ from 1 to $H - 1$: update the logit stick-breaking parameters $\alpha_h$. For every $i$ such that $G_i > h - 1$, sample the Pólya-gamma data $\omega_{ih}$ from
$$(\omega_{ih} \mid -) \sim \mathrm{PG}\{1, \psi(x_i)^\intercal \alpha_h\}.$$
Given the Pólya-gamma augmented data, update $\alpha_h$ from the full conditional
$$(\alpha_h \mid -) \sim N_J(\mu_{\alpha_h}, \Sigma_{\alpha_h}),$$
as in a standard Bayesian linear regression.
For $h$ from 1 to $H$: update each kernel parameter $\theta_h$ from
$$(\theta_h \mid -) \sim \mathrm{Gamma}\Big(a_\theta + \sum_{i: G_i = h} y_i, \; b_\theta + \sum_{i=1}^{n} \mathbb{I}(G_i = h)\Big).$$
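To make the sweep concrete, here is a compact Python sketch of one Gibbs iteration for a truncated LSBP Poisson mixture. It is an illustration under stated assumptions, not the authors' code: $\psi(x_i)$ is taken as a one-hot vector, the prior is $\alpha_h \sim N_J(\texttt{mu0}, \texttt{Q0}^{-1})$ with prior precision Q0, and PG draws are delegated to random_polyagamma from the third-party polyagamma package (an assumption; any PG(1, z) sampler would do):

```python
import numpy as np
from polyagamma import random_polyagamma   # assumed external PG(1, z) sampler

def gibbs_step(y, x, alpha, theta, a_th, b_th, mu0, Q0, rng):
    """One sweep of the sketched Gibbs sampler (illustrative, simplified)."""
    n, (H, J) = len(y), alpha.shape          # last row of alpha unused: nu_H = 1
    nu = 1.0 / (1.0 + np.exp(-alpha))
    nu[-1, :] = 1.0                          # truncation
    pi = nu * np.vstack([np.ones(J), np.cumprod(1.0 - nu[:-1, :], axis=0)])

    # 1. update the component indicators G_i
    logp = y[:, None] * np.log(theta)[None, :] - theta[None, :]  # Poisson, up to const.
    w = pi[:, x].T * np.exp(logp - logp.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    G = np.array([rng.choice(H, p=w[i]) for i in range(n)])

    # 2. update alpha_h via Polya-gamma augmentation, h = 1, ..., H-1
    for h in range(H - 1):
        active = G >= h                      # units with G_i > h - 1 (0-based h)
        z = (G[active] == h).astype(float)   # z_{ih} indicators
        Psi = np.eye(J)[x[active]]           # psi(x_i) as one-hot rows
        omega = random_polyagamma(1, Psi @ alpha[h])
        V = np.linalg.inv(Psi.T @ (omega[:, None] * Psi) + Q0)   # posterior cov.
        m = V @ (Psi.T @ (z - 0.5) + Q0 @ mu0)                   # posterior mean
        alpha[h] = rng.multivariate_normal(m, V)

    # 3. update the Poisson atoms theta_h
    for h in range(H):
        sel = G == h
        theta[h] = rng.gamma(a_th + y[sel].sum(), 1.0 / (b_th + sel.sum()))
    return G, alpha, theta
```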
20. Illustration
Application to the seizure dataset
We apply the LSBP Poisson mixture model to the seizure dataset, which is also available in the flexmix R package.
The dataset consists of daily myoclonic seizure counts (seizures) for a single subject, comprising a series of n = 140 days.
After 27 days of baseline observation (Treatment: No), the subject received monthly infusions of intravenous gamma globulin (Treatment: Yes).
We aim to compare the J = 2 groups: days with treatment and days without treatment.
22. Discussion and conclusions
Possible extensions
The LSBP for partially exchangeable random variables could be used as a building block for more sophisticated models.
For instance, one could use the partially exchangeable LSBP as a prior for infinite hidden Markov models or for topic modeling, where the HDP is usually employed.
The computational advantages of the LSBP might lead to major improvements in those settings.
23. Discussion and conclusions
Summary
We proposed a Bayesian nonparametric mixture model for partially exchangeable count data.
We explored some of its theoretical properties and developed a simple Gibbs sampler for posterior inference.
References
Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.
Ren, L., Du, L., Carin, L. and Dunson, D. B. (2011). Logistic stick-breaking process. Journal of Machine Learning Research, 12, 203–239.
Rigon, T. and Durante, D. (2018). Logit stick-breaking priors for Bayesian density regression. arXiv.
Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1), 145–178.
Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.