Bayesian regression models and treed Gaussian process models
1. Bayesian regression models and treed Gaussian process models
Tommaso Rigon
22-03-2017
2. A general regression problem
Let $(x_i, y_i)$, for $i = 1, \ldots, n$, denote the collection of covariates together with the corresponding response variable. Suppose that
$$y_i \mid x_i \overset{\text{ind}}{\sim} f(y_i \mid x_i),$$
independently for $i = 1, \ldots, n$, according to some unknown distribution function $f(y_i \mid x_i)$, with $y_i \in \mathbb{R}$.
Examples
Linear regression: $y_i = x_i^T\beta + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$;
Heteroscedastic errors: $y_i = x_i^T\beta + \epsilon_i(x_i)$, with $\epsilon_i(x_i) \sim N(0, \sigma^2(x_i))$;
Mean regression: $y_i = g(x_i) + \epsilon_i$, for some unknown function $g$, for instance having the form $g(x_i) = h(x_i)^T\beta$.
4. Desiderata
We are looking for a regression model that:
provides a flexible estimate of the mean process (e.g. not linear);
adaptively quantifies the uncertainty of the process as the covariates vary (e.g. not homoscedastic);
can be easily extended to the multivariate setting $x_i = (x_{i1}, \ldots, x_{ip})^T$;
is computationally feasible for large n or large p;
has a reasonable interpretation and therefore
can incorporate prior information, if available (e.g. Bayesian).
5. Bayesian methods for adaptive regression
Bayesian regression models include
Linear models. Bayesian linear model, Bayesian lasso, Bayesian
elastic-net.
Basis expansion models. e.g. Bayesian splines (regression, penalized
and smoothing splines), Bayesian MARS, wavelets and Gaussian
processes.
Hard partitioning models. e.g. Bayesian CART, Bayesian additive
regression trees (BART), treed Gaussian processes.
Soft partitioning models. e.g. Bayesian mixture of experts, dependent
Dirichlet process, logit stick-breaking process.
Machine learning methods. e.g. support vector machine, Bayesian
neural networks.
6. Bayesian linear regression
Bayesian linear regression
Regression model and prior distribution
Independently for $i = 1, \ldots, n$ we let
$$y_i \mid x_i, \beta, \sigma^2 \sim N(x_i^T\beta, \sigma^2), \quad \beta \sim N_p(b, B), \quad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma).$$
Full conditionals
Let $y = (y_1, \ldots, y_n)$ and let $X$ be the design matrix collecting all the $x_i$; then the Gibbs sampler alternates between
$$\beta \mid - \sim N_p(\tilde b, \tilde B), \quad \tilde b = \tilde B\left(B^{-1}b + \frac{1}{\sigma^2}X^T y\right), \quad \tilde B = \left(B^{-1} + \frac{1}{\sigma^2}X^T X\right)^{-1},$$
and
$$\sigma^{-2} \mid - \sim \text{Gamma}(\tilde a_\sigma, \tilde b_\sigma), \quad \tilde a_\sigma = a_\sigma + \frac{n}{2}, \quad \tilde b_\sigma = b_\sigma + \frac{1}{2}(y - X\beta)^T(y - X\beta).$$
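As an illustration, a minimal Gibbs sampler implementing these two full conditionals might look as follows in R; this is a sketch on simulated data, and the prior values $b$, $B$, $a_\sigma$, $b_\sigma$ chosen here are arbitrary.

```r
## Illustrative Gibbs sampler for the Bayesian linear model (sketch; arbitrary priors).
set.seed(123)
n <- 100; p <- 2
X <- cbind(1, runif(n))
y <- as.numeric(X %*% c(1, 2) + rnorm(n, sd = 0.5))

b <- rep(0, p); B <- diag(10, p)           # prior for beta
a_sigma <- 2; b_sigma <- 1                 # prior for sigma^{-2}
B_inv <- solve(B)

R <- 2000
beta <- matrix(NA, R, p); sigma2 <- numeric(R)
sigma2_curr <- 1
for (r in 1:R) {
  # beta | - ~ N_p(b_tilde, B_tilde)
  B_tilde <- solve(B_inv + crossprod(X) / sigma2_curr)
  b_tilde <- B_tilde %*% (B_inv %*% b + crossprod(X, y) / sigma2_curr)
  beta_curr <- as.numeric(b_tilde + t(chol(B_tilde)) %*% rnorm(p))
  # sigma^{-2} | - ~ Gamma(a_sigma + n/2, b_sigma + 0.5 * RSS)
  resid <- y - X %*% beta_curr
  sigma2_curr <- 1 / rgamma(1, shape = a_sigma + n / 2,
                            rate  = b_sigma + 0.5 * sum(resid^2))
  beta[r, ] <- beta_curr; sigma2[r] <- sigma2_curr
}
colMeans(beta)   # posterior means of beta
```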
7. Bayesian linear regression
Bayesian regression with basis expansions
Weighted sum of basis functions
The linear model can be easily extended to
$$y_i \mid h(x_i), \beta, \sigma^2 \sim N(h(x_i)^T\beta, \sigma^2), \quad \beta \sim N_M(b, B), \quad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma),$$
where the vector $h(x_i) = (h_1(x_i), \ldots, h_M(x_i))^T$ is a prespecified set of basis functions, such that $h_m(\cdot): \mathbb{R}^p \to \mathbb{R}$, for $m = 1, \ldots, M$.
Basis expansion when p = 1
Possible choices for the basis functions are
Polynomial basis expansion: $h(x_i) = (x_i, x_i^2, \ldots, x_i^M)^T$.
Splines basis (regression splines, penalized splines, smoothing splines).
Wavelets.
Gaussian radial basis functions.
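For instance, with $p = 1$ the design matrix $H = (h(x_1), \ldots, h(x_n))^T$ induced by a polynomial or spline basis can be built directly in R, and the Gibbs sampler of the previous slide applies with $X$ replaced by $H$; the degrees of freedom used below are arbitrary choices.

```r
## Building the basis-expanded design matrix H for p = 1 (illustrative choices).
library(splines)
x <- seq(0, 1, length.out = 50)

H_poly   <- poly(x, degree = 4, raw = TRUE)   # polynomial basis (x, x^2, x^3, x^4)
H_spline <- bs(x, df = 8)                     # B-spline regression basis
dim(H_spline)                                 # n x M matrix of basis evaluations
```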
9. Bayesian linear regression
A limitation of Bayesian P-splines
Posterior variance of the linear predictor
Suppose $\sigma^2$ is treated as known and let $H$ be the design matrix induced by $h(x_i)$. Then, the posterior variance of $\eta_i = h(x_i)^T\beta$ is
$$\text{Var}(\eta_i \mid y) = h(x_i)^T \tilde B\, h(x_i), \quad \text{with} \quad \tilde B = \left(B^{-1} + \frac{1}{\sigma^2}H^T H\right)^{-1}.$$
Thus, the posterior variability is not constant over $x_i$, but it does not depend on the response, being obtained as a function of $H$, $B$, and $\sigma^2$ only.
But $\sigma^2$ is indeed unknown...
Even if we put a prior on $\sigma^2$, the posterior variance will then depend on the data, but only through $\sigma^2$, which controls the global variability and does not capture local variability.
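To see this numerically, $\text{Var}(\eta_i \mid y)$ can be computed directly; the sketch below uses an arbitrary prior covariance $B$ and a fixed $\sigma^2$, and note that the response $y$ never enters the computation.

```r
## Posterior variance of eta_i = h(x_i)^T beta with sigma^2 treated as known (sketch).
library(splines)
x <- seq(0, 1, length.out = 50)
H <- cbind(1, bs(x, df = 8))        # design matrix induced by h(x_i)
M <- ncol(H)
B <- diag(10, M); sigma2 <- 1       # arbitrary prior covariance and noise variance

B_tilde <- solve(solve(B) + crossprod(H) / sigma2)
var_eta <- rowSums((H %*% B_tilde) * H)   # h(x_i)^T B_tilde h(x_i), for each i
plot(x, var_eta, type = "l", ylab = "Var(eta_i | y)")
```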
10. Gaussian Processes
Gaussian processes
Definition (Rasmussen and Williams, 2006)
A Gaussian process $f(x)$, or $f$ for short, is a stochastic process defined on $\mathcal{X} = \mathbb{R}^p$ such that all its finite-dimensional distributions are jointly Gaussian.
A Gaussian process is completely specified by its mean function
$$m(x) = E(f(x)), \quad \forall x \in \mathcal{X},$$
and its covariance function
$$k(x, x') = \text{Cov}(f(x), f(x')), \quad \forall x, x' \in \mathcal{X}.$$
We will write $f \sim \mathcal{GP}(m(x), k(x, x'))$.
11. Gaussian Processes
Gaussian processes
Finite dimensional distributions
By definition, for each finite collection of points $x = (x_1, \ldots, x_n)$, all in $\mathcal{X}$, the vector $f(x) = (f(x_1), \ldots, f(x_n))$ is distributed as
$$f(x) \sim N_n(m(x), K(x, x)),$$
where $m(x)$ is the $n$-dimensional mean vector generated by the mean function and $K(x, x)$ is the $n \times n$ covariance matrix generated by $k(\cdot, \cdot)$, obtained by setting $[K(x, x)]_{ij} = k(x_i, x_j)$, for $i, j = 1, \ldots, n$.
Notation
Let $x$ and $x'$ be two collections of points in $\mathcal{X}$ and let $f(x)$ and $f(x')$ be the associated random vectors. We let
$$\text{Cov}(f(x), f(x')) = K(x, x').$$
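A draw from such a finite-dimensional distribution is easy to obtain: evaluate $m(x)$ and $K(x, x)$ on a grid and sample from the resulting multivariate Gaussian. The sketch below uses the squared exponential covariance function introduced a few slides ahead, with arbitrary values of $\tau$ and $l$ and a small jitter added for numerical stability.

```r
## Sampling f(x) ~ N_n(m(x), K(x, x)) on a grid (illustrative sketch).
set.seed(1)
x <- seq(0, 1, length.out = 200)
tau <- 1; l <- 0.1
K <- tau^2 * exp(-outer(x, x, "-")^2 / (2 * l^2))   # squared exponential kernel
K <- K + diag(1e-8, length(x))                      # jitter for numerical stability
m <- rep(0, length(x))
f <- m + t(chol(K)) %*% rnorm(length(x))            # one draw from the GP prior
plot(x, f, type = "l")
```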
12. Gaussian Processes
Covariance functions
Positive semi-definite functions
For $k(\cdot, \cdot)$ to be a valid covariance function it has to be a symmetric positive semi-definite function, that is
$$k(x, x') = k(x', x), \quad \forall x, x' \in \mathcal{X},$$
and for any choice of $n$, $\alpha \in \mathbb{R}^n$, and $x = (x_1, \ldots, x_n)$, it should hold that
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K(x, x)\,\alpha \ge 0.$$
13. Gaussian Processes
Covariance functions
Stationarity
A covariance function $k(\cdot, \cdot)$ is called stationary if it is invariant under any translation $t \in \mathcal{X}$, so that
$$k(x + t, x' + t) = k(x, x'), \quad \forall x, x' \in \mathcal{X},$$
that is, it is a function of $x - x'$ only. A Gaussian process with constant mean function is strictly stationary if its covariance function is stationary.
Isotropy
A stationary covariance function is called isotropic if it is a function only of the Euclidean distance between $x$ and $x'$, that is
$$k(x, x') = k(r), \quad r = \|x - x'\|, \quad \forall x, x' \in \mathcal{X}.$$
14. Gaussian Processes
Examples of covariance functions
Squared exponential
The squared exponential covariance function has the form
$$k_{\text{SE}}(r) = \tau^2 \exp\left(-\frac{r^2}{2l^2}\right), \quad \tau, l > 0.$$
Although such a covariance function makes the process very smooth, it is the most widely used kernel within the kernel machines field.
Matern class of covariance functions
$$k_{\text{Mat}}(r) = \tau^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_\nu\left(\frac{\sqrt{2\nu}\, r}{l}\right),$$
with positive parameters $\nu$ and $l$, where $K_\nu$ is a modified Bessel function. For $\nu \to \infty$ the squared exponential is recovered.
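Both covariance functions are straightforward to code as functions of the distance $r$; in the sketch below the Matérn case uses the modified Bessel function besselK, and $r = 0$ is handled separately since the expression above is indeterminate there.

```r
## Squared exponential and Matern covariance functions of the distance r (sketch).
k_se <- function(r, tau = 1, l = 1) tau^2 * exp(-r^2 / (2 * l^2))

k_matern <- function(r, tau = 1, l = 1, nu = 3/2) {
  out <- rep(tau^2, length(r))               # limit value at r = 0
  pos <- r > 0
  z <- sqrt(2 * nu) * r[pos] / l
  out[pos] <- tau^2 * 2^(1 - nu) / gamma(nu) * z^nu * besselK(z, nu)
  out
}

r <- seq(0, 3, length.out = 100)
matplot(r, cbind(k_se(r), k_matern(r, nu = 1/2), k_matern(r, nu = 5/2)),
        type = "l", ylab = "k(r)")
```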
15. Gaussian Processes
Relationship with other methods
Bayesian linear regression
A Bayesian linear model with Gaussian parameters is a degenerate GP, having
$$m(x) = h(x)^T b, \quad k(x, x') = h(x)^T B\, h(x').$$
A GP is said to be non-degenerate if its covariance function is positive definite.
Smoothing splines (Wahba, 1978)
Smoothing splines can be recast as a regression with a partially proper GP prior with a specific covariance matrix.
SVM and neural networks
Further connections with support vector machines and Bayesian neural networks can be established (Neal, 1996; Rasmussen and Williams, 2006).
16. Gaussian Processes
A naive Bayesian model via GPs
We assume that the response $y$ depends on the multivariate covariates $x$ through the following functional specification
$$y(x) = f(x) + \epsilon(x), \quad \forall x \in \mathcal{X},$$
where $f(x)$ represents the signal and $\epsilon(x)$ the noise.
Despite the functional specification, we can observe only a finite number of values $(x_i, y_i)$. Assuming, additionally, that $\epsilon \sim \mathcal{GP}(0, k_\epsilon(x, x'))$ with $k_\epsilon(x, x') = \sigma^2 I(x = x')$, we obtain
$$y_i \mid f(x_i), \sigma^2 \overset{\text{ind}}{\sim} N(f(x_i), \sigma^2), \quad i = 1, \ldots, n,$$
independently. The elicitation is completed by specifying a functional prior $f \sim \mathcal{GP}(0, k(x, x'))$ and an inverse gamma prior for $\sigma^2$, as before.
17. Gaussian Processes
Posterior inference via Gibbs sampling
Full conditionals (kriging equations)
The Gibbs sampler alternates between
$$f(x) \mid y, \sigma^2 \sim N_n(\tilde m(x), \tilde K(x, x)), \quad \text{and} \quad \sigma^{-2} \mid y, f(x) \sim \text{Gamma}(\tilde a_\sigma, \tilde b_\sigma),$$
where
$$\tilde m(x) = K(x, x)\left(K(x, x) + \sigma^2 I_n\right)^{-1} y = \frac{1}{\sigma^2}\tilde K(x, x)\, y$$
and
$$\tilde K(x, x) = \left(K(x, x)^{-1} + \frac{1}{\sigma^2} I_n\right)^{-1} = K(x, x) - K(x, x)\left(K(x, x) + \sigma^2 I_n\right)^{-1} K(x, x).$$
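A direct transcription of these kriging equations on simulated data might look as follows; this is a sketch with the squared exponential kernel and fixed, arbitrary hyperparameters, and a full sampler would also update $\sigma^{-2}$ from its Gamma full conditional.

```r
## Drawing f(x) | y, sigma^2 from the kriging full conditional (illustrative sketch).
set.seed(2)
n <- 80
x <- sort(runif(n)); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

tau <- 1; l <- 0.1; sigma2 <- 0.3^2                   # fixed for illustration
K <- tau^2 * exp(-outer(x, x, "-")^2 / (2 * l^2))     # K(x, x)

A       <- solve(K + sigma2 * diag(n))                # (K(x,x) + sigma^2 I_n)^{-1}
m_tilde <- K %*% A %*% y                              # posterior (kriging) mean
K_tilde <- K - K %*% A %*% K                          # posterior covariance
f_draw  <- m_tilde + t(chol(K_tilde + diag(1e-8, n))) %*% rnorm(n)

plot(x, y); lines(x, m_tilde, lwd = 2)
```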
18. Gaussian Processes
A more complete specification
Incorporating basis expansions (Blight and Ott, 1975, O’Hagan, 1978)
We assume that
$$y(x) = h(x)^T\beta + f(x) + \epsilon(x), \quad \forall x \in \mathcal{X},$$
where $h(x)^T\beta$ and $f(x)$ represent the signal and $\epsilon(x)$ the noise. Moreover, let $f(x) \sim \mathcal{GP}(0, \tau^2 k^*(x, x' \mid l))$, where $k^*(x, x' \mid l)$ is a correlation function, and $\epsilon \sim \mathcal{GP}(0, k_\epsilon(x, x' \mid g))$ with $k_\epsilon(x, x' \mid g) = g\tau^2 I(x = x')$, so that
$$y(x) \mid \beta, \tau^2, g \sim \mathcal{GP}\left(h(x)^T\beta,\; \tau^2 k(x, x' \mid g, l)\right),$$
where $k(x, x' \mid g, l) = k^*(x, x' \mid l) + g\, I(x = x')$. Finally, we let
$$\beta \mid \tau^2 \sim N_M(b, \tau^2 B), \quad \tau^{-2} \sim \text{Gamma}(a_\tau, b_\tau).$$
19. Gaussian Processes
Some difficulties
Covariance parameters (g, l)
The parameters $(g, l)$ play a crucial role in fitting a GP, but their prior elicitation is more delicate. Gramacy and Lee (2008) propose
$$p(g, l) = p(g)\, p(l) = p(g)\left[\frac{1}{2}\,\text{Gamma}(l \mid 1, 20) + \frac{1}{2}\,\text{Gamma}(l \mid 10, 10)\right],$$
with $p(g) = \text{Exp}(g \mid a_g)$.
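A possible R transcription of this mixture prior is given below; a shape/rate parametrization of the Gamma components is assumed here, and the value of $a_g$ is an arbitrary choice.

```r
## Mixture prior for the range parameter l and exponential prior for the nugget g
## (sketch; shape/rate parametrization of the Gamma components is assumed).
p_l <- function(l) 0.5 * dgamma(l, shape = 1,  rate = 20) +
                   0.5 * dgamma(l, shape = 10, rate = 10)
p_g <- function(g, a_g = 1) dexp(g, rate = a_g)   # a_g is an arbitrary value here

curve(p_l(x), from = 0, to = 3, xlab = "l", ylab = "p(l)")
```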
Metropolis-Hastings within Gibbs for (g, l)
Analytically integrating out $\beta$ and $\tau^2$ gives a marginal posterior for $K(x, x \mid g, l)$ (Berger et al., 2001), which can be used to obtain efficient MH draws.
21. Gaussian Processes
Application to the motorcycle dataset
[Figure: Gaussian process fit to the motorcycle data; accel (g) versus time (milliseconds).]
22. Gaussian Processes
Brief summary about Gaussian Processes
Advantages
GPs are a powerful tool for nonparametric regression.
GPs are conceptually straightforward and can easily accommodate prior knowledge.
Uncertainty quantification, e.g. via posterior credible intervals, is easily obtained.
Disadvantages
GP models are usually “stationary”, meaning that the same covariance
matrix is used throughout X, which may be a strong assumption.
Moreover, non-stationary models are often computationally intractable.
Although some fast approximations exist, fitting a GP usually requires the inversion of $n \times n$ matrices, which has a computing cost of $O(n^3)$.
23. Bayesian CART
Classification and Regression Trees (CART)
Regression trees (Breiman et al. 1984)
CART models are a regression method that recursively partitions the predictor space into subsets, usually through a greedy algorithm, so that the distribution of $y$ becomes more and more homogeneous.
...and related methods
Modifications and extensions of CART (e.g. MARS, random forests, AdaBoost and gradient boosting) are perhaps among the most widely used tools for regression in the machine learning community.
Bayesian CART (Chipman et al. 1998, Denison et al. 1998)
Bayesian modifications of CART were later proposed, introducing a prior
distribution on the partition of the covariates.
24. Bayesian CART
Bayesian CART model
A binary tree $T$ subdivides $\mathcal{X}$ into $S$ non-overlapping regions $\{R_1, \ldots, R_S\}$, so that $\mathcal{X} = \bigcup_{s=1}^S R_s$. This is obtained recursively, splitting at each step a previously obtained region into sub-regions. Each region $R_s$ contains data $(X_s, y_s)$, comprising a total of $n_s$ observations, for $s = 1, \ldots, S$.
Conditionally on the tree structure, the CART model assumes independence across regions and observations, that is
$$y_{is} \mid \theta, T \overset{\text{ind}}{\sim} N(y_{is} \mid \mu_s, \sigma_s^2), \quad i = 1, \ldots, n_s, \quad s = 1, \ldots, S.$$
Prior distributions for $\theta = (\mu, \sigma^2)$ are specified conditionally on $T$, following the standard Gaussian-inverse gamma specification.
25. Bayesian CART
An example of tree partitioning
[Figure: an example tree of height 4 partitioning the motorcycle data by splitting on x1 at 24.2, 13.8, 17.6 and 38, yielding five terminal regions with 20, 11, 15, 28 and 20 observations; log(p) = 107.594.]
26. Bayesian CART
A tree prior for T
An implicitly defined tree prior (Chipman et al., 1998)
The prior stochastic process for $T$ is described here in a recursive manner; a simulation sketch follows below.
1. Begin by setting $T$ to be the trivial tree with a single region $R = \mathcal{X}$.
2. Each terminal region $R_s$ splits into $R_{s1} \cup R_{s2}$ with probability $a(1 + q_{R_s})^{-b}$, where $q_{R_s}$ is the depth of $R_s$, i.e. the number of splits above $R_s$. The split rule is chosen randomly and uniformly among the values of the observed covariates $X$.
3. If new regions are created, repeat step 2 until the process stops.
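A minimal simulation of the induced prior on the number of terminal nodes is sketched below; split locations are ignored, since only the splitting probability $a(1 + q)^{-b}$ determines the tree size. Prior draws of this kind underlie histograms like those shown on the next slide.

```r
## Simulating the number of terminal nodes under the tree prior (sketch:
## only the split probability a * (1 + depth)^(-b) matters for the tree size).
sim_terminal_nodes <- function(a = 0.5, b = 2, max_depth = 30) {
  grow <- function(depth) {
    if (depth < max_depth && runif(1) < a * (1 + depth)^(-b))
      grow(depth + 1) + grow(depth + 1)   # node splits: count leaves of both children
    else
      1                                   # terminal node
  }
  grow(0)
}

set.seed(3)
sizes <- replicate(10000, sim_terminal_nodes(a = 0.95, b = 1))
table(sizes) / length(sizes)              # prior on the number of terminal nodes
```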
27. Bayesian CART
Prior distribution: number of terminal nodes
[Figure: four panels showing the prior probability of the number of terminal nodes, for (alpha, beta) equal to (0.5, 0.5), (0.95, 0.5), (0.95, 1) and (0.95, 1.5).]
28. Bayesian CART
Posterior inference
A RJ-MCMC algorithm (Chipman et al. 1998)
A reversible jump Metropolis-Hastings algorithm is used for posterior
computation, which involves the following reversible steps
GROW. Randomly pick a terminal node and split it into two new ones.
PRUNE. Randomly pick a parent of two terminal nodes and turn it into a
terminal node.
CHANGE. Randomly pick an internal node and randomly reassign it a
splitting rule.
SWAP. Randomly pick a parent-child pair and swap their splitting rules.
29. Bayesian CART
Application to the motorcycle dataset
[Figure: Bayesian CART fit to the motorcycle data; accel (g) versus time (milliseconds).]
30. Bayesian CART
Limitations and extensions
Potential limitations
Slow mixing. The MCMC procedure easily gets stuck in some local mode; it is therefore recommended to restart the chain several times to explore different solutions.
Difficulties in capturing smooth or even linear behaviour via piecewise
constant functions.
Extensions
A treed Bayesian linear model was proposed by Chipman et al. (2002), in which each terminal node is assigned a linear model.
An additive Bayesian model based on Bayesian CART, called BART, was proposed by Chipman et al. (2010), which mostly solves the issues outlined above.
31. Treed Gaussian Process Model
Treed Gaussian Processes
The conditional model (Gramacy and Lee 2008)
Conditionally on a tree structure $T$, the treed Gaussian process model assumes
$$y(x) = h(x)^T\beta_s + f_s(x) + \epsilon_s(x), \quad \forall x \in R_s, \quad s = 1, \ldots, S,$$
$$f_s(x) \overset{\text{ind}}{\sim} \mathcal{GP}(0, \tau_s^2 k^*(x, x' \mid l_s)), \quad \epsilon_s \overset{\text{ind}}{\sim} \mathcal{GP}(0, k_\epsilon(x, x' \mid g_s)),$$
where $h(x) = (1, x)^T$, $k^*(x, x' \mid l_s)$ is a correlation function, and $k_\epsilon(x, x' \mid g_s) = g_s\tau_s^2 I(x = x')$. Moreover, we have
$$\beta_s \mid b_s, \tau_s^2, \kappa_s^2, B \overset{\text{ind}}{\sim} N_M(b_s, \tau_s^2\kappa_s^2 B).$$
For the tree structure $T$, the same prior as in Chipman et al. (1998) is assumed.
32. Treed Gaussian Process Model
More on prior elicitation
Hyperpriors...
The model elicitation is completed by assuming
$$b_s \overset{\text{i.i.d.}}{\sim} N_M(b_0, B_0), \quad B_s^{-1} \overset{\text{i.i.d.}}{\sim} W((\rho V)^{-1}, \rho),$$
$$\tau_s^{-2} \overset{\text{i.i.d.}}{\sim} \text{Gamma}(a_\tau, b_\tau), \quad \kappa_s^{-2} \overset{\text{i.i.d.}}{\sim} \text{Gamma}(a_\kappa, b_\kappa),$$
where $W$ denotes the Wishart distribution. Mixture priors for the parameters $(g_s, l_s)$, for $s = 1, \ldots, S$, are also assumed, similarly to what was discussed before.
...and hyperparameters
Default values for the hyperparameters of the tree are suggested by the authors and set equal to $a = 0.5$ and $b = 2$.
33. Treed Gaussian Process Model
Posterior inference
Posterior inference resembles the RJ-MCMC of Chipman et al. (1998), with an additional operation called ROTATE, which should improve the mixing by providing a more dynamic set of candidate nodes for pruning.
Conditionally on the tree structure, full conditional inference for most of the involved parameters is available, independently for each region, so that a Gibbs sampler can be set up.
Metropolis-Hastings within Gibbs steps are required for each pair $(g_s, l_s)$.
The usual predictive (kriging) equations are also available, conditionally on each region $R_s$.
34. Treed Gaussian Process Model
Implementation and software
Software (Gramacy 2007)
An R package called tgp was made available, which can handle all the models discussed so far and more, for instance allowing for parallelization of the MCMC chain.
The treed GP model was coded in a mixture of C and C++, then wrapped by R commands. Centering and rescaling the input is recommended by the authors, so that default Metropolis-Hastings proposal distributions can be used.
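As an illustration of the package interface, a treed GP fit to the motorcycle data might look as follows; this is a sketch assuming the mcycle data set from the MASS package, and the tgp documentation covers the full set of options.

```r
## Treed GP fit to the motorcycle data with the tgp package (illustrative sketch).
library(tgp)
library(MASS)                      # provides the mcycle data set

X <- mcycle$times                  # input: time (ms)
Z <- mcycle$accel                  # response: acceleration (g)

fit <- btgp(X = X, Z = Z)          # Bayesian treed Gaussian process
plot(fit)                          # posterior mean, credible intervals, MAP tree
```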
35. Treed Gaussian Process Model
Application to the motorcycle dataset
[Figure: treed GP fit to the motorcycle data; accel (g) versus time (milliseconds).]
36. Treed Gaussian Process Model
Discussion and possible extensions
Limiting linear models (Gramacy and Lee 2008)
In some cases a GP may not be needed within a partition, and a much
simpler model, such as the linear model, may suffice. Linear models can
be viewed as a particular case of a GP, and a model-switching prior
distribution for the hyperparameters (g, l) would allow the practical
implementation.
Treed GP for classification (Broderick and Gramacy, 2010)
Treed Gaussian processes can also be used when $y \in \{0, 1\}$, thereby providing a flexible representation for $P(y_i = 1 \mid x_i)$.
A more complex input space X (Broderick and Gramacy, 2011)
So far we have assumed that $\mathcal{X} = \mathbb{R}^p$, but more complex input spaces could be of interest, for instance when some of the inputs are categorical.