Bayesian regression models and treed
Gaussian process models
Tommaso Rigon
22-03-2017
Tommaso Rigon TGP 2017 1 / 36
A general regression problem
Let $(x_i, y_i)$, for $i = 1, \dots, n$, denote the pairs of covariates and associated response. Suppose that
$$y_i \mid x_i \overset{\text{ind}}{\sim} f(y_i \mid x_i),$$
independently for $i = 1, \dots, n$, according to some unknown distribution function $f(y_i \mid x_i)$, with $y_i \in \mathbb{R}$.
Examples
Linear regression: $y_i = x_i^T \beta + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$;
Heteroscedastic errors: $y_i = x_i^T \beta + \epsilon_i(x_i)$, with $\epsilon_i(x_i) \sim N(0, \sigma^2(x_i))$;
Mean regression: $y_i = g(x_i) + \epsilon_i$, for some unknown function $g$, for instance of the form $g(x_i) = h(x_i)^T \beta$.
Tommaso Rigon TGP 2017 2 / 36
The motorcycle accident dataset
[Figure: scatterplot of the motorcycle data, accel (g) versus times (milliseconds).]
Tommaso Rigon TGP 2017 3 / 36
Desiderata
We are looking for a regression model that:
provides a flexible estimate of the mean process (e.g. not necessarily linear);
adaptively quantifies the uncertainty of the process as the covariates vary (e.g. not homoscedastic);
can be easily extended to the multivariate setting $x_i = (x_{i1}, \dots, x_{ip})^T$;
is computationally feasible for large $n$ or large $p$;
has a reasonable interpretation, and therefore can incorporate prior information, if available (e.g. Bayesian).
Tommaso Rigon TGP 2017 4 / 36
Bayesian methods for adaptive regression
Bayesian regression models include:
Linear models. Bayesian linear model, Bayesian lasso, Bayesian
elastic-net.
Basis expansion models. e.g. Bayesian splines (regression, penalized
and smoothing splines), Bayesian MARS, wavelets and Gaussian
processes.
Hard partitioning models. e.g. Bayesian CART, Bayesian additive
regression trees (BART), treed Gaussian processes.
Soft partitioning models. e.g. Bayesian mixture of experts, dependent
Dirichlet process, logit stick-breaking process.
Machine learning methods. e.g. support vector machine, Bayesian
neural networks.
Tommaso Rigon TGP 2017 5 / 36
Bayesian linear regression
Bayesian linear regression
Regression model and prior distribution
Independently for $i = 1, \dots, n$ we let
$$y_i \mid x_i, \beta, \sigma^2 \sim N(x_i^T \beta, \sigma^2), \qquad \beta \sim N_p(b, B), \qquad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma).$$
Full conditionals
Let $y = (y_1, \dots, y_n)$ and let $X$ be the design matrix collecting all the $x_i$. The Gibbs sampler alternates between
$$\beta \mid - \sim N_p(\tilde{b}, \tilde{B}), \qquad \tilde{b} = \tilde{B}\left(B^{-1} b + \frac{1}{\sigma^2} X^T y\right), \qquad \tilde{B} = \left(B^{-1} + \frac{1}{\sigma^2} X^T X\right)^{-1},$$
and
$$\sigma^{-2} \mid - \sim \text{Gamma}(\tilde{a}_\sigma, \tilde{b}_\sigma), \qquad \tilde{a}_\sigma = a_\sigma + \frac{n}{2}, \qquad \tilde{b}_\sigma = b_\sigma + \frac{1}{2}(y - X\beta)^T (y - X\beta).$$
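A minimal sketch of this Gibbs sampler in R, run on simulated data; the simulated design, the prior values, and the number of iterations are illustrative assumptions, not taken from the slides.

```r
## Gibbs sampler for the Bayesian linear model (minimal sketch, simulated data)
set.seed(123)
n <- 100; p <- 2
X <- cbind(1, rnorm(n))                       # design matrix with intercept
beta_true <- c(1, 2)
y <- drop(X %*% beta_true + rnorm(n, sd = 0.5))

b <- rep(0, p); B <- diag(100, p)             # prior: beta ~ N_p(b, B)
a_sigma <- 2; b_sigma <- 1                    # prior: sigma^-2 ~ Gamma(a_sigma, b_sigma)
B_inv <- solve(B)

n_iter <- 2000
beta <- rep(0, p); sigma2 <- 1
draws <- matrix(NA, n_iter, p + 1)
for (t in 1:n_iter) {
  # beta | - ~ N_p(b_tilde, B_tilde)
  B_tilde <- solve(B_inv + crossprod(X) / sigma2)
  b_tilde <- B_tilde %*% (B_inv %*% b + crossprod(X, y) / sigma2)
  beta <- drop(b_tilde + t(chol(B_tilde)) %*% rnorm(p))
  # sigma^-2 | - ~ Gamma(a_sigma + n/2, b_sigma + RSS/2)
  rss <- sum((y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, a_sigma + n / 2, rate = b_sigma + rss / 2)
  draws[t, ] <- c(beta, sigma2)
}
colMeans(draws[-(1:500), ])                   # posterior means after burn-in
```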
Tommaso Rigon TGP 2017 6 / 36
Bayesian linear regression
Bayesian regression with basis expansions
Weighted sum of basis functions
The linear model can be easily extended to
$$y_i \mid h(x_i), \beta, \sigma^2 \sim N(h(x_i)^T \beta, \sigma^2), \qquad \beta \sim N_M(b, B), \qquad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma),$$
where the vector $h(x_i) = (h_1(x_i), \dots, h_M(x_i))^T$ is a prespecified set of basis functions, with $h_m(\cdot): \mathbb{R}^p \to \mathbb{R}$ for $m = 1, \dots, M$.
Basis expansion when p = 1
Possible choices for the basis are:
Polynomial basis expansion: $h(x_i) = (x_i, x_i^2, \dots, x_i^M)^T$.
Spline bases (regression splines, penalized splines, smoothing splines).
Wavelets.
Gaussian radial basis functions.
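As an illustration of the last option, here is a sketch that builds a Gaussian radial-basis design matrix for $p = 1$; the knot grid and the bandwidth are arbitrary assumptions, and the resulting matrix $H$ can replace $X$ in the Gibbs sampler above.

```r
## Gaussian radial-basis design matrix for p = 1 (knots and scale are assumed values)
rbf_design <- function(x, knots, scale) {
  # H[i, m] = exp(-(x_i - knot_m)^2 / (2 * scale^2))
  outer(x, knots, function(xi, km) exp(-(xi - km)^2 / (2 * scale^2)))
}
x <- seq(0, 60, length.out = 50)
H <- cbind(1, rbf_design(x, knots = seq(0, 60, by = 10), scale = 5))
dim(H)   # n x M design matrix; plug H in place of X in the Gibbs sampler above
```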
Tommaso Rigon TGP 2017 7 / 36
Bayesian linear regression
P-splines: posterior mean
Tommaso Rigon TGP 2017 8 / 36
Bayesian linear regression
A limitation of Bayesian P-splines
Posterior variance of the linear predictor
Suppose $\sigma^2$ is treated as known and let $H$ be the design matrix induced by $h(x_i)$. Then, the posterior variance of $\eta_i = h(x_i)^T \beta$ is
$$\text{Var}(\eta_i \mid y) = h(x_i)^T \tilde{B} h(x_i), \qquad \text{with } \tilde{B} = \left(B^{-1} + \frac{1}{\sigma^2} H^T H\right)^{-1}.$$
Thus, the posterior variability is not constant over $x_i$, but it does not depend on the response, being obtained as a function of $H$, $B$, and $\sigma^2$.
But $\sigma^2$ is indeed unknown...
Even if we put a prior on $\sigma^2$, the posterior variance will indeed depend on the data, but only through $\sigma^2$, which will control the global variability and will not capture local variability.
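A small sketch of this computation, with $\sigma^2$ held fixed; the cubic polynomial basis and the prior covariance below are illustrative assumptions.

```r
## Pointwise posterior variance of eta_i = h(x_i)^T beta with sigma^2 treated as known (sketch)
x <- seq(0, 1, length.out = 100)
H <- cbind(1, x, x^2, x^3)                    # design matrix induced by h(x)
sigma2 <- 0.25                                # assumed known here
B <- diag(10, ncol(H))                        # prior covariance of beta
B_tilde <- solve(solve(B) + crossprod(H) / sigma2)
var_eta <- rowSums((H %*% B_tilde) * H)       # diag(H %*% B_tilde %*% t(H)), one value per x_i
plot(x, var_eta, type = "l")                  # varies over x, but never involves the response y
```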
Tommaso Rigon TGP 2017 9 / 36
Gaussian Processes
Gaussian processes
Definition (Rasmussen and Williams, 2006)
A Gaussian process $f(x)$, or $f$ for short, is a stochastic process defined on $\mathcal{X} = \mathbb{R}^p$ such that all of its finite-dimensional distributions are jointly Gaussian.
A Gaussian process is completely specified by its mean function
$$m(x) = \mathbb{E}(f(x)), \qquad \forall x \in \mathcal{X},$$
and its covariance function
$$k(x, x') = \text{Cov}(f(x), f(x')), \qquad \forall x, x' \in \mathcal{X}.$$
We will write
$$f \sim \mathcal{GP}(m(x), k(x, x')).$$
Tommaso Rigon TGP 2017 10 / 36
Gaussian Processes
Gaussian processes
Finite dimensional distributions
By definition, for each finite collection of points x = (x1, . . . , xn) all in X we
have that the vector f(x) = (f(x1), . . . , f(xn)) is distributed according to
f(x) ∼ Nn(m(x), K(x, x)),
where m(x) is the n-dimensional mean vector generated by the mean
function and K(x, x) is a n × n covariance matrix generated by k(·, ·) by
setting [K(x, x)]ij = k(xi, xj), for i, j = 1, . . . , n.
Notation
Let x and x be two collections of points in X and let f(x) and f(x ) be the
associated random variables. We let
Cov (f(x), f(x )) = K(x, x ).
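A sketch of this finite-dimensional view: drawing one path of a zero-mean GP on a grid, using a squared exponential covariance function (introduced a few slides below); the kernel parameters and the jitter term are illustrative assumptions.

```r
## Sampling from the finite-dimensional distribution of a zero-mean GP (sketch)
sq_exp_kernel <- function(x1, x2, tau = 1, l = 0.2) {
  outer(x1, x2, function(a, b) tau^2 * exp(-(a - b)^2 / (2 * l^2)))
}
x <- seq(0, 1, length.out = 200)
K <- sq_exp_kernel(x, x) + diag(1e-8, length(x))   # small jitter for numerical stability
f <- drop(t(chol(K)) %*% rnorm(length(x)))         # one sample path: f(x) ~ N_n(0, K)
plot(x, f, type = "l")
```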
Tommaso Rigon TGP 2017 11 / 36
Gaussian Processes
Covariance functions
Positive semi-definite functions
For $k(\cdot, \cdot)$ to be a valid covariance function it has to be a symmetric positive semi-definite function, that is,
$$k(x, x') = k(x', x), \qquad \forall x, x' \in \mathcal{X},$$
and, for any choice of $n$, any $\alpha \in \mathbb{R}^n$, and any $x = (x_1, \dots, x_n)$, it should hold that
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K(x, x)\, \alpha \geq 0.$$
Tommaso Rigon TGP 2017 12 / 36
Gaussian Processes
Covariance functions
Stationarity
A covariance function $k(\cdot, \cdot)$ is called stationary if it is invariant under any arbitrary translation $t \in \mathcal{X}$, so that
$$k(x + t, x' + t) = k(x, x'), \qquad \forall x, x' \in \mathcal{X},$$
that is, it is a function of $x - x'$ only. A Gaussian process with constant mean function is strictly stationary if its covariance function is stationary.
Isotropy
A stationary covariance function is called isotropic if it is a function only of the Euclidean distance between $x$ and $x'$, that is,
$$k(x, x') = k(r), \qquad r = \|x - x'\|, \qquad \forall x, x' \in \mathcal{X}.$$
Tommaso Rigon TGP 2017 13 / 36
Gaussian Processes
Examples of covariance functions
Squared exponential
The squared exponential covariance function has the form
$$k_{\text{SE}}(r) = \tau^2 \exp\left(-\frac{r^2}{2 l^2}\right), \qquad \tau, l > 0.$$
Although such a covariance function makes the process very smooth, it is the most widely used kernel within the kernel-machines field.
Matérn class of covariance functions
$$k_{\text{Mat}}(r) = \tau^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{l}\right),$$
with positive parameters $\nu$ and $l$, where $K_\nu$ is a modified Bessel function. For $\nu \to \infty$ the squared exponential is recovered.
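Both kernels written as R functions of the distance $r$; a rough sketch, with illustrative parameter values that are not taken from the slides.

```r
## Squared exponential and Matern covariance functions of the distance r (sketch)
k_se <- function(r, tau = 1, l = 0.2) tau^2 * exp(-r^2 / (2 * l^2))

k_matern <- function(r, tau = 1, l = 0.2, nu = 1.5) {
  out <- numeric(length(r))
  z <- sqrt(2 * nu) * r / l
  pos <- r > 0
  out[pos] <- tau^2 * 2^(1 - nu) / gamma(nu) * z[pos]^nu * besselK(z[pos], nu)
  out[!pos] <- tau^2                        # limit of the expression as r -> 0
  out
}
r <- seq(0, 1, length.out = 100)
plot(r, k_se(r), type = "l")
lines(r, k_matern(r, nu = 0.5), lty = 2)    # nu = 0.5 gives the exponential kernel
```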
Tommaso Rigon TGP 2017 14 / 36
Gaussian Processes
Relationship with other methods
Bayesian linear regression
A Bayesian linear model with Gaussian parameters is a degenerate GP, having
$$m(x) = h(x)^T b, \qquad k(x, x') = h(x)^T B\, h(x').$$
A GP is said to be non-degenerate if its covariance function is positive definite.
Smoothing splines (Wahba, 1978)
Smoothing splines can be recast as a regression with a partially proper GP prior with a specific covariance matrix.
SVM and neural networks
Further connections with support vector machines and Bayesian neural networks can be established (Neal 1996, Rasmussen and Williams 2006).
Tommaso Rigon TGP 2017 15 / 36
Gaussian Processes
A naive Bayesian model via GPs
We assume that the response $y$ depends on the multivariate covariates $x$ through the following functional specification
$$y(x) = f(x) + \epsilon(x), \qquad \forall x \in \mathcal{X},$$
where $f(x)$ represents the signal and $\epsilon(x)$ the noise.
Despite the functional specification, we can observe only a finite number of values $(x_i, y_i)$. Assuming, additionally, that $\epsilon \sim \mathcal{GP}(0, k_\epsilon(x, x'))$ with $k_\epsilon(x, x') = \sigma^2 \mathbb{1}(x = x')$, we obtain
$$y_i \mid f(x_i), \sigma^2 \overset{\text{ind}}{\sim} N(f(x_i), \sigma^2), \qquad i = 1, \dots, n,$$
independently. The elicitation is completed by specifying a functional prior $f \sim \mathcal{GP}(0, k(x, x'))$ and an inverse gamma prior for $\sigma^2$, as before.
Tommaso Rigon TGP 2017 16 / 36
Gaussian Processes
Posterior inference via Gibbs sampling
Full conditionals (kriging equations)
The Gibbs sampler alternates between
$$f(x) \mid y, \sigma^2 \sim N_n(\tilde{m}(x), \tilde{K}(x, x)) \qquad \text{and} \qquad \sigma^{-2} \mid y, f(x) \sim \text{Gamma}(\tilde{a}_\sigma, \tilde{b}_\sigma),$$
where
$$\tilde{m}(x) = K(x, x)\left(K(x, x) + \sigma^2 I_n\right)^{-1} y = \frac{1}{\sigma^2} \tilde{K}(x, x)\, y$$
and
$$\tilde{K}(x, x) = \left(K(x, x)^{-1} + \frac{1}{\sigma^2} I_n\right)^{-1} = K(x, x) - K(x, x)\left(K(x, x) + \sigma^2 I_n\right)^{-1} K(x, x).$$
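A minimal sketch of the kriging step above: computing $\tilde{m}(x)$ and $\tilde{K}(x, x)$ for a fixed $\sigma^2$, with simulated data and a squared exponential kernel; all the numerical settings are illustrative assumptions.

```r
## Kriging equations for fixed sigma^2 (sketch on simulated data)
set.seed(1)
sq_exp_kernel <- function(x1, x2, tau = 1, l = 0.2) {
  outer(x1, x2, function(a, b) tau^2 * exp(-(a - b)^2 / (2 * l^2)))
}
x <- sort(runif(40))
y <- sin(2 * pi * x) + rnorm(40, sd = 0.3)
sigma2 <- 0.3^2
K <- sq_exp_kernel(x, x)
A <- solve(K + sigma2 * diag(length(x)))     # (K(x, x) + sigma^2 I_n)^{-1}
m_tilde <- K %*% A %*% y                     # posterior mean of f(x)
K_tilde <- K - K %*% A %*% K                 # posterior covariance of f(x)
```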
Tommaso Rigon TGP 2017 17 / 36
Gaussian Processes
A more complete specification
Incorporating basis expansions (Blight and Ott, 1975; O'Hagan, 1978)
We assume that
$$y(x) = h(x)^T \beta + f(x) + \epsilon(x), \qquad \forall x \in \mathcal{X},$$
where $h(x)^T \beta$ and $f(x)$ represent the signal and $\epsilon(x)$ the noise.
Moreover, let $f(x) \sim \mathcal{GP}(0, \tau^2 k^*(x, x' \mid l))$, where $k^*(x, x' \mid l)$ is a correlation function, and $\epsilon \sim \mathcal{GP}(0, k_\epsilon(x, x' \mid g))$ with $k_\epsilon(x, x' \mid g) = g \tau^2 \mathbb{1}(x = x')$, so that
$$y(x) \mid \beta, \tau^2, g \sim \mathcal{GP}\left(h(x)^T \beta,\; \tau^2 k(x, x' \mid g, l)\right),$$
where $k(x, x' \mid g, l) = k^*(x, x' \mid l) + k_\epsilon(x, x' \mid g)$. Finally, we let
$$\beta \mid \tau^2 \sim N_M(b, \tau^2 B), \qquad \tau^{-2} \sim \text{Gamma}(a_\tau, b_\tau).$$
Tommaso Rigon TGP 2017 18 / 36
Gaussian Processes
Some difficulties
Covariance parameters (g, l)
The parameters $(g, l)$ play a crucial role in fitting a GP, but their prior elicitation is more delicate. Gramacy and Lee (2008) propose
$$p(g, l) = p(g)\, p(l) = p(g)\left[\frac{1}{2}\text{Gamma}(l \mid 1, 20) + \frac{1}{2}\text{Gamma}(l \mid 10, 10)\right],$$
with $p(g) = \text{Exp}(g \mid a_g)$.
Metropolis-Hastings within Gibbs for (g, l)
Analytically integrating out $\beta$ and $\tau^2$ gives a marginal posterior for $K(x, x \mid g, l)$ (Berger et al., 2001), which can be used to obtain efficient MH draws.
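A small sketch of the mixture prior $p(l)$, which reproduces the density displayed on the next slide; the Gamma(shape, rate) parameterization is an assumption.

```r
## Mixture prior p(l) of Gramacy and Lee (2008); shape-rate parameterization assumed
p_l <- function(l) 0.5 * dgamma(l, shape = 1, rate = 20) + 0.5 * dgamma(l, shape = 10, rate = 10)
l_grid <- seq(0, 2, length.out = 200)
plot(l_grid, p_l(l_grid), type = "l", xlab = "l", ylab = "p(l)")
```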
Tommaso Rigon TGP 2017 19 / 36
Gaussian Processes
Prior distribution p(l)
[Figure: density of the mixture prior p(l), plotted for l between 0 and 2.]
Tommaso Rigon TGP 2017 20 / 36
Gaussian Processes
Application to the motorcycle dataset
[Figure: GP fit to the motorcycle data, accel (g) versus times (milliseconds).]
Tommaso Rigon TGP 2017 21 / 36
Gaussian Processes
Brief summary about Gaussian Processes
Advantages
GPs are a powerful tool for nonparametric regression.
GPs are conceptually straightforward and can easily accommodate prior knowledge.
Uncertainty quantification, e.g. through posterior credible intervals, can be easily taken into account.
Disadvantages
GP models are usually "stationary", meaning that the same covariance function is used throughout $\mathcal{X}$, which may be a strong assumption. Moreover, non-stationary models are often computationally intractable.
Although some fast approximations exist, fitting a GP usually requires the inversion of $n \times n$ matrices, which has a computing time of $O(n^3)$.
Tommaso Rigon TGP 2017 22 / 36
Bayesian CART
Classification and Regression Trees (CART)
Regression trees (Breiman et al. 1984)
CART is a regression method that recursively partitions the predictor space into subsets, usually through a greedy algorithm, so that the distribution of $y$ becomes more and more homogeneous within each subset.
...and related methods
Modifications and extensions of CART (e.g. MARS, random forests, AdaBoost and gradient boosting) are perhaps among the most widely used tools for regression in the machine learning community.
Bayesian CART (Chipman et al. 1998, Denison et al. 1998)
Bayesian modifications of CART were later proposed, introducing a prior distribution on the partition of the covariate space.
Tommaso Rigon TGP 2017 23 / 36
Bayesian CART
Bayesian CART model
A binary tree $\mathcal{T}$ subdivides $\mathcal{X}$ into $S$ non-overlapping regions $\{R_1, \dots, R_S\}$, so that $\mathcal{X} = \bigcup_{s=1}^S R_s$. This is obtained recursively, splitting at each step a previously obtained region into sub-regions. Each region $R_s$ contains data $(X_s, y_s)$, comprising a total of $n_s$ observations, for $s = 1, \dots, S$.
Conditionally on the tree structure, the CART model assumes independence across regions and across observations, that is,
$$y_{is} \mid \theta, \mathcal{T} \overset{\text{ind}}{\sim} N(y_{is} \mid \mu_s, \sigma^2_s), \qquad i = 1, \dots, n_s, \quad s = 1, \dots, S.$$
Prior distributions for $\theta = (\mu, \sigma^2)$ are given conditionally on $\mathcal{T}$, i.e. following the standard Gaussian - inverse gamma specification.
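A sketch of the conjugate update implied by the Gaussian - inverse gamma specification within a single region $R_s$; the exact prior parameterization below (Normal - inverse gamma with parameters $m_0, \kappa_0, a_0, b_0$) and its values are assumptions made for illustration.

```r
## Conjugate Normal - inverse gamma update within one region R_s (sketch)
## Prior assumed: mu_s | sigma_s^2 ~ N(m0, sigma_s^2 / k0), sigma_s^2 ~ IG(a0, b0)
region_posterior <- function(y, m0 = 0, k0 = 0.1, a0 = 2, b0 = 1) {
  n <- length(y); ybar <- mean(y)
  kn <- k0 + n
  mn <- (k0 * m0 + n * ybar) / kn
  an <- a0 + n / 2
  bn <- b0 + 0.5 * sum((y - ybar)^2) + k0 * n * (ybar - m0)^2 / (2 * kn)
  # mu_s | sigma_s^2, y ~ N(mn, sigma_s^2 / kn);  sigma_s^2 | y ~ IG(an, bn)
  list(mn = mn, kn = kn, an = an, bn = bn)
}
region_posterior(rnorm(20, mean = 1, sd = 0.5))
```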
Tommaso Rigon TGP 2017 24 / 36
Bayesian CART
An example of tree partitioning
[Figure: an example of tree partitioning. Internal nodes split on x1 at 24.2, 13.8, 17.6, and 38; the five terminal nodes contain 20, 11, 15, 28, and 20 observations, with reported values 1e-04, 0.0074, 0.0033, 0.0724, and 0.0053. A tree of height 4, log(p) = 107.594.]
Tommaso Rigon TGP 2017 25 / 36
Bayesian CART
A tree prior for T
An implicitly defined tree prior (Chipman et al., 1998)
The prior stochastic process for $\mathcal{T}$ is described here in a recursive manner.
1. Begin by setting $\mathcal{T}$ to be the trivial tree with a single region $R = \mathcal{X}$.
2. Each terminal region $R_s$ splits into $R_{s1} \cup R_{s2}$ with probability $a(1 + q_{R_s})^{-b}$, where $q_{R_s}$ is the depth of $R_s$, i.e. the number of splits above $R_s$. The split rule is chosen uniformly at random among the values of the observed covariates $X$.
3. If new regions are created, repeat step 2 until the process stops.
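A minimal sketch simulating the number of terminal nodes implied by this prior (the quantity displayed on the next slide); it tracks node depths only and ignores the data-dependent constraints on the available split values, an assumption made for simplicity.

```r
## Number of terminal nodes under the tree prior (sketch; split locations ignored)
sim_terminal_nodes <- function(a, b) {
  frontier <- 0                    # depths of nodes not yet tested for splitting (root at depth 0)
  terminal <- integer(0)
  while (length(frontier) > 0) {
    split <- runif(length(frontier)) < a * (1 + frontier)^(-b)
    terminal <- c(terminal, frontier[!split])        # nodes that do not split become terminal
    frontier <- rep(frontier[split] + 1, each = 2)   # each split creates two children, one level deeper
  }
  length(terminal)
}
set.seed(42)
table(replicate(5000, sim_terminal_nodes(a = 0.5, b = 0.5)))
```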
Tommaso Rigon TGP 2017 26 / 36
Bayesian CART
Prior distribution: # of terminal nodes
[Figure: prior distributions of the number of terminal nodes for four settings: alpha=0.5 and beta=0.5; alpha=0.95 and beta=0.5; alpha=0.95 and beta=1; alpha=0.95 and beta=1.5.]
Tommaso Rigon TGP 2017 27 / 36
Bayesian CART
Posterior inference
An RJ-MCMC algorithm (Chipman et al. 1998)
A reversible jump Metropolis-Hastings algorithm is used for posterior
computation, which involves the following reversible steps
GROW. Randomly pick a terminal node and split it into two new ones.
PRUNE. Randomly pick a parent of two terminal nodes and turn it into a
terminal node.
CHANGE. Randomly pick an internal node and randomly reassign it a
splitting rule.
SWAP. Randomly pick a parent-child pair and swap their splitting rule.
Tommaso Rigon TGP 2017 28 / 36
Bayesian CART
Application to the motorcycle dataset
[Figure: Bayesian CART fit to the motorcycle data, accel (g) versus times (milliseconds).]
Tommaso Rigon TGP 2017 29 / 36
Bayesian CART
Limitations and extensions
Potential limitations
Slow mixing. The MCMC procedure easily gets stuck in local modes, and it is therefore recommended to restart the chain several times to explore different solutions.
Difficulties in capturing smooth or even linear behaviour via piecewise constant functions.
Extensions
A treed Bayesian linear model was proposed by Chipman et al. (2002), in which each terminal node is assigned a linear model.
An additive Bayesian model based on Bayesian CART, called BART, was proposed by Chipman et al. (2010), which mostly solves the issues outlined above.
Tommaso Rigon TGP 2017 30 / 36
Treed Gaussian Process Model
Treed Gaussian Processes
The conditional model (Gramacy and Lee 2008)
Conditionally on a tree structure $\mathcal{T}$, the treed Gaussian process model assumes
$$y(x) = h(x)^T \beta_s + f_s(x) + \epsilon_s(x), \qquad \forall x \in R_s, \quad s = 1, \dots, S,$$
$$f_s(x) \overset{\text{ind}}{\sim} \mathcal{GP}(0, \tau^2_s k^*(x, x' \mid l_s)), \qquad \epsilon_s \overset{\text{ind}}{\sim} \mathcal{GP}(0, k_\epsilon(x, x' \mid g_s)),$$
where $h(x) = (1, x)^T$, and $k^*(x, x' \mid l_s)$ and $k_\epsilon(x, x' \mid g_s) = g_s \tau^2_s \mathbb{1}(x = x')$ are correlation functions. Moreover, we have
$$\beta_s \mid b_s, \tau^2_s, \kappa^2_s, B \overset{\text{ind}}{\sim} N_M(b_s, \tau^2_s \kappa^2_s B).$$
For the tree structure $\mathcal{T}$, the same prior as in Chipman et al. (1998) is assumed.
Tommaso Rigon TGP 2017 31 / 36
Treed Gaussian Process Model
More on prior elicitation
Hyperpriors...
The model elicitation is completed by assuming
$$b_s \overset{\text{i.i.d.}}{\sim} N_M(b_0, B_0), \qquad B_s^{-1} \overset{\text{i.i.d.}}{\sim} W((\rho V)^{-1}, \rho),$$
$$\tau^{-2}_s \overset{\text{i.i.d.}}{\sim} \text{Gamma}(a_\tau, b_\tau), \qquad \kappa^{-2}_s \overset{\text{i.i.d.}}{\sim} \text{Gamma}(a_\kappa, b_\kappa),$$
where $W$ denotes the Wishart distribution. Mixture priors for the parameters $(g_s, l_s)$, for $s = 1, \dots, S$, are also assumed, similarly to what was discussed before.
...and hyperparameters
Default values for the hyperparameters of the tree prior are suggested by the authors and are set to $a = 0.5$ and $b = 2$.
Tommaso Rigon TGP 2017 32 / 36
Treed Gaussian Process Model
Posterior inference
Posterior inference resembles the RJ-MCMC of Chipman et al. (1998), with an additional operation called ROTATE, which should improve the mixing by providing a more dynamic set of candidate nodes for pruning.
Conditionally on the tree structure, full conditional inference for most of the involved parameters is available, independently for each region, so that a Gibbs sampler can be set up.
Metropolis-Hastings within Gibbs steps are required for each pair $(g_s, l_s)$.
The usual predictive (kriging) equations are also available, conditionally on each region $R_s$.
Tommaso Rigon TGP 2017 33 / 36
Treed Gaussian Process Model
Implementation and software
Software (Gramacy 2007)
An R package called tgp was made available, which can handle all the models discussed so far and more, for instance allowing for parallelization of the MCMC chain.
The treed GP model is coded in a mixture of C and C++, wrapped by R commands. Centering and rescaling the inputs is recommended by the authors, so that default Metropolis-Hastings proposal distributions can be used.
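A minimal usage sketch of the tgp package on the motorcycle data; the prediction grid and the reliance on default settings are assumptions, not prescriptions from the slides.

```r
## Fitting a treed GP to the motorcycle data with the tgp package (sketch, default settings)
library(tgp)
library(MASS)                                  # provides the mcycle data
X  <- mcycle$times                             # input: time in milliseconds
Z  <- mcycle$accel                             # response: acceleration in g
XX <- seq(min(X), max(X), length.out = 200)    # prediction grid
fit <- btgp(X = X, Z = Z, XX = XX)             # treed Gaussian process model
plot(fit)                                      # posterior predictive mean and quantiles
```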
Tommaso Rigon TGP 2017 34 / 36
Treed Gaussian Process Model
Application to the motorcycle dataset
[Figure: treed GP fit to the motorcycle data, accel (g) versus times (milliseconds).]
Tommaso Rigon TGP 2017 35 / 36
Treed Gaussian Process Model
Discussion and possible extensions
Limiting linear models (Gramacy and Lee 2008)
In some cases a GP may not be needed within a partition, and a much simpler model, such as the linear model, may suffice. Linear models can be viewed as a particular case of a GP, and a model-switching prior distribution for the hyperparameters $(g, l)$ allows a practical implementation.
Treed GP for classification (Broderick and Gramacy, 2010)
Treed Gaussian processes can also be used when $y \in \{0, 1\}$, thereby assuming a flexible representation for $P(y_i = 1 \mid x_i)$.
A more complex input space $\mathcal{X}$ (Broderick and Gramacy, 2011)
So far we have assumed that $\mathcal{X} = \mathbb{R}^p$, but more complex input spaces could be of interest, for instance when some of the inputs are categorical.
Tommaso Rigon TGP 2017 36 / 36