1. Gaussian Process in Machine Learning
Subject: Machine Learning
Dr. Varun Kumar
2. Outline
1 Introduction to Gaussian Distributed Random Variable
2 Central Limit Theorem
3 MLE Vs MAP
4 Gaussian Process for Linear Regression
5 References
3. Introduction to Gaussian Distributed Random Variable (rv)
Gaussian distribution
1 The general expression for the PDF of a uni-variate Gaussian distributed random variable is
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
where σ → standard deviation, µ → mean, σ² → variance
2 The general expression for the PDF of a multi-variate Gaussian distributed random variable is
P(X; \mu_x, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\det \Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu_x)^T \Sigma^{-1} (X-\mu_x)\right)
X → d-dimensional input random vector, i.e. X = [x_1, x_2, ..., x_d]^T
µ_x → d-dimensional mean vector, i.e. µ_x = [µ_{x_1}, µ_{x_2}, ..., µ_{x_d}]^T
Σ → covariance matrix of size d × d
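As a quick numerical check, the sketch below (an assumed example, not part of the lecture) evaluates this multivariate PDF directly from the formula and compares it against scipy.stats.multivariate_normal; the dimension d, the mean µ_x, the covariance Σ, and the test point X are all made-up values.

import numpy as np
from scipy.stats import multivariate_normal

d = 3
mu = np.zeros(d)                          # mean vector µ_x (assumed)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])       # covariance matrix Σ (d x d, positive definite, assumed)
x = np.array([0.5, -1.0, 0.2])            # a d-dimensional input vector X (assumed)

# Direct evaluation of the PDF formula above
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_formula = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Reference value from scipy
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(pdf_formula, pdf_scipy)             # the two values should match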
4. Properties of Gaussian distributed random variable
1. The sum of two independent Gaussian distributed rvs is also Gaussian. Let
X_1 ∼ N(µ_{X_1}, Σ_{X_1X_1}) and X_2 ∼ N(µ_{X_2}, Σ_{X_2X_2}) be two independent Gaussian distributed rvs. Then
Z = X_1 + X_2 ∼ N(µ_{X_1} + µ_{X_2}, \; Σ_{X_1X_1} + Σ_{X_2X_2})
2. Normalization: the Gaussian density integrates to one,
Z = \int_y p(y; \mu, \Sigma)\, dy = 1
3. Marginalization is also a Gaussian distribution.
p(X_1) = \int_{X_2} p(X_1, X_2; \mu, \Sigma)\, dX_2 → Gaussian distribution
4. Conditioning: the conditional distribution of X_1 given X_2 is also Gaussian.
p(X_1 | X_2) = \frac{p(X_1, X_2; \mu, \Sigma)}{\int_{X_1} p(X_1, X_2; \mu, \Sigma)\, dX_1} → Gaussian distribution
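Conditioning is the property that GP regression relies on. The short sketch below (an assumed bivariate example, not from the slides) computes the conditional mean and variance of X_1 given an observed X_2 using the standard Gaussian conditioning formulas.

import numpy as np

# For a joint Gaussian over (X1, X2), the conditional p(X1 | X2 = x2) is again Gaussian with
#   mean  µ1 + Σ12 Σ22^{-1} (x2 - µ2)   and   variance  Σ11 - Σ12 Σ22^{-1} Σ21.
mu = np.array([1.0, -0.5])                # joint mean [µ1, µ2] (assumed)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])            # joint covariance (assumed)
x2 = 0.8                                  # observed value of X2 (assumed)

mu1, mu2 = mu
s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

cond_mean = mu1 + s12 / s22 * (x2 - mu2)  # conditional mean of X1 | X2 = x2
cond_var = s11 - s12 / s22 * s12          # conditional variance of X1 | X2 = x2

print(cond_mean, cond_var)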
5. Central limit theorem
⇒ Let {X_1, . . . , X_n} be a random sample of size n.
⇒ All samples are independent and identically distributed (i.i.d.) with mean µ and variance σ².
⇒ The distribution of the sample average
\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}
approaches a Gaussian distribution as n → ∞.
⇒ By the law of large numbers, the sample average converges almost surely to the expected value µ.
⇒ Define the standardized variable
Z = \lim_{n \to \infty} \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}
⇒ Resultant PDF: since \bar{X}_n is approximately N(µ, σ²/n) for large n, the standardized variable Z follows the standard normal density
f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-Z^2/2}
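A quick empirical sketch (an assumed illustration, not part of the lecture) of this statement: averages of i.i.d. uniform samples, once standardized, behave like a standard normal. The sample size, number of trials, and the Uniform(0, 1) choice are all invented.

import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # sample size per average (assumed)
trials = 50_000                           # number of sample averages to draw (assumed)

# Uniform(0, 1) has mean 0.5 and variance 1/12
samples = rng.uniform(0.0, 1.0, size=(trials, n))
xbar = samples.mean(axis=1)               # sample averages X̄_n

# Standardize: Z = sqrt(n) (X̄_n - µ) / σ; should be close to N(0, 1)
z = np.sqrt(n) * (xbar - 0.5) / np.sqrt(1.0 / 12.0)
print(z.mean(), z.std())                  # ≈ 0 and ≈ 1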
7. MLE vs MAP
Maximum likelihood estimator (MLE)
Let y = ax + n, where n ∼ N(0, σ2)
\hat{x}_{MLE}(y) = \arg\max_x f_Y(y|x) = \arg\max_x \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-ax)^2}{2\sigma^2}\right)
The likelihood is maximized when (y − ax)² = 0, i.e. the measurement satisfies y = a\,\hat{x}_{MLE}, so \hat{x}_{MLE} = y/a.
Note: MLE requires no prior distribution on x.
8. Maximum a posteriori probability (MAP)
1 Maximum a priori estimate
\hat{x}_{apriori} = \arg\max_x f_X(x)
2 Maximum a posteriori probability (MAP) estimate
\hat{x}_{MAP} = \arg\max_x f_X(x|y), \quad \text{where} \quad f_X(x|y) = \frac{f_Y(y|x)\, f_X(x)}{f_Y(y)} = \frac{f_Y(y|x)\, f_X(x)}{\int_X f_Y(y|x)\, f_X(x)\, dx}
⇒ If the prior f_X(x) is uniformly distributed, then
\hat{x}_{MLE} = \hat{x}_{MAP}
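To make the comparison concrete, here is a small numerical sketch (an assumed toy setting, not from the slides) for the scalar model y = ax + n above with a zero-mean Gaussian prior on x; the values of a, σ, σ_x and y are invented.

import numpy as np

a, sigma, sigma_x = 2.0, 0.5, 1.0         # model gain, noise std, prior std (all assumed)
y = 1.7                                   # observed measurement (assumed)

# MLE: maximizes f_Y(y|x), giving y = a * x_hat
x_mle = y / a

# MAP: maximizes f_Y(y|x) f_X(x); for a Gaussian likelihood and a N(0, sigma_x^2) prior,
# setting the derivative of the log-posterior to zero gives
# x_hat = a * y * sigma_x^2 / (sigma^2 + a^2 * sigma_x^2)
x_map = a * y * sigma_x**2 / (sigma**2 + a**2 * sigma_x**2)

print(x_mle, x_map)                       # MAP is shrunk toward the prior mean 0
# As sigma_x -> infinity (an effectively uniform prior), x_map approaches x_mle.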
9. Linear regression
Let the data be D = {(x_1, y_1), ....., (x_n, y_n)}.
⇒ MLE: p(D|w) = \prod_{i=1}^{n} p(y_i | x_i; w), where each p(y_i | x_i; w) ∼ N(w^T x_i, \sigma^2)
⇒ MAP: p(w|D) ∝ p(D|w)\, p(w), i.e.
p(w|D) = \frac{p(D|w)\, p(w)}{\int_w p(D|w)\, p(w)\, dw}
⇒ Predictive distribution:
p(y|x; D) = \int_w p(y|x; w)\, p(w|D)\, dw
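For the Gaussian likelihood and a Gaussian prior on w, all three quantities above have closed forms. The sketch below (assumed synthetic data, an assumed prior scale σ_w, and an assumed test input) illustrates the posterior p(w|D) and the predictive p(y|x; D) numerically.

import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))               # training inputs, one row per example (assumed)
w_true = np.array([1.5, -0.7])            # assumed ground-truth weights
sigma = 0.3                               # noise std in y = w^T x + noise (assumed)
y = X @ w_true + sigma * rng.normal(size=n)

sigma_w = 1.0                             # prior: w ~ N(0, sigma_w^2 I) (assumed)

# Posterior p(w|D) = N(m, S)
S_inv = X.T @ X / sigma**2 + np.eye(d) / sigma_w**2
S = np.linalg.inv(S_inv)
m = S @ X.T @ y / sigma**2

# Predictive p(y|x*; D) = N(x*^T m, x*^T S x* + sigma^2)
x_star = np.array([0.5, 1.0])             # assumed test input
pred_mean = x_star @ m
pred_var = x_star @ S @ x_star + sigma**2
print(pred_mean, pred_var)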
10. Continued–
In general, the posterior predictive distribution is
P(Y|D, X) = \int_w P(Y, w|D, X)\, dw = \int_w P(Y|w, D, X)\, P(w|D)\, dw
The above is often intractable in closed form.
The mean and covariance of this predictive distribution can be written as
P(y_*|D, x_*) ∼ N(\mu_{y_*|D}, \Sigma_{y_*|D})
where
\mu_{y_*|D} = K_*^T (K + \sigma^2 I)^{-1} y
and
\Sigma_{y_*|D} = K_{**} - K_*^T (K + \sigma^2 I)^{-1} K_*
11. Gaussian process
⇒ Problem:
f is an infinite-dimensional function, but the multivariate Gaussian
distribution is defined only for finite-dimensional random vectors.
⇒ Definition: A GP is a collection of random variables (RVs) such that
the joint distribution of every finite subset of RVs is multivariate
Gaussian:
f ∼ GP(µ, k)
where µ(x) is the mean function and k(x, x′) is the covariance function.
⇒ We need to model the predictive distribution P(f_*|x_*, D).
⇒ We can use a Bayesian approach: place a GP prior,
P(f|x) ∼ N(µ, Σ), and condition it on the training data D to model
the joint distribution of f = f(X) (the vector of training observations)
and f_* = f(x_*) (the prediction at a test input).
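The definition says that any finite restriction of f is just a multivariate Gaussian, so sampling functions from a GP prior reduces to sampling a Gaussian vector. Below is a minimal sketch, assuming a zero mean function and an RBF covariance with made-up hyperparameters and input locations.

import numpy as np

def rbf_kernel(x1, x2, tau=1.0, length=1.0):
    """k(x, x') = tau * exp(-(x - x')^2 / (2 * length^2)) for 1-D inputs (assumed hyperparameters)."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return tau * np.exp(-d2 / (2.0 * length**2))

x = np.linspace(0.0, 5.0, 100)            # finite set of input locations (assumed)
K = rbf_kernel(x, x)                      # covariance of f(x) under the GP prior
K += 1e-8 * np.eye(len(x))                # small jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # 3 sample functions from GP(0, k)
print(samples.shape)                      # (3, 100): each row is one draw of f at the inputs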
12. Gaussian Process Regression (GPR)
We assume the training and test labels are jointly drawn from a zero-mean Gaussian prior:
y = [y_1, y_2, ...., y_n, y_t]^T ∼ N(0, Σ)
⇒ All training and test labels are drawn from an (n+m)-dimensional Gaussian
distribution.
⇒ n is the number of training points.
⇒ m is the number of test points.
We consider the following properties of Σ :
1 Σij = E((Yi − µi )(Yj − µj ))
2 Σ is always positive semi-definite.
3 Σii = Var(Yi ), thus Σii ≥ 0
4 If Y_i and Y_j are nearly independent, i.e. x_i is very different from x_j, then
Σ_ij = Σ_ji ≈ 0. If x_i is similar to x_j, then Σ_ij = Σ_ji > 0
13. Continued–
We can observe that Σ plays a role very similar to the kernel matrix in SVMs.
Therefore, we can simply let Σ_ij = K(x_i, x_j). For example,
(a) If we use the RBF kernel, then
Σ_ij = \tau \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)
(b) If we use the polynomial kernel, then Σ_ij = \tau (1 + x_i^T x_j)^d.
We can decompose Σ as
\Sigma = \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix}
where
K is the training kernel matrix,
K_* is the training-testing kernel matrix,
K_*^T is the testing-training kernel matrix, and
K_{**} is the testing kernel matrix.
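A short sketch (assumed toy inputs, not from the slides) of how these blocks can be assembled with the RBF kernel above; the training/test inputs and the hyperparameters τ and σ are invented.

import numpy as np

def rbf_kernel(A, B, tau=1.0, sigma=1.0):
    """Sigma_ij = tau * exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B (assumed hyperparameters)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))         # n = 5 training inputs in 2-D (assumed)
X_test = rng.normal(size=(3, 2))          # m = 3 test inputs (assumed)

K = rbf_kernel(X_train, X_train)          # n x n training kernel matrix
K_star = rbf_kernel(X_train, X_test)      # n x m training-testing kernel matrix
K_starstar = rbf_kernel(X_test, X_test)   # m x m testing kernel matrix

# Full (n+m) x (n+m) covariance of the joint prior over training and test labels
Sigma = np.block([[K, K_star], [K_star.T, K_starstar]])
print(Sigma.shape)                        # (8, 8)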
14. Continued–
The conditional distribution of the (noise-free) values of the latent function f at the test input
can be written as:
f_* | (Y_1 = y_1, ..., Y_n = y_n, x_1, ..., x_n, x_t) ∼ N\left(K_*^T K^{-1} y, \; K_{**} - K_*^T K^{-1} K_*\right)
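Below is a minimal numerical sketch of this conditioning step on assumed 1-D toy data (nothing here comes from the lecture); replacing K with K + σ²I in the two formulas recovers the noisy predictive distribution given earlier.

import numpy as np

def rbf_kernel(a, b, tau=1.0, sigma=1.0):
    """RBF kernel for 1-D inputs with assumed hyperparameters tau and sigma."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return tau * np.exp(-d2 / (2.0 * sigma**2))

# Toy training data and test inputs (assumed for illustration)
x_train = np.array([-2.0, -1.0, 0.0, 1.5, 2.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

K = rbf_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # jitter for numerical stability
K_star = rbf_kernel(x_train, x_test)
K_starstar = rbf_kernel(x_test, x_test)

K_inv = np.linalg.inv(K)                  # the O(n^3) step mentioned in the conclusion
mean = K_star.T @ K_inv @ y_train         # posterior mean of f_*
cov = K_starstar - K_star.T @ K_inv @ K_star  # posterior covariance of f_*

print(mean)
print(np.diag(cov))                       # per-test-point predictive variances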
15. Conclusion
Gaussian Process Regression has the following properties:
1 GPs are an elegant and powerful ML method.
2 We get a measure of uncertainty for the predictions for free.
3 GPs work very well for regression problems with small training data
set sizes.
4 Running time is O(n³) due to the matrix inversion (gets slow when n is large) ⇒
use sparse GPs for large n.
5 GPs are a little bit more involved for classification (non-Gaussian
likelihood).
6 We can model non-Gaussian likelihoods in regression and do
approximate inference for, e.g., count data (Poisson distribution).
16. References
T. M. Mitchell, The Discipline of Machine Learning. Carnegie Mellon University,
School of Computer Science, Machine Learning Department, 2006, vol. 9.
E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
K. Weinberger, CS4780 lecture notes on Gaussian processes, Cornell University,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote15.html,
May 2018.