1. Gaussian Process in Machine Learning
Subject: Machine Learning
Dr. Varun Kumar
2. Outline
1 Introduction to Gaussian Distributed Random Variable
2 Central Limit Theorem
3 MLE Vs MAP
4 Gaussian Process for Linear Regression
5 References
3. Introduction to Gaussian Distributed Random Variable (rv)
Gaussian distribution
1 The general expression for the PDF of a uni-variate Gaussian distributed random variable is
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
where σ → standard deviation, µ → mean, σ² → variance
2 The general expression for the PDF of a multi-variate Gaussian distributed random variable is
P(X; \mu_x, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\det \Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu_x)^T \Sigma^{-1} (X-\mu_x)\right)
X → d-dimensional input random vector, i.e. X = [x_1, x_2, ..., x_d]^T
µ_x → d-dimensional mean vector, i.e. µ_x = [µ_{x_1}, µ_{x_2}, ..., µ_{x_d}]^T
Σ → covariance matrix of size d × d
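As a quick numerical check, the sketch below (an assumed example, not part of the lecture) evaluates this multivariate PDF directly from the formula and compares it against scipy.stats.multivariate_normal; the dimension d, the mean µ_x, the covariance Σ, and the test point X are all made-up values.

import numpy as np
from scipy.stats import multivariate_normal

d = 3
mu = np.zeros(d)                          # mean vector µ_x (assumed)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])       # covariance matrix Σ (d x d, positive definite, assumed)
x = np.array([0.5, -1.0, 0.2])            # a d-dimensional input vector X (assumed)

# Direct evaluation of the PDF formula above
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_formula = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Reference value from scipy
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(pdf_formula, pdf_scipy)             # the two values should match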
4. Properties of Gaussian distributed random variable
1. The sum of two independent Gaussian distributed rvs is also Gaussian. Let
X_1 ∼ N(µ_{X_1}, Σ_{X_1X_1}) and X_2 ∼ N(µ_{X_2}, Σ_{X_2X_2}) be two independent Gaussian distributed rvs. Then
Z = X_1 + X_2 ∼ N(µ_{X_1} + µ_{X_2}, \; Σ_{X_1X_1} + Σ_{X_2X_2})
2. Normalization: the Gaussian density integrates to one,
Z = \int_y p(y; \mu, \Sigma)\, dy = 1
3. Marginalization is also a Gaussian distribution.
p(X_1) = \int_{X_2} p(X_1, X_2; \mu, \Sigma)\, dX_2 → Gaussian distribution
4. Conditioning: the conditional distribution of X_1 given X_2 is also Gaussian.
p(X_1 | X_2) = \frac{p(X_1, X_2; \mu, \Sigma)}{\int_{X_1} p(X_1, X_2; \mu, \Sigma)\, dX_1} → Gaussian distribution
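Conditioning is the property that GP regression relies on. The short sketch below (an assumed bivariate example, not from the slides) computes the conditional mean and variance of X_1 given an observed X_2 using the standard Gaussian conditioning formulas.

import numpy as np

# For a joint Gaussian over (X1, X2), the conditional p(X1 | X2 = x2) is again Gaussian with
#   mean  µ1 + Σ12 Σ22^{-1} (x2 - µ2)   and   variance  Σ11 - Σ12 Σ22^{-1} Σ21.
mu = np.array([1.0, -0.5])                # joint mean [µ1, µ2] (assumed)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])            # joint covariance (assumed)
x2 = 0.8                                  # observed value of X2 (assumed)

mu1, mu2 = mu
s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

cond_mean = mu1 + s12 / s22 * (x2 - mu2)  # conditional mean of X1 | X2 = x2
cond_var = s11 - s12 / s22 * s12          # conditional variance of X1 | X2 = x2

print(cond_mean, cond_var)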
5. Central limit theorem
⇒ Let {X_1, . . . , X_n} be a random sample of size n.
⇒ All samples are independent and identically distributed (i.i.d.) with mean µ and variance σ².
⇒ The distribution of the sample average
\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}
approaches a Gaussian distribution as n → ∞.
⇒ By the law of large numbers, the sample average converges almost surely to the expected value µ.
⇒ Define the standardized variable
Z = \lim_{n \to \infty} \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}
⇒ Resultant PDF: since \bar{X}_n is approximately N(µ, σ²/n) for large n, the standardized variable Z follows the standard normal density
f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-Z^2/2}
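A quick empirical sketch (an assumed illustration, not part of the lecture) of this statement: averages of i.i.d. uniform samples, once standardized, behave like a standard normal. The sample size, number of trials, and the Uniform(0, 1) choice are all invented.

import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # sample size per average (assumed)
trials = 50_000                           # number of sample averages to draw (assumed)

# Uniform(0, 1) has mean 0.5 and variance 1/12
samples = rng.uniform(0.0, 1.0, size=(trials, n))
xbar = samples.mean(axis=1)               # sample averages X̄_n

# Standardize: Z = sqrt(n) (X̄_n - µ) / σ; should be close to N(0, 1)
z = np.sqrt(n) * (xbar - 0.5) / np.sqrt(1.0 / 12.0)
print(z.mean(), z.std())                  # ≈ 0 and ≈ 1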
7. MLE vs MAP
Maximum likelihood estimator (MLE)
Let y = ax + n, where n ∼ N(0, σ2)
\hat{x}_{MLE}(y) = \arg\max_x f_Y(y|x) = \arg\max_x \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-ax)^2}{2\sigma^2}\right)
The likelihood is maximized when (y − ax)² = 0, i.e. the measurement satisfies y = a\,\hat{x}_{MLE}, so \hat{x}_{MLE} = y/a.
Note: MLE requires no prior distribution on x.
8. Maximum a posteriori probability (MAP)
1 Maximum a priori estimate
\hat{x}_{apriori} = \arg\max_x f_X(x)
2 Maximum a posteriori probability (MAP) estimate
\hat{x}_{MAP} = \arg\max_x f_X(x|y), \quad \text{where} \quad f_X(x|y) = \frac{f_Y(y|x)\, f_X(x)}{f_Y(y)} = \frac{f_Y(y|x)\, f_X(x)}{\int_X f_Y(y|x)\, f_X(x)\, dx}
⇒ If the prior f_X(x) is uniformly distributed, then
\hat{x}_{MLE} = \hat{x}_{MAP}
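To make the comparison concrete, here is a small numerical sketch (an assumed toy setting, not from the slides) for the scalar model y = ax + n above with a zero-mean Gaussian prior on x; the values of a, σ, σ_x and y are invented.

import numpy as np

a, sigma, sigma_x = 2.0, 0.5, 1.0         # model gain, noise std, prior std (all assumed)
y = 1.7                                   # observed measurement (assumed)

# MLE: maximizes f_Y(y|x), giving y = a * x_hat
x_mle = y / a

# MAP: maximizes f_Y(y|x) f_X(x); for a Gaussian likelihood and a N(0, sigma_x^2) prior,
# setting the derivative of the log-posterior to zero gives
# x_hat = a * y * sigma_x^2 / (sigma^2 + a^2 * sigma_x^2)
x_map = a * y * sigma_x**2 / (sigma**2 + a**2 * sigma_x**2)

print(x_mle, x_map)                       # MAP is shrunk toward the prior mean 0
# As sigma_x -> infinity (an effectively uniform prior), x_map approaches x_mle.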
9. Linear regression
Let the data be D = {(x_1, y_1), ....., (x_n, y_n)}.
⇒ MLE: p(D|w) = \prod_{i=1}^{n} p(y_i | x_i; w), where each p(y_i | x_i; w) ∼ N(w^T x_i, \sigma^2)
⇒ MAP: p(w|D) ∝ p(D|w)\, p(w), i.e.
p(w|D) = \frac{p(D|w)\, p(w)}{\int_w p(D|w)\, p(w)\, dw}
⇒ Predictive distribution:
p(y|x; D) = \int_w p(y|x; w)\, p(w|D)\, dw
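For the Gaussian likelihood and a Gaussian prior on w, all three quantities above have closed forms. The sketch below (assumed synthetic data, an assumed prior scale σ_w, and an assumed test input) illustrates the posterior p(w|D) and the predictive p(y|x; D) numerically.

import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))               # training inputs, one row per example (assumed)
w_true = np.array([1.5, -0.7])            # assumed ground-truth weights
sigma = 0.3                               # noise std in y = w^T x + noise (assumed)
y = X @ w_true + sigma * rng.normal(size=n)

sigma_w = 1.0                             # prior: w ~ N(0, sigma_w^2 I) (assumed)

# Posterior p(w|D) = N(m, S)
S_inv = X.T @ X / sigma**2 + np.eye(d) / sigma_w**2
S = np.linalg.inv(S_inv)
m = S @ X.T @ y / sigma**2

# Predictive p(y|x*; D) = N(x*^T m, x*^T S x* + sigma^2)
x_star = np.array([0.5, 1.0])             # assumed test input
pred_mean = x_star @ m
pred_var = x_star @ S @ x_star + sigma**2
print(pred_mean, pred_var)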
10. Continued–
In general, the posterior predictive distribution is
P(Y|D, X) = \int_w P(Y, w|D, X)\, dw = \int_w P(Y|w, D, X)\, P(w|D)\, dw
The above is often intractable in closed form.
The mean and covariance of this predictive distribution can be written as
P(y_*|D, x_*) ∼ N(\mu_{y_*|D}, \Sigma_{y_*|D})
where
\mu_{y_*|D} = K_*^T (K + \sigma^2 I)^{-1} y
and
\Sigma_{y_*|D} = K_{**} - K_*^T (K + \sigma^2 I)^{-1} K_*
11. Gaussian process
⇒ Problem:
f is an infinite-dimensional function, but the multivariate Gaussian
distribution is defined only for finite-dimensional random vectors.
⇒ Definition: A GP is a collection of random variables (RVs) such that
the joint distribution of every finite subset of RVs is multivariate
Gaussian:
f ∼ GP(µ, k)
where µ(x) is the mean function and k(x, x′) is the covariance function.
⇒ We need to model the predictive distribution P(f_*|x_*, D).
⇒ We can use a Bayesian approach: place a GP prior,
P(f|x) ∼ N(µ, Σ), and condition it on the training data D to model
the joint distribution of f = f(X) (the vector of training observations)
and f_* = f(x_*) (the prediction at a test input).
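The definition says that any finite restriction of f is just a multivariate Gaussian, so sampling functions from a GP prior reduces to sampling a Gaussian vector. Below is a minimal sketch, assuming a zero mean function and an RBF covariance with made-up hyperparameters and input locations.

import numpy as np

def rbf_kernel(x1, x2, tau=1.0, length=1.0):
    """k(x, x') = tau * exp(-(x - x')^2 / (2 * length^2)) for 1-D inputs (assumed hyperparameters)."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return tau * np.exp(-d2 / (2.0 * length**2))

x = np.linspace(0.0, 5.0, 100)            # finite set of input locations (assumed)
K = rbf_kernel(x, x)                      # covariance of f(x) under the GP prior
K += 1e-8 * np.eye(len(x))                # small jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # 3 sample functions from GP(0, k)
print(samples.shape)                      # (3, 100): each row is one draw of f at the inputs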
12. Gaussian Process Regression (GPR)
We assume the training and test labels are jointly drawn from a zero-mean Gaussian prior:
y = [y_1, y_2, ...., y_n, y_t]^T ∼ N(0, Σ)
⇒ All training and test labels are drawn from an (n+m)-dimensional Gaussian
distribution.
⇒ n is the number of training points.
⇒ m is the number of test points.
We consider the following properties of Σ :
1 Σij = E((Yi − µi )(Yj − µj ))
2 Σ is always positive semi-definite.
3 Σii = Var(Yi ), thus Σii ≥ 0
4 If Y_i and Y_j are nearly independent, i.e. x_i is very different from x_j, then
Σ_ij = Σ_ji ≈ 0. If x_i is similar to x_j, then Σ_ij = Σ_ji > 0
13. Continued–
We can observe that Σ plays a role very similar to the kernel matrix in SVMs.
Therefore, we can simply let Σ_ij = K(x_i, x_j). For example,
(a) If we use the RBF kernel, then
Σ_ij = \tau \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)
(b) If we use the polynomial kernel, then Σ_ij = \tau (1 + x_i^T x_j)^d.
We can decompose Σ as
\Sigma = \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix}
where
K is the training kernel matrix,
K_* is the training-testing kernel matrix,
K_*^T is the testing-training kernel matrix, and
K_{**} is the testing kernel matrix.
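A short sketch (assumed toy inputs, not from the slides) of how these blocks can be assembled with the RBF kernel above; the training/test inputs and the hyperparameters τ and σ are invented.

import numpy as np

def rbf_kernel(A, B, tau=1.0, sigma=1.0):
    """Sigma_ij = tau * exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B (assumed hyperparameters)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))         # n = 5 training inputs in 2-D (assumed)
X_test = rng.normal(size=(3, 2))          # m = 3 test inputs (assumed)

K = rbf_kernel(X_train, X_train)          # n x n training kernel matrix
K_star = rbf_kernel(X_train, X_test)      # n x m training-testing kernel matrix
K_starstar = rbf_kernel(X_test, X_test)   # m x m testing kernel matrix

# Full (n+m) x (n+m) covariance of the joint prior over training and test labels
Sigma = np.block([[K, K_star], [K_star.T, K_starstar]])
print(Sigma.shape)                        # (8, 8)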
14. Continued–
The conditional distribution of the (noise-free) values of the latent function f at the test input
can be written as:
f_* | (Y_1 = y_1, ..., Y_n = y_n, x_1, ..., x_n, x_t) ∼ N\left(K_*^T K^{-1} y, \; K_{**} - K_*^T K^{-1} K_*\right)
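Below is a minimal numerical sketch of this conditioning step on assumed 1-D toy data (nothing here comes from the lecture); replacing K with K + σ²I in the two formulas recovers the noisy predictive distribution given earlier.

import numpy as np

def rbf_kernel(a, b, tau=1.0, sigma=1.0):
    """RBF kernel for 1-D inputs with assumed hyperparameters tau and sigma."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return tau * np.exp(-d2 / (2.0 * sigma**2))

# Toy training data and test inputs (assumed for illustration)
x_train = np.array([-2.0, -1.0, 0.0, 1.5, 2.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

K = rbf_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # jitter for numerical stability
K_star = rbf_kernel(x_train, x_test)
K_starstar = rbf_kernel(x_test, x_test)

K_inv = np.linalg.inv(K)                  # the O(n^3) step mentioned in the conclusion
mean = K_star.T @ K_inv @ y_train         # posterior mean of f_*
cov = K_starstar - K_star.T @ K_inv @ K_star  # posterior covariance of f_*

print(mean)
print(np.diag(cov))                       # per-test-point predictive variances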
15. Conclusion
Gaussian Process Regression has the following properties:
1 GPs are an elegant and powerful ML method.
2 We get a measure of uncertainty for the predictions for free.
3 GPs work very well for regression problems with small training data
set sizes.
4 Running time is O(n³) due to the matrix inversion (gets slow when n is large) ⇒
use sparse GPs for large n.
5 GPs are a little bit more involved for classification (non-Gaussian
likelihood).
6 We can model non-Gaussian likelihoods in regression and do
approximate inference for, e.g., count data (Poisson distribution).
16. References
T. M. Mitchell, The Discipline of Machine Learning. Carnegie Mellon University,
School of Computer Science, Machine Learning Department, 2006, vol. 9.
E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
K. Weinberger, CS4780 lecture notes on Gaussian processes, Cornell University,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote15.html,
May 2018.