Introduction to Gaussian Processes
Dmytro Fishman (dmytro@ut.ee)

x → f(x) → y
Let's take a look inside.
Let $y = f(x)$ be a linear function:

$$y = \theta_0 + \theta_1 x$$
Find $\theta_0$ and $\theta_1$ by minimising the squared error:

$$\arg\min_{\theta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \theta_0 + \theta_1 x_i, \qquad y_i = \theta_0 + \theta_1 x_i + \epsilon_i$$
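As a minimal sketch (not from the slides), this is how the two parameters could be estimated by ordinary least squares in NumPy; the toy data below is made up for the example:

```python
import numpy as np

# Toy data (hypothetical): y is roughly linear in x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of argmin_theta ||y - X theta||^2
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("theta_0 =", theta[0], "theta_1 =", theta[1])
```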
But what if the data is not linear? We can add polynomial terms:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
What if we don't want to assume a specific form for $f$?

GPs let you model any function directly.
Parametric ML: a learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. Example: $y = \theta_0 + \theta_1 x$.

Nonparametric ML: algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms.

Question: is k-nearest neighbours a parametric or a nonparametric algorithm according to these definitions?
GPs let you model any function directly and estimate the uncertainty of each new prediction.

If I ask you to predict $y_i$ for an $x_i$ that lies far from the training data, you had better be very uncertain.
How is that even possible? We will need the normal distribution,

$$N(\mu, \sigma^2): \qquad p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

with mean (average coordinate) $\mu$ and standard deviation $\sigma$ from the centre. Many important processes follow the normal distribution.
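As a quick sanity check (my own addition, using SciPy), the density written above matches the library implementation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = 0.5
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(np.isclose(manual, norm.pdf(x, loc=mu, scale=sigma)))  # True
```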
So we have one normal variable $X_1 \sim N(\mu_1, \sigma_1^2)$. What if I draw another distribution, $X_2 \sim N(\mu_2, \sigma_2^2)$? Take $\mu_1 = 0$, $\sigma_1 = 1$ and $\mu_2 = 0$, $\sigma_2 = 1$, so that $X_1 \sim N(0, 1)$ and $X_2 \sim N(0, 1)$.

What if we join them into one plot?
Plotting $X_1$ against $X_2$ gives the joint distribution of variables $x_1$ and $x_2$. Stack the means and the variables into vectors,

$$M = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$

so that

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right).$$

The matrix on the right is the covariance matrix, also written $\Sigma$.
Similarity. Compare

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \qquad \text{with} \qquad \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right).$$

No similarity (no correlation): a positive value of $X_1$ does not tell much about $X_2$.
Some similarity (correlation): a positive value of $X_1$ means, with good probability, a positive $X_2$.
More generally, the joint distribution $P(x_1, x_2)$ of variables $x_1$ and $x_2$ is

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right).$$

Conditioning on an observed $x_2$ gives the conditional distribution

$$P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

$N(\mu, \sigma^2)$ is the normal distribution, or 1D Gaussian; the joint distribution $P(x_1, x_2)$ above is a 2D Gaussian.
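A small sketch (mine, not from the slides) of these conditioning formulas for a 2D Gaussian, with made-up numbers:

```python
import numpy as np

# Hypothetical 2D Gaussian parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

x2_observed = 1.0  # suppose we observe x2

# Conditional mean and variance of x1 given x2 (formulas above)
mu_1_given_2 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_observed - mu[1])
var_1_given_2 = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_1_given_2, var_1_given_2)  # 0.5, 0.75
```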
Sampling from a 2D Gaussian. Draw samples from

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$$

and plot each sample as two points, the 1st and 2nd coordinates side by side: for example (-0.23, 1.13), then (-1.14, 0.65). There is little dependency between the two values.

Now sample instead from

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right):$$

for example (0.13, 0.52), then (-0.03, -0.24). The two values in each sample are now more dependent.
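A hedged sketch of this sampling experiment in NumPy (the specific pairs quoted above came from draws like these; your draws will differ):

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.zeros(2)

cov_independent = np.array([[1.0, 0.0], [0.0, 1.0]])
cov_correlated  = np.array([[1.0, 0.5], [0.5, 1.0]])

# Each row is one sample (x1, x2) from the 2D Gaussian
print(rng.multivariate_normal(mean, cov_independent, size=3))
print(rng.multivariate_normal(mean, cov_correlated, size=3))
```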
How would a sample from a 20D Gaussian look? Take a zero mean vector and an identity covariance matrix:

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}\right)$$

Sampling gives, for example, (0.73, -0.12, 0.42, 1.2, …, 16 more values), which we can plot as 20 points at positions 1st, 2nd, 3rd, 4th, and so on.
Let's add more dependency between points:

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 & \dots & 0.5 \\ 0.5 & 1 & 0.5 & \dots & 0.5 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.5 & 0.5 & 0.5 & \dots & 1 \end{bmatrix}\right)$$

Sampling from this 20D Gaussian now gives, for example, (0.73, 0.18, 0.68, -0.2, …, 16 more values).

But we want some notion of smoothness between points, so that the dependency between the 1st and 2nd points is larger than between the 1st and the 3rd.
We might have just increased the corresponding values in the covariance matrix by hand, right? Instead, we need a way to generate a "smooth" covariance matrix automatically, depending on the distance between points. We will use a similarity measure:

$$K_{ij} = e^{-\|z_i - z_j\|^2} = \begin{cases} \to 0, & \|z_i - z_j\| \to \infty \\ 1, & z_i = z_j \end{cases}$$
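A sketch (mine, not the slides') of building such a covariance matrix from this kernel and sampling a smooth 20-dimensional vector from it; the small jitter term is an assumption added only for numerical stability:

```python
import numpy as np

def kernel_matrix(z):
    """K_ij = exp(-||z_i - z_j||^2) for a 1D array of inputs z."""
    d = z[:, None] - z[None, :]
    return np.exp(-d ** 2)

z = np.linspace(0, 5, 20)
K = kernel_matrix(z) + 1e-8 * np.eye(len(z))  # jitter for numerical stability

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(z)), K)
print(sample.round(2))  # nearby entries take similar values (a "smooth" draw)
```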
With this kernel, the 20D Gaussian becomes

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} & \dots & K_{1\,20} \\ K_{21} & K_{22} & K_{23} & \dots & K_{2\,20} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K_{20\,1} & K_{20\,2} & K_{20\,3} & \dots & K_{20\,20} \end{bmatrix}\right),$$

and the same construction scales to a 200D Gaussian with a $200 \times 200$ matrix of entries $K_{11}, \dots, K_{200\,200}$. Samples from it already look like smooth curves, and at any coordinate we can read off a mean $\mu_*$ and a standard deviation $\sigma_*$.
We are interested in modelling $F(z)$ for given $Z$: points $z_1, z_2, z_3$ with values $f_1, f_2, f_3$, chosen so that $f_1$ is more correlated with $f_2$ than with $f_3$.

Previously we were using

$$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right)$$

to generate correlated points; can we do it again here?

Wait! But now we have three points, we cannot use the same formula!

Ok… What about now?

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix}\right)$$

Wait, did he just say that $f_2$ should be more correlated to $f_1$ than to $f_3$? Arrrr….

Better now?

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.7 & 0.2 \\ 0.7 & 1 & 0.5 \\ 0.2 & 0.5 & 1 \end{bmatrix}\right)$$

Yes, but what if we want to obtain this matrix automatically, based on how close the points are in $Z$? We will use the similarity measure $K_{ij} = e^{-\|z_i - z_j\|^2}$, so now it becomes:

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}\right)$$
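As a tiny illustration (the input values are my own, hypothetical), the kernel fills in such a matrix automatically once the inputs are given:

```python
import numpy as np

z = np.array([1.0, 1.5, 3.0])                # hypothetical inputs z1, z2, z3
K = np.exp(-(z[:, None] - z[None, :]) ** 2)  # K_ij = exp(-||z_i - z_j||^2)
print(K.round(3))
# Nearby inputs (z1, z2) get a large covariance; distant ones (z1, z3) a small one.
```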
What is $f_*$? Given $\{(f_1, z_1); (f_2, z_2); (f_3, z_3)\}$ and also a new input $z_*$, what is $f_* = F(z_*)$?

Ok, so we have just modelled $f$:

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}\right),$$

which is the same as saying $f \sim N(0, K)$.

But how do we model $f_*$? Well, probably again some kind of normal… Maybe something like $f_* \sim N(0, ?)$. But what is this "?" — the covariance of $z_*$ with $z_*$? So $f_* \sim N(0, K_{**})$. But isn't $K_{**}$ just 1? Indeed, $K_{**} = e^{-\|z_* - z_*\|^2} = 1$.

Ok, so we have just modelled $f \sim N(0, K)$ and $f_* \sim N(0, K_{**})$. What else is left? The joint distribution of $f$ and $f_*$:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^{\top} & K_{**} \end{bmatrix}\right), \qquad K = \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}, \quad K_* = \begin{bmatrix} K_{1*} \\ K_{2*} \\ K_{3*} \end{bmatrix}.$$

Only one kind of entry is left: $K_{1*} = K(z_1, z_*)$, and I guess we know how to calculate this one — it is just the similarity measure $K_{ij} = e^{-\|z_i - z_j\|^2}$ again.

Yeah! We did it! Wait… but what do we do now? Remember the conditional distribution of a joint Gaussian:

$$P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

What if we substitute $x_1$ with $f_*$ and $x_2$ with $f$? Then we can compute the mean and standard deviation of $f_*$! Exactly:

$$\mu_* = \mu(z_*) + K_*^{\top} K^{-1} (f - \mu_f), \qquad \Sigma_* = K_{**} - K_*^{\top} K^{-1} K_*.$$

Repeating this for every test input $z_*$ gives the predictive mean $\mu_*$ and the uncertainty $\sigma_*$ across the whole input range.
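Putting the whole construction together, here is a minimal sketch of GP regression with this kernel; the training data, the test grid, and the zero prior mean $\mu(z_*) = \mu_f = 0$ are assumptions made only for the example:

```python
import numpy as np

def k(a, b):
    """Kernel K_ij = exp(-||a_i - b_j||^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

# Hypothetical training data {(f_i, z_i)} and test inputs z_*
z = np.array([1.0, 2.0, 4.0])
f = np.array([0.5, 1.0, -0.5])
z_star = np.linspace(0, 5, 50)

K = k(z, z) + 1e-8 * np.eye(len(z))   # K, with jitter for stability
K_s = k(z, z_star)                    # K_* (train x test)
K_ss = k(z_star, z_star)              # K_** (test x test)

K_inv = np.linalg.inv(K)
mu_star = K_s.T @ K_inv @ f               # mu_* = K_*^T K^-1 (f - mu_f), with mu_f = 0
Sigma_star = K_ss - K_s.T @ K_inv @ K_s   # Sigma_* = K_** - K_*^T K^-1 K_*
std_star = np.sqrt(np.clip(np.diag(Sigma_star), 0, None))
print(mu_star[:5].round(3), std_star[:5].round(3))
```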
Pros:
1. Can model almost any function directly
2. Can be made more flexible with different kernels
3. Provides uncertainty estimates

Cons:
1. Cannot be interpreted
2. Lose efficiency in high-dimensional spaces
3. Overfitting
Cat or Dog?

"It's always seemed obvious to me that it's better to know that you don't know, than to think you know and act on wrong information."
— Katherine Bailey

Teaching statistics (in class) vs. doing statistics (under review)
Resources:
- Katherine Bailey's presentation: http://katbailey.github.io/gp_talk/Gaussian_Processes.pdf
- Katherine Bailey's blog post: From both sides now: the math of linear regression (http://katbailey.github.io/post/from-both-sides-now-the-math-of-linear-regression/)
- Katherine Bailey's blog post: Gaussian processes for dummies (http://katbailey.github.io/post/gaussian-processes-for-dummies/)
- Kevin P. Murphy's book: Machine Learning - A Probabilistic Perspective, Chapter 15 (https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020)
- Alex Bridgland's blog post: Introduction to Gaussian Processes - Part I (http://bridg.land/posts/gaussian-processes-1)
- Nando de Freitas, Machine Learning - Introduction to Gaussian Processes (https://youtu.be/4vGiHC35j9s)