Introduction to Gaussian Processes
Dmytro Fishman (dmytro@ut.ee)

x → f(x) → y
Let's take a look inside.
Let $y = f(x)$ be a linear function:

$$y = \theta_0 + \theta_1 x$$
Find $\theta_0$ and $\theta_1$ by minimising the squared error:

$$\arg\min_{\theta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \theta_0 + \theta_1 x_i, \qquad y_i = \theta_0 + \theta_1 x_i + \epsilon_i$$
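As a minimal sketch (not from the slides), this is how the two parameters could be estimated by ordinary least squares in NumPy; the toy data below is made up for the example:

```python
import numpy as np

# Toy data (hypothetical): y is roughly linear in x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of argmin_theta ||y - X theta||^2
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("theta_0 =", theta[0], "theta_1 =", theta[1])
```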
But what if the data is not linear? We can add polynomial terms:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
What if we don't want to assume a specific form for $f$?

GPs let you model any function directly.
Parametric ML: a learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. Example: $y = \theta_0 + \theta_1 x$.

Nonparametric ML: algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms.

Question: is k-nearest neighbours a parametric or a nonparametric algorithm according to these definitions?
GPs let you model any function directly and estimate the uncertainty of each new prediction.

If I ask you to predict $y_i$ for an $x_i$ that lies far from the training data, you had better be very uncertain.
How is that even possible? We will need the normal distribution,

$$N(\mu, \sigma^2): \qquad p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

with mean (average coordinate) $\mu$ and standard deviation $\sigma$ from the centre. Many important processes follow the normal distribution.
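As a quick sanity check (my own addition, using SciPy), the density written above matches the library implementation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = 0.5
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(np.isclose(manual, norm.pdf(x, loc=mu, scale=sigma)))  # True
```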
So we have one normal variable $X_1 \sim N(\mu_1, \sigma_1^2)$. What if I draw another distribution, $X_2 \sim N(\mu_2, \sigma_2^2)$? Take $\mu_1 = 0$, $\sigma_1 = 1$ and $\mu_2 = 0$, $\sigma_2 = 1$, so that $X_1 \sim N(0, 1)$ and $X_2 \sim N(0, 1)$.

What if we join them into one plot?
Plotting $X_1$ against $X_2$ gives the joint distribution of variables $x_1$ and $x_2$. Stack the means and the variables into vectors,

$$M = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$

so that

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right).$$

The matrix on the right is the covariance matrix, also written $\Sigma$.
Similarity. Compare

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \qquad \text{with} \qquad \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right).$$

No similarity (no correlation): a positive value of $X_1$ does not tell much about $X_2$.
Some similarity (correlation): a positive value of $X_1$ means, with good probability, a positive $X_2$.
More generally, the joint distribution $P(x_1, x_2)$ of variables $x_1$ and $x_2$ is

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right).$$

Conditioning on an observed $x_2$ gives the conditional distribution

$$P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

$N(\mu, \sigma^2)$ is the normal distribution, or 1D Gaussian; the joint distribution $P(x_1, x_2)$ above is a 2D Gaussian.
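A small sketch (mine, not from the slides) of these conditioning formulas for a 2D Gaussian, with made-up numbers:

```python
import numpy as np

# Hypothetical 2D Gaussian parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

x2_observed = 1.0  # suppose we observe x2

# Conditional mean and variance of x1 given x2 (formulas above)
mu_1_given_2 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_observed - mu[1])
var_1_given_2 = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_1_given_2, var_1_given_2)  # 0.5, 0.75
```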
Sampling from a 2D Gaussian. Draw samples from

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$$

and plot each sample as two points, the 1st and 2nd coordinates side by side: for example (-0.23, 1.13), then (-1.14, 0.65). There is little dependency between the two values.

Now sample instead from

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right):$$

for example (0.13, 0.52), then (-0.03, -0.24). The two values in each sample are now more dependent.
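A hedged sketch of this sampling experiment in NumPy (the specific pairs quoted above came from draws like these; your draws will differ):

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.zeros(2)

cov_independent = np.array([[1.0, 0.0], [0.0, 1.0]])
cov_correlated  = np.array([[1.0, 0.5], [0.5, 1.0]])

# Each row is one sample (x1, x2) from the 2D Gaussian
print(rng.multivariate_normal(mean, cov_independent, size=3))
print(rng.multivariate_normal(mean, cov_correlated, size=3))
```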
How would a sample from a 20D Gaussian look? Take a zero mean vector and an identity covariance matrix:

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}\right)$$

Sampling gives, for example, (0.73, -0.12, 0.42, 1.2, …, 16 more values), which we can plot as 20 points at positions 1st, 2nd, 3rd, 4th, and so on.
Let's add more dependency between points:

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 & \dots & 0.5 \\ 0.5 & 1 & 0.5 & \dots & 0.5 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.5 & 0.5 & 0.5 & \dots & 1 \end{bmatrix}\right)$$

Sampling from this 20D Gaussian now gives, for example, (0.73, 0.18, 0.68, -0.2, …, 16 more values).

But we want some notion of smoothness between points, so that the dependency between the 1st and 2nd points is larger than between the 1st and the 3rd.
We might have just increased the corresponding values in the covariance matrix by hand, right? Instead, we need a way to generate a "smooth" covariance matrix automatically, depending on the distance between points. We will use a similarity measure:

$$K_{ij} = e^{-\|z_i - z_j\|^2} = \begin{cases} \to 0, & \|z_i - z_j\| \to \infty \\ 1, & z_i = z_j \end{cases}$$
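A sketch (mine, not the slides') of building such a covariance matrix from this kernel and sampling a smooth 20-dimensional vector from it; the small jitter term is an assumption added only for numerical stability:

```python
import numpy as np

def kernel_matrix(z):
    """K_ij = exp(-||z_i - z_j||^2) for a 1D array of inputs z."""
    d = z[:, None] - z[None, :]
    return np.exp(-d ** 2)

z = np.linspace(0, 5, 20)
K = kernel_matrix(z) + 1e-8 * np.eye(len(z))  # jitter for numerical stability

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(z)), K)
print(sample.round(2))  # nearby entries take similar values (a "smooth" draw)
```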
With this kernel, the 20D Gaussian becomes

$$\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} & \dots & K_{1\,20} \\ K_{21} & K_{22} & K_{23} & \dots & K_{2\,20} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K_{20\,1} & K_{20\,2} & K_{20\,3} & \dots & K_{20\,20} \end{bmatrix}\right),$$

and the same construction scales to a 200D Gaussian with a $200 \times 200$ matrix of entries $K_{11}, \dots, K_{200\,200}$. Samples from it already look like smooth curves, and at any coordinate we can read off a mean $\mu_*$ and a standard deviation $\sigma_*$.
We are interested in modelling $F(z)$ for given $Z$: points $z_1, z_2, z_3$ with values $f_1, f_2, f_3$, chosen so that $f_1$ is more correlated with $f_2$ than with $f_3$.

Previously we were using

$$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right)$$

to generate correlated points; can we do it again here?

Wait! But now we have three points, we cannot use the same formula!

Ok… What about now?

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix}\right)$$

Wait, did he just say that $f_2$ should be more correlated to $f_1$ than to $f_3$? Arrrr….

Better now?

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.7 & 0.2 \\ 0.7 & 1 & 0.5 \\ 0.2 & 0.5 & 1 \end{bmatrix}\right)$$

Yes, but what if we want to obtain this matrix automatically, based on how close the points are in $Z$? We will use the similarity measure $K_{ij} = e^{-\|z_i - z_j\|^2}$, so now it becomes:

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}\right)$$
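As a tiny illustration (the input values are my own, hypothetical), the kernel fills in such a matrix automatically once the inputs are given:

```python
import numpy as np

z = np.array([1.0, 1.5, 3.0])                # hypothetical inputs z1, z2, z3
K = np.exp(-(z[:, None] - z[None, :]) ** 2)  # K_ij = exp(-||z_i - z_j||^2)
print(K.round(3))
# Nearby inputs (z1, z2) get a large covariance; distant ones (z1, z3) a small one.
```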
What is $f_*$? Given $\{(f_1, z_1); (f_2, z_2); (f_3, z_3)\}$ and also a new input $z_*$, what is $f_* = F(z_*)$?

Ok, so we have just modelled $f$:

$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}\right),$$

which is the same as saying $f \sim N(0, K)$.

But how do we model $f_*$? Well, probably again some kind of normal… Maybe something like $f_* \sim N(0, ?)$. But what is this "?" — the covariance of $z_*$ with $z_*$? So $f_* \sim N(0, K_{**})$. But isn't $K_{**}$ just 1? Indeed, $K_{**} = e^{-\|z_* - z_*\|^2} = 1$.

Ok, so we have just modelled $f \sim N(0, K)$ and $f_* \sim N(0, K_{**})$. What else is left? The joint distribution of $f$ and $f_*$:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^{\top} & K_{**} \end{bmatrix}\right), \qquad K = \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix}, \quad K_* = \begin{bmatrix} K_{1*} \\ K_{2*} \\ K_{3*} \end{bmatrix}.$$

Only one kind of entry is left: $K_{1*} = K(z_1, z_*)$, and I guess we know how to calculate this one — it is just the similarity measure $K_{ij} = e^{-\|z_i - z_j\|^2}$ again.

Yeah! We did it! Wait… but what do we do now? Remember the conditional distribution of a joint Gaussian:

$$P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

What if we substitute $x_1$ with $f_*$ and $x_2$ with $f$? Then we can compute the mean and standard deviation of $f_*$! Exactly:

$$\mu_* = \mu(z_*) + K_*^{\top} K^{-1} (f - \mu_f), \qquad \Sigma_* = K_{**} - K_*^{\top} K^{-1} K_*.$$

Repeating this for every test input $z_*$ gives the predictive mean $\mu_*$ and the uncertainty $\sigma_*$ across the whole input range.
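Putting the whole construction together, here is a minimal sketch of GP regression with this kernel; the training data, the test grid, and the zero prior mean $\mu(z_*) = \mu_f = 0$ are assumptions made only for the example:

```python
import numpy as np

def k(a, b):
    """Kernel K_ij = exp(-||a_i - b_j||^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

# Hypothetical training data {(f_i, z_i)} and test inputs z_*
z = np.array([1.0, 2.0, 4.0])
f = np.array([0.5, 1.0, -0.5])
z_star = np.linspace(0, 5, 50)

K = k(z, z) + 1e-8 * np.eye(len(z))   # K, with jitter for stability
K_s = k(z, z_star)                    # K_* (train x test)
K_ss = k(z_star, z_star)              # K_** (test x test)

K_inv = np.linalg.inv(K)
mu_star = K_s.T @ K_inv @ f               # mu_* = K_*^T K^-1 (f - mu_f), with mu_f = 0
Sigma_star = K_ss - K_s.T @ K_inv @ K_s   # Sigma_* = K_** - K_*^T K^-1 K_*
std_star = np.sqrt(np.clip(np.diag(Sigma_star), 0, None))
print(mu_star[:5].round(3), std_star[:5].round(3))
```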
Pros:
1. Can model almost any function directly
2. Can be made more flexible with different kernels
3. Provides uncertainty estimates

Cons:
1. Cannot be interpreted
2. Lose efficiency in high-dimensional spaces
3. Overfitting
Cat or Dog?

"It's always seemed obvious to me that it's better to know that you don't know, than to think you know and act on wrong information."
— Katherine Bailey

Teaching statistics (in class) vs. doing statistics (under review)
Resources:
- Katherine Bailey's presentation: http://katbailey.github.io/gp_talk/Gaussian_Processes.pdf
- Katherine Bailey's blog post: From both sides now: the math of linear regression (http://katbailey.github.io/post/from-both-sides-now-the-math-of-linear-regression/)
- Katherine Bailey's blog post: Gaussian processes for dummies (http://katbailey.github.io/post/gaussian-processes-for-dummies/)
- Kevin P. Murphy's book: Machine Learning - A Probabilistic Perspective, Chapter 15 (https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020)
- Alex Bridgland's blog post: Introduction to Gaussian Processes - Part I (http://bridg.land/posts/gaussian-processes-1)
- Nando de Freitas, Machine Learning - Introduction to Gaussian Processes (https://youtu.be/4vGiHC35j9s)