1. Chapter 6
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
2. Chapter 6. Kernel function
Memory-based methods
Consider the models we covered in chapters 3 & 4.
We tried to estimate $W$, the conditional $p(C_k \mid W, X)$ or $p(t \mid W, X)$, or the posterior distribution in the Bayesian setting.
That is, once we estimate the parameter or its distribution, the process is over, and we no longer need the training data.
Now, recall the nearest-neighbor method.
The nearest-neighbor method requires the entire training set, not only during training but also in the prediction phase.
Likewise, kernel methods also require the training data points again and again!
The reason will be explained soon.
Kernel function
A kernel function can be expressed by the following equation:
$$K(X, X') = \phi(X)^T \phi(X')$$
Note that a kernel function should satisfy the “symmetric condition”: $K(X, X') = K(X', X)$.
Here, the important part is the kernel trick.
First, let's take a look at the dual representation, which expresses the parameters in terms of kernels.
3. Chapter 6.1. Dual representation
Dual representations
For basic linear regression with regularization, the error function can be written as
$$J(W) = \frac{1}{2}\sum_{n=1}^{N}\left(W^T\phi(X_n) - t_n\right)^2 + \frac{\lambda}{2}W^T W$$
Here, setting $\nabla J(W) = 0$, we get
$$W = -\frac{1}{\lambda}\sum_{n=1}^{N}\left(W^T\phi(X_n) - t_n\right)\phi(X_n) = \sum_{n=1}^{N} a_n\,\phi(X_n) = \Phi^T a$$
where $a_n = -\frac{1}{\lambda}\left(W^T\phi(X_n) - t_n\right)$, and $\Phi$ is the design matrix whose $n$-th row is $\phi(X_n)^T$.
We can rewrite $J(W)$, replacing $W$ with $W = \Phi^T a$.
Here, let's define the Gram matrix as $K = \Phi\Phi^T$! The objective becomes
$$J(a) = \frac{1}{2}a^T K K a - a^T K \boldsymbol{t} + \frac{1}{2}\boldsymbol{t}^T\boldsymbol{t} + \frac{\lambda}{2}a^T K a$$
Here, setting the gradient $\nabla J(a) = 0$, we get $a = (K + \lambda I_N)^{-1}\boldsymbol{t}$.
Thus, what we get is a solution written entirely in terms of the kernel.
What the heck is this??
We will discuss it in the following part.
4. Chapter 6.1. Dual representation
Value prediction
We can re-write the entire prediction process as $y(X) = \boldsymbol{k}(X)^T (K + \lambda I_N)^{-1}\boldsymbol{t}$, where $\boldsymbol{k}(X)$ has elements $k_n(X) = K(X_n, X)$.
This is an amazing result.
Consider the non-linear mapping function 𝜙(𝑋).
If this function is
$$\phi(X) = \left\{(x_1 + x_2 + \dots + x_n),\ \left(x_1^2 + x_2^2 + \dots + x_n^2\right),\ \dots,\ \left(x_1^{100} + x_2^{100} + \dots + x_n^{100}\right)\right\}$$
can we afford to compute it directly? We don't need to!
That is, we can make the prediction $y(X_n)$ without computing $\phi(X)$.
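To make this concrete, below is a minimal kernel ridge regression sketch of the dual recipe above. The Gaussian kernel, the synthetic sine data, and all parameter values are my own illustrative assumptions, not from the book:

```python
import numpy as np

# Dual-representation ("kernel ridge") regression sketch:
# fit a = (K + lambda I)^-1 t, then predict y(x) = k(x)^T a.

def rbf_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))               # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)    # noisy targets

lam = 0.1                                          # regularization lambda
K = rbf_kernel(X, X)                               # Gram matrix K = Phi Phi^T
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # a = (K + lambda I_N)^-1 t

# Prediction uses only kernel evaluations, never phi(X) itself.
X_new = np.array([[0.5]])
print(rbf_kernel(X_new, X) @ a)                    # y(x) = k(x)^T a
```

Note that training and prediction touch only kernel evaluations against the stored training points, which is exactly why kernel methods are memory-based.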
5. Chapter 6.2. Constructing kernels
Naïve approach
1st method : just compute $\phi(X)$ explicitly!
Let's understand this method via an example!
Inefficient!
Kernel trick
Above is the equation for $\phi$.
But do we need it? No!
We only need the kernel function's value; we don't need the exact $\phi$ values to compute the kernel function!!!
A kernel that corresponds to some inner product $\phi(X)^T\phi(X')$ without requiring the exact value of $\phi(X)$ is called a “valid kernel”.
Condition for a valid kernel :
the Gram matrix $K$ should be positive semi-definite (for every possible choice of input points)!
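As a concrete sketch of how a valid kernel hides a feature map, here is the book's two-dimensional polynomial example, where the kernel value is computed without ever forming $\phi$ explicitly:

$$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)\left(z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2\right)^T = \phi(x)^T\phi(z)$$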
6. Chapter 6.2. Constructing kernels
Useful application
Famous kernel examples :
1. Polynomial kernel
$$K(X, X') = \left(X^T X' + c\right)^M$$
2. Gaussian kernel
$$K(X, X') = \exp\left(-\frac{\lVert X - X' \rVert^2}{2\sigma^2}\right)$$
** proof
$\lVert X - X' \rVert^2 = X^T X + X'^T X' - 2X^T X'$; using this,
$$K(X, X') = \exp\left(-\frac{X^T X}{2\sigma^2}\right)\exp\left(\frac{X^T X'}{\sigma^2}\right)\exp\left(-\frac{X'^T X'}{2\sigma^2}\right),$$
which is a valid kernel by (6.14) & (6.16).
Here, we can replace the inner products with a non-linear kernel $\kappa(X, X')$ again
= a kernel inside a kernel!
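A small sketch of this nesting: the squared feature-space distance is built from the inner kernel $\kappa$ and then exponentiated. The polynomial choice of $\kappa$ and the test points are illustrative assumptions:

```python
import numpy as np

# "Kernel inside a kernel": replace the inner products of the Gaussian
# kernel with a non-linear kernel kappa(x, z).

def kappa(x, z, c=1.0, M=2):
    """Inner kernel: polynomial, (x^T z + c)^M."""
    return (x @ z + c) ** M

def nested_gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-(kappa(x,x) + kappa(z,z) - 2 kappa(x,z)) / (2 sigma^2))."""
    d2 = kappa(x, x) + kappa(z, z) - 2 * kappa(x, z)  # ||phi(x) - phi(z)||^2
    return np.exp(-d2 / (2 * sigma ** 2))

x, z = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(nested_gaussian_kernel(x, z))
```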
7. Chapter 6.2. Constructing kernels
Kernel example : Probabilistic generative model
This was covered in stochastic processes!! It connects to hidden Markov models!
$$K(X, X') = p(X)\,p(X')$$
$$K(X, X') = \sum_i p(X \mid i)\,p(X' \mid i)\,p(i)$$
$$K(X, X') = \int p(X \mid z)\,p(X' \mid z)\,p(z)\,dz \qquad K(X, X') = \sum_Z p(X \mid Z)\,p(X' \mid Z)\,p(Z)$$
Kernel example : Fisher kernel (using Fisher information)
This was covered in Mathematical Statistics 2.
$$g(\theta, X) = \nabla_\theta \ln p(X \mid \theta)$$
$$K(X, X') = g(\theta, X)^T F^{-1}\, g(\theta, X')$$
The reason for dividing by the Fisher information is that
“it makes this kernel invariant under a non-linear re-parameterization” of $\theta$. (Anyone understood?)
In fact, it is really hard to compute the Fisher information matrix!
So, in practice, we use the sample-average approximation $F \simeq \frac{1}{N}\sum_{n=1}^{N} g(\theta, X_n)\,g(\theta, X_n)^T$.
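A hedged sketch of the whole recipe for a toy model: the univariate Gaussian $p(x \mid \theta)$ with $\theta = (\mu, \sigma^2)$, its score function, and the sample-average Fisher estimate below are all illustrative assumptions:

```python
import numpy as np

# Fisher kernel sketch: K(x, x') = g(x)^T F^{-1} g(x'),
# with F approximated by the empirical average of g g^T.

def score(x, mu=0.0, var=1.0):
    """g(theta, x) = grad_theta ln p(x | theta) for N(mu, var)."""
    d_mu = (x - mu) / var
    d_var = ((x - mu) ** 2 - var) / (2 * var ** 2)
    return np.array([d_mu, d_var])

rng = np.random.default_rng(0)
sample = rng.normal(size=500)

# Empirical Fisher information: F ~ (1/N) sum_n g(x_n) g(x_n)^T
G = np.stack([score(x) for x in sample])
F = G.T @ G / len(sample)

def fisher_kernel(x, x_prime):
    return score(x) @ np.linalg.solve(F, score(x_prime))

print(fisher_kernel(0.5, -0.3))
```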
8. Chapter 6.3. Radial basis function
Radial basis function kernel (also known as the Gaussian kernel)
It was originally proposed to produce an almost exact interpolation of the training set.
Many people say, “overfitting is really bad!”. Then, without using a deep neural network, can you make an overfitting model???
Radial basis functions can!
The example below shows the noisy-data case. The idea is very clear to me, but the details are not so clear at all…
This idea will be covered in detail in the following chapter!
9. Chapter 6.3. Radial basis function
Nadaraya-Watson Model
Consider the following component density function: let $f(x, t)$ be the component density function!
(figure source: https://prateekvjoshi.com/2013/06/29/gaussian-mixture-models/)
Then, what we want to compute is $y(X) = E(t \mid X)$.
Why? Because we want to generate a prediction for the given input!
Here, we assume the joint density is the Parzen estimate $p(x, t) = \frac{1}{N}\sum_{n=1}^{N} f(x - x_n,\ t - t_n)$.
Working out the conditional expectation, our prediction becomes a kernel-weighted mean of the training targets, $y(x) = \sum_n k(x, x_n)\,t_n$, with weights that sum to one!
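A minimal sketch of the resulting estimator with Gaussian components; the data, the bandwidth h, and the test point are illustrative assumptions:

```python
import numpy as np

# Nadaraya-Watson: y(x) = sum_n k(x, x_n) t_n with normalized weights.

def nadaraya_watson(x, X_train, t_train, h=0.3):
    g = np.exp(-((x - X_train) ** 2) / (2 * h ** 2))  # g(x - x_n)
    k = g / g.sum()                                   # weights sum to one
    return k @ t_train

rng = np.random.default_rng(0)
X_train = np.linspace(0, 1, 50)
t_train = np.sin(2 * np.pi * X_train) + 0.1 * rng.normal(size=50)

print(nadaraya_watson(0.25, X_train, t_train))  # near sin(pi/2) = 1
```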
10. Chapter 6.3. Radial basis function
Nadaraya-Watson Model
Obviously, not only the point prediction but also the predictive distribution can be generated!
An extension of this model is the Gaussian mixture model,
one of the most famous clustering methods together with K-means!!
11. Chapter 6.4. Gaussian process
Idea of gaussian process in linear regression
A simple linear regression model can be expressed as $y(X, W) = W^T\phi(X)$.
From this, we generate the predictive distribution $p(t \mid X)$.
In the Gaussian process model, we do not get help from $W$. We use a distribution over the functions directly.
The meaning of this will be covered soon!!
Basic linear regression : $y(X) = W^T\phi(X)$ / In matrix form : $\boldsymbol{Y} = \Phi W$
Prior : $p(W) = N(W \mid 0,\ \alpha^{-1} I)$
Combining these two, we can derive…
$$E[\boldsymbol{Y}] = \Phi\,E[W] = 0$$
$$\mathrm{Cov}[\boldsymbol{Y}] = E[\boldsymbol{Y}\boldsymbol{Y}^T] = \Phi\,E[WW^T]\,\Phi^T = \alpha^{-1}\Phi\Phi^T = K, \quad \text{since } E[WW^T] = \mathrm{cov}(W) = \alpha^{-1}I$$
From this, we can find the probability distribution over functions $y(X)$.
Note that we are setting the mean to zero in the prior distribution!
One popular choice for the kernel is the Gaussian kernel introduced earlier.
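To see what a “distribution over functions” looks like, here is a sketch that draws sample functions from the prior $p(\boldsymbol{y}) = N(\boldsymbol{y} \mid 0, K)$. The Gaussian kernel, the input grid, and the jitter term are illustrative assumptions:

```python
import numpy as np

# Sample functions from the GP prior p(y) = N(y | 0, K).

def gaussian_kernel(X, sigma=0.5):
    sq = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq / (2 * sigma ** 2))

X = np.linspace(-1, 1, 100)
K = gaussian_kernel(X) + 1e-8 * np.eye(len(X))  # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
# Each row of `samples` is one draw of the random function y(.) on the grid.
```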
12. Chapter 6.4. Gaussian process
Idea of gaussian process in linear regression
Linear regression that is familiar to us!
$t_n = y_n + \epsilon_n$, where $y_n = y(X_n)$; then the noise distribution is
$$p(t_n \mid y_n) = N(t_n \mid y_n,\ \beta^{-1}) \;\Longrightarrow\; p(\boldsymbol{t} \mid \boldsymbol{y}) = N(\boldsymbol{t} \mid \boldsymbol{y},\ \beta^{-1}\boldsymbol{I}_N)$$
As we covered in the last section, the distribution of $\boldsymbol{y}$ is given as $p(\boldsymbol{y}) = N(\boldsymbol{y} \mid 0, K)$.
Since we know $p(\boldsymbol{t} \mid \boldsymbol{y})$ and $p(\boldsymbol{y})$, we can derive $p(\boldsymbol{t})$ by marginalizing: $p(\boldsymbol{t}) = \int p(\boldsymbol{t} \mid \boldsymbol{y})\,p(\boldsymbol{y})\,d\boldsymbol{y} = N(\boldsymbol{t} \mid 0, C)$ with $C = K + \beta^{-1}I_N$.
Then, we have to choose the kernel $K$.
The kernel should satisfy: if inputs $x$ are similar, then the corresponding outputs should also be similar!
One famous kernel is $k(x_n, x_m) = \theta_0 \exp\left(-\frac{\theta_1}{2}\lVert x_n - x_m \rVert^2\right) + \theta_2 + \theta_3\, x_n^T x_m$, and we have to estimate $\theta_0, \dots, \theta_3$.
13. Chapter 6.4. Gaussian process
Making predictions with a Gaussian process
To generate a prediction, we estimate
$p(t_{N+1} \mid \boldsymbol{t}_N)$; note that the inputs $X$ also appear in the conditional term, but I am going to omit them here.
The above term can be computed as
$$p(t_{N+1} \mid \boldsymbol{t}_N) = \frac{p(\boldsymbol{t}_{N+1})}{p(\boldsymbol{t}_N)}$$
$p(\boldsymbol{t}_{N+1})$ : the joint probability of the observed pairs and the newly observed $X$.
$p(\boldsymbol{t}_N)$ : the probability of the observed pairs.
Here, $C_N$ is the covariance matrix of $p(\boldsymbol{t}_N)$, and $\boldsymbol{k}$ has elements $k_n = K(X_n, X_{N+1})$ / note that $\boldsymbol{k}$ is an $(N \times 1)$ column vector!
Since $p(\boldsymbol{t}) = N(\boldsymbol{t} \mid 0, C)$, the final predictive distribution is the Gaussian
$$p(t_{N+1} \mid \boldsymbol{t}) = N\left(\boldsymbol{k}^T C_N^{-1}\boldsymbol{t},\ \ c - \boldsymbol{k}^T C_N^{-1}\boldsymbol{k}\right)$$
Note that a covariance matrix must be positive semi-definite, so the kernel matrix's eigenvalues satisfy $\lambda_i \ge 0$ (eigenvalues equal to zero are fine, because $\beta > 0$ keeps $C = K + \beta^{-1}I_N$ invertible!)
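A minimal NumPy sketch of the predictive mean and variance above; the kernel, the noise precision $\beta$, and the synthetic data are illustrative assumptions:

```python
import numpy as np

# GP regression prediction: mean = k^T C_N^-1 t, var = c - k^T C_N^-1 k.

def kern(A, B, sigma=0.5):
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=20)
t = np.sin(X) + 0.1 * rng.normal(size=20)
beta = 100.0                                   # noise precision

C_N = kern(X, X) + np.eye(len(X)) / beta       # C_N = K + beta^-1 I_N
x_new = np.array([0.0])
k = kern(X, x_new)[:, 0]                       # k_n = K(X_n, X_new)
c = kern(x_new, x_new)[0, 0] + 1 / beta

mean = k @ np.linalg.solve(C_N, t)             # k^T C_N^-1 t
var = c - k @ np.linalg.solve(C_N, k)          # c - k^T C_N^-1 k
print(mean, var)
```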
14. Chapter 6.4. Gaussian process
Making predictions with a Gaussian process
It is very interesting that the predictive distribution is affected by the distribution of $X$!
a. Where x is densely distributed, the confidence interval is relatively narrow!
b. Where x is sparsely distributed, the confidence interval is relatively wide!
This phenomenon reflects our intuition amazingly well!!!
15. Chapter 6.4. Gaussian process
Interpretation of ARD
A large scale parameter for a feature indicates that the feature has a significant influence on the kernel value, and hence on the output.
That means differences in that feature's value are important!
Thus, this can be used from a feature-importance point of view!
The left plot indicates
$x_1$ : the most significant feature!
$x_2$ : relatively less significant, and there is some noise.
$x_3$ : almost no effect!
The equation above is the generalized (multi-dimensional) version.
16. Chapter 6.4. Gaussian process
Estimation of 𝜽 in a kernel.
Naïve method : use MLE for $p(\boldsymbol{t} \mid \theta)$.
Note that the maximizer of this likelihood cannot be expressed in closed form.
Furthermore, we cannot guarantee that the objective is convex; there may be multiple maxima.
Automatic relevance determination
By introducing an additional scale parameter per input dimension, we can estimate which features are significant for the model's prediction.
Let's see how the distribution of kernel values changes as we perturb the value of $\eta$:
(left panel: $\eta_1 = \eta_2 = 1$ / right panel: $\eta_1 = 1$ & $\eta_2 = 0.01$)
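A tiny sketch of a simplified ARD-style kernel, $k(x, x') = \exp\left(-\frac{1}{2}\sum_i \eta_i (x_i - x'_i)^2\right)$; the input points and $\eta$ values are illustrative assumptions, with the second call mimicking the right panel's $\eta_2 = 0.01$:

```python
import numpy as np

# ARD-style kernel: one scale eta_i per input dimension.

def ard_kernel(x, x_prime, eta):
    """k(x, x') = exp(-0.5 * sum_i eta_i * (x_i - x'_i)^2)."""
    return np.exp(-0.5 * np.sum(eta * (x - x_prime) ** 2))

x, xp = np.array([1.0, 1.0]), np.array([1.5, 1.5])
print(ard_kernel(x, xp, eta=np.array([1.0, 1.0])))   # both features matter
print(ard_kernel(x, xp, eta=np.array([1.0, 0.01])))  # x2 is nearly ignored
```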
17. Chapter 6.4. Gaussian process
Classification
For classification, we should generate a value in $(0, 1)$. Things go in a similar manner!
First, we are not directly using the parameter vector $W$. (It was $p(X) = \frac{1}{1 + e^{-W^T X}}$, but we are not estimating this anymore!)
Instead, we directly generate a distribution over $\boldsymbol{a}_{N+1} = (a_1, a_2, \dots, a_{N+1})$; note that this vector includes the $(N+1)$-th entry for the new input!
Also, we no longer have a noise term. (For regression, the equation was $t_n = y_n + \epsilon_n$; this epsilon does not exist anymore!!)
(The slide compares the regression case and the classification case side by side.)
Here, $p(t_{N+1} = 1 \mid a_{N+1}) = \sigma(a_{N+1})$.
Here, the integral $p(t_{N+1} \mid \boldsymbol{t}_N) = \int p(t_{N+1} \mid a_{N+1})\,p(a_{N+1} \mid \boldsymbol{t}_N)\,da_{N+1}$ is relatively hard.
Thus, we are using an approximation.
But first, we have to estimate $p(a_{N+1} \mid \boldsymbol{t}_N)$.
18. Chapter 6.4. Gaussian process
Laplace approximation with classification
For classification, a similar method is applied.
For regression, the predictive distribution was given as
$$p(t_{N+1} \mid \boldsymbol{t}) = N\left(\boldsymbol{k}^T C_N^{-1}\boldsymbol{t},\ \ c - \boldsymbol{k}^T C_N^{-1}\boldsymbol{k}\right)$$
Now, we should compute $p(\boldsymbol{a}_N \mid \boldsymbol{t}_N)$. This is our quantity of interest!
It is hard to derive the exact form of this probability, so we are using the Laplace approximation!
Terms that do not involve $\boldsymbol{a}_N$ can be thrown away; the remaining term is $\ln p(\boldsymbol{a}_N \mid \boldsymbol{t}_N) = \Psi(\boldsymbol{a}_N) + \text{const}$, where $\Psi(\boldsymbol{a}_N) = \ln p(\boldsymbol{a}_N) + \ln p(\boldsymbol{t}_N \mid \boldsymbol{a}_N)$.
19. Chapter 6.4. Gaussian process
Laplace approximation with classification
The entire process is really complicated! So, let's take a deep breath and check what we are doing!
We are trying to compute $p(a_{N+1} \mid \boldsymbol{t}_N) = \int p(a_{N+1} \mid \boldsymbol{a}_N)\,p(\boldsymbol{a}_N \mid \boldsymbol{t}_N)\,d\boldsymbol{a}_N$.
But this integral is very hard, so we approximate the inner pdf by a Gaussian using the Laplace approximation!
Here, we know $p(a_{N+1} \mid \boldsymbol{a}_N)$, but we don't know $p(\boldsymbol{a}_N \mid \boldsymbol{t}_N)$. So, we approximate that factor with a Gaussian!
To use the Laplace approximation, we need the mode. We find the maximum using the first-order derivative!
Note that $\boldsymbol{\sigma}_N$ depends on $\boldsymbol{a}_N$, so this equation cannot be solved in closed form… Again, we proceed iteratively (Newton-Raphson).
Here, $W_N$ denotes the diagonal matrix whose entries are $\sigma(a_n)\left(1 - \sigma(a_n)\right)$.
Note that the diagonal entries of $W_N$ lie in $(0, 0.25]$ and $C_N^{-1}$ is positive definite, so $A = -\nabla\nabla\Psi(\boldsymbol{a}_N) = W_N + C_N^{-1}$ is positive definite!
Thus, $p(\boldsymbol{a}_N \mid \boldsymbol{t}_N)$ is log-concave, with a single mode that the iteration can reach.
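A sketch of this mode search; the update follows the Newton-Raphson form from the book (Eq. 6.83), while the kernel matrix, targets, and jitter below are illustrative stand-ins:

```python
import numpy as np

# Newton-Raphson for the mode a_N* of p(a_N | t_N):
# a <- C_N (I + W C_N)^-1 (t - sigma + W a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def find_mode(C_N, t, n_iter=50):
    N = len(t)
    a = np.zeros(N)
    for _ in range(n_iter):
        sigma = sigmoid(a)
        W = np.diag(sigma * (1 - sigma))  # W_N, diagonal, entries in (0, 0.25]
        a = C_N @ np.linalg.solve(np.eye(N) + W @ C_N, t - sigma + W @ a)
    return a

rng = np.random.default_rng(0)
X = rng.normal(size=5)
C_N = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(5)
t = np.array([1, 0, 1, 1, 0], dtype=float)
a_star = find_mode(C_N, t)
print(sigmoid(a_star))  # fitted class probabilities at the training points
```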
20. Chapter 6.4. Gaussian process
Laplace approximation with classification
So, we can reach the mode $\boldsymbol{a}_N^\star$ by the Newton-Raphson method!
Approximation!
$$p(\boldsymbol{a}_N \mid \boldsymbol{t}_N) \approx N\left(\boldsymbol{a}_N \mid \boldsymbol{a}_N^\star,\ H^{-1}\right), \quad H = W_N + C_N^{-1}$$
Our goal : $p(a_{N+1} \mid \boldsymbol{t}_N)$
From this, we can identify the pieces of the chapter-2 Gaussian marginalization formula:
$$A = \boldsymbol{k}^T C_N^{-1}, \qquad b = 0, \qquad \mu = \boldsymbol{a}_N^\star = C_N(\boldsymbol{t}_N - \boldsymbol{\sigma}_N), \qquad \Lambda^{-1} = H^{-1}, \qquad L^{-1} = c - \boldsymbol{k}^T C_N^{-1}\boldsymbol{k}$$
Putting in all these values, we finally obtain the Gaussian distribution with mean and variance
$$E[a_{N+1} \mid \boldsymbol{t}_N] = \boldsymbol{k}^T(\boldsymbol{t}_N - \boldsymbol{\sigma}_N), \qquad \mathrm{var}[a_{N+1} \mid \boldsymbol{t}_N] = c - \boldsymbol{k}^T\left(W_N^{-1} + C_N\right)^{-1}\boldsymbol{k}$$
21. Chapter 6.4. Gaussian process
Finding the $\theta$ values in the kernel function
Note that we still have parameters in the kernel function.
In order to estimate $\theta$, we use the MLE method.
Approximation! Under the Laplace approximation, the log marginal likelihood is given as
$$\ln p(\boldsymbol{t}_N \mid \theta) \approx \Psi(\boldsymbol{a}_N^\star) - \frac{1}{2}\ln\left|W_N + C_N^{-1}\right| + \frac{N}{2}\ln(2\pi)$$
They calculate the gradient term by term. (Is it even possible…?)
$\Psi(\boldsymbol{a}_N^\star)$ can be expressed by
$$\Psi(\boldsymbol{a}_N^\star) = -\frac{1}{2}\boldsymbol{a}_N^{\star T} C_N^{-1}\boldsymbol{a}_N^\star - \frac{1}{2}\ln|C_N| + \boldsymbol{t}_N^T\boldsymbol{a}_N^\star + \dots$$
By using the terms above, the gradient with respect to $\theta$ is assembled.
In fact,
I cannot understand how they are being combined!! Did anyone understand??