Chapter 6
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 6. Kernel function
Memory based
Consider the models we covered in chapters 3 & 4.
We tried to estimate $W$, $p(C_k \text{ or } t \mid W, X)$, or the posterior distribution in the Bayesian setting.
That is, once the parameters or their distribution have been estimated, the process is over and we no longer need the training data.
Now, recall the nearest-neighbor method.
Nearest neighbors requires the entire training set not only for training but also in the prediction phase.
Likewise, kernel methods also require the training data points again and again!
The reason will be explained soon.
Kernel function
A kernel function can be expressed by the following equation:
$$k(X, X') = \phi(X)^T \phi(X')$$
Note that a kernel function should satisfy the "symmetry condition": $k(X, X') = k(X', X)$.
The important part here is the kernel trick.
First, let's take a look at the dual representation, which expresses the parameters in terms of kernels.
Chapter 6.1. Dual representation
Dual representations
For basic linear regression with regularization, the error function can be written as
$$J(W) = \frac{1}{2}\sum_{n=1}^{N}\bigl(W^T\phi(X_n) - t_n\bigr)^2 + \frac{\lambda}{2}W^T W$$
Here, setting $\nabla J(W) = 0$, we get
$$W = -\frac{1}{\lambda}\sum_{n=1}^{N}\bigl(W^T\phi(X_n) - t_n\bigr)\phi(X_n) = \sum_{n=1}^{N} a_n\,\phi(X_n) = \Phi^T a$$
$$a_n = -\frac{1}{\lambda}\bigl(W^T\phi(X_n) - t_n\bigr), \qquad \Phi:\ \text{design matrix whose } n\text{th row is } \phi(X_n)^T$$
We can rewrite $J(W)$ by replacing $W$ with $W = \Phi^T a$.
Here, let's define the Gram matrix as $K = \Phi\Phi^T$!
Setting the gradient $\nabla J(a) = 0$, we get $a = (K + \lambda I_N)^{-1}\mathbf{t}$.
Thus, what we get is the prediction written entirely in terms of the kernel:
$$y(X) = \mathbf{k}(X)^T (K + \lambda I_N)^{-1}\mathbf{t}, \qquad k_n(X) = k(X_n, X)$$
What the heck is this?? We will discuss it in the following part.
Chapter 6.1. Dual representation
Value prediction
We can re-write the entire process by…
This is a remarkable result.
Consider a non-linear mapping function $\phi(X)$.
Suppose this function is
$$\phi(X) = \bigl\{(x_1 + x_2 + \cdots + x_n),\ (x_1^2 + x_2^2 + \cdots + x_n^2),\ \ldots,\ (x_1^{100} + x_2^{100} + \cdots + x_n^{100})\bigr\}$$
Can we compute this directly? It would be very expensive.
That is, we can make the prediction $y(X_n)$ without computing $\phi(X)$.
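To make the dual representation concrete, here is a minimal NumPy sketch of kernel ridge regression (my own illustration; the polynomial kernel, the toy data, and all names are example choices): the dual coefficients are $a = (K + \lambda I_N)^{-1}\mathbf{t}$ and the prediction only ever evaluates the kernel, never $\phi(X)$.

```python
import numpy as np

def poly_kernel(X1, X2, c=1.0, M=3):
    """Polynomial kernel k(x, x') = (x^T x' + c)^M, evaluated for all pairs."""
    return (X1 @ X2.T + c) ** M

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))                   # toy 1-D inputs
t = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1                                              # regularization strength lambda
K = poly_kernel(X, X)                                  # Gram matrix K = Phi Phi^T (implicitly)
a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # dual coefficients a = (K + lambda I)^-1 t

# Prediction at new points: y(x) = k(x)^T a, no explicit phi(x) needed
X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
print(poly_kernel(X_new, X) @ a)
```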
Chapter 6.2. Constructing kernels
Naïve approach
1st method : Just compute 𝜙(𝑋)!
Let's understand this method via an example!
Inefficient!
Kernel trick
The equation above gives $\phi$ explicitly.
But do we need it? No!
We only need the value of the kernel function; we don't
need the exact value of $\phi$ to compute the kernel function!!!
A kernel that can be evaluated without the exact value of $\phi(X)$ (i.e., one that corresponds to some feature space) is called a "valid kernel".
Condition for a valid kernel:
the Gram matrix K should be positive semi-definite for all possible choices of input points!
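As a quick numerical sanity check (a minimal sketch of my own, not from the slides), we can build a Gram matrix from random points and confirm that its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# Gram matrix of the polynomial kernel k(x, x') = (x^T x' + 1)^3
K = (X @ X.T + 1.0) ** 3

# A valid kernel gives a positive semi-definite Gram matrix for any inputs,
# so the smallest eigenvalue should be >= 0 (up to floating-point error).
print(np.linalg.eigvalsh(K).min())
```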
Chapter 6.2. Constructing kernels
Useful application
Famous kernel examples:
1. Polynomial kernel
$$k(X, X') = (X^T X' + c)^M$$
2. Gaussian kernel
$$k(X, X') = \exp\!\left(-\frac{\|X - X'\|^2}{2\sigma^2}\right)$$
** Proof that it is valid:
$$\|X - X'\|^2 = X^T X + X'^T X' - 2X^T X'$$
Using this,
$$k(X, X') = \exp\!\left(-\frac{X^T X}{2\sigma^2}\right)\exp\!\left(\frac{X^T X'}{\sigma^2}\right)\exp\!\left(-\frac{X'^T X'}{2\sigma^2}\right),$$
which is a valid kernel by (6.14) & (6.16).
Here, we can replace the inner product $X^T X'$ with a non-linear kernel again
= a kernel inside a kernel!
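To see the kernel trick at work, here is a small NumPy check (my own illustration for the 2-D case) that the polynomial kernel $(x^T x' + c)^2$ really equals the inner product of an explicit feature map:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map whose inner product reproduces (x^T x' + c)^2 for 2-D x."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])

lhs = (x @ y + 1.0) ** 2      # kernel trick: never builds the 6-D feature vector
rhs = phi(x) @ phi(y)         # explicit feature-space inner product
print(np.isclose(lhs, rhs))   # True
```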
Chapter 6.2. Constructing kernels
Kernel example : Probabilistic generative model
This was covered in stochastic process!! Hidden Markov chain!
$$k(X, X') = p(X)\,p(X')$$
$$k(X, X') = \sum_i p(X \mid i)\,p(X' \mid i)\,p(i)$$
$$k(X, X') = \int p(X \mid z)\,p(X' \mid z)\,p(z)\,dz \qquad k(X, X') = \sum_{Z} p(X \mid Z)\,p(X' \mid Z)\,p(Z)$$
Kernel example : Fisher kernel (using Fisher information)
This was covered in mathematical statistics 2.
$$g(\theta, X) = \nabla_\theta \ln p(X \mid \theta)$$
$$k(X, X') = g(\theta, X)^T F^{-1}\, g(\theta, X')$$
The reason for dividing by the Fisher information is that "it makes this kernel invariant under a non-linear reparameterization" of $\theta$. (Anyone understood?)
In fact, it is really hard to compute the Fisher information matrix!
So, we use an approximation, e.g. the sample average $F \simeq \frac{1}{N}\sum_n g(\theta, X_n)\,g(\theta, X_n)^T$.
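As a tiny illustration (my own sketch, not from the slides): for a univariate Gaussian $p(x \mid \mu)$ with known variance $\sigma^2$, the score is $g(\mu, x) = (x - \mu)/\sigma^2$ and the Fisher information is $F = 1/\sigma^2$, so the Fisher kernel reduces to $k(x, x') = (x - \mu)(x' - \mu)/\sigma^2$:

```python
import numpy as np

def fisher_kernel_gaussian(x, x_prime, mu, sigma2):
    """Fisher kernel for a univariate Gaussian with known variance sigma2.

    Score:  g(mu, x) = d/dmu ln N(x | mu, sigma2) = (x - mu) / sigma2
    Fisher: F = 1 / sigma2
    Kernel: k(x, x') = g(x) * F^{-1} * g(x') = (x - mu)(x' - mu) / sigma2
    """
    g, g_prime = (x - mu) / sigma2, (x_prime - mu) / sigma2
    return g * sigma2 * g_prime          # F^{-1} = sigma2

print(fisher_kernel_gaussian(1.5, -0.5, mu=0.0, sigma2=2.0))   # -0.375
```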
Chapter 6.3. Radial basis function
Radial basis kernel (also known as gaussian kernel)
It was originally proposed for (almost) exact interpolation of the training set.
Many people say, "overfitting is really bad!" Then, without using a deep neural network, can you build an overfitting model??? A radial basis function model can!
The example below shows the noisy-input case. The idea is very clear to me, but the details are not so clear at all…
This idea will be covered in detail in
the following part!
Chapter 6.3. Radial basis function
Nadaraya-Watson Model
Consider the following component density function
Here, let $f(x, t)$ be the component
density function!
https://prateekvjoshi.com/2013/06/29/gaussian-mixture-models/
Then, what we want to compute is $y(X) = E[t \mid X]$.
Why? Because we want to generate a prediction for the given input!
Here, we assume
This means that our prediction is a weighted mean of the training targets, where the weights are normalized kernel values!
Chapter 6.3. Radial basis function
Nadaraya-Watson Model
Consider the following component density function
Obviously, not only the point prediction but also the predictive distribution can be generated!
An extension of this model is the Gaussian mixture model,
one of the most famous clustering methods along with K-means!!
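A minimal NumPy sketch of the Nadaraya-Watson estimator with a Gaussian kernel (my own illustration; the bandwidth h and the toy data are arbitrary choices): the prediction is $y(x) = \sum_n k(x, x_n)\,t_n \,/\, \sum_m k(x, x_m)$.

```python
import numpy as np

def nadaraya_watson(x_query, X_train, t_train, h=0.1):
    """Kernel-weighted mean of the targets at each query point."""
    sq = (x_query[:, None] - X_train[None, :]) ** 2   # pairwise squared distances
    w = np.exp(-sq / (2 * h**2))                      # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)                 # normalize weights to sum to one
    return w @ t_train

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 1, 50))
t_train = np.sin(2 * np.pi * X_train) + 0.2 * rng.standard_normal(50)

print(nadaraya_watson(np.linspace(0, 1, 5), X_train, t_train))
```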
Chapter 6.4. Gaussian process
Idea of gaussian process in linear regression
A simple linear regression model can be expressed as $y(X, W) = W^T\phi(X)$.
From this, we generate the predictive distribution $p(t \mid X)$.
In a Gaussian process model, we do not go through $W$; we use the function directly.
The meaning of this will be covered soon!!
Basic linear regression: $y(X) = W^T\phi(X)$ / in matrix form: $\mathbf{Y} = \Phi W$
Prior: $p(W) = N(W \mid 0, \alpha^{-1} I)$
Combining these two, we can derive
$$E[Y] = \Phi\,E[W] = 0$$
$$\mathrm{Cov}[Y] = E[YY^T] = \Phi\,E[WW^T]\,\Phi^T = \alpha^{-1}\Phi\Phi^T = K, \quad \text{since } E[WW^T] = \mathrm{cov}(W)$$
From this, we can find the probability distribution over functions $y(X)$.
Note that we are setting the mean to zero in the prior distribution!
One popular choice for the kernel is a Gaussian kernel such that…
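A minimal sketch (my own illustration) of drawing sample functions from a zero-mean Gaussian process prior with a Gaussian kernel; each sample is one function evaluated on a grid:

```python
import numpy as np

def gaussian_kernel(X1, X2, length_scale=0.3):
    sq = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-sq / (2 * length_scale**2))

x = np.linspace(0, 1, 100)
K = gaussian_kernel(x, x)

# Each row is one function y(x) ~ N(0, K); the jitter keeps K numerically PSD
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)), size=3)
print(samples.shape)   # (3, 100): three sampled functions on 100 grid points
```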
Chapter 6.4. Gaussian process
Idea of gaussian process in linear regression
Linear regression that is familiar to us!
$t_n = y_n + \epsilon_n$, where $y_n = y(X_n)$; then the noise model is
$$p(t_n \mid y_n) = N(t_n \mid y_n, \beta^{-1}) \;\Longrightarrow\; p(\mathbf{t} \mid \mathbf{y}) = N(\mathbf{t} \mid \mathbf{y}, \beta^{-1}\mathbf{I}_N)$$
As we covered in the last section, the distribution of $\mathbf{y}$ is given as $p(\mathbf{y}) = N(\mathbf{y} \mid 0, K)$.
Since we know $p(\mathbf{t} \mid \mathbf{y})$ and $p(\mathbf{y})$, we can derive the marginal $p(\mathbf{t})$ by integrating out $\mathbf{y}$, giving $p(\mathbf{t}) = N(\mathbf{t} \mid 0, C)$ with $C = K + \beta^{-1}\mathbf{I}_N$.
Then, we have to choose the kernel $K$.
The kernel should express the idea that if two inputs $x$ are similar, then their function values should also be similar!
One famous kernel is the one below, and we have to estimate $\theta_0 \ldots \theta_3$.
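As I recall, the kernel meant here is the exponential-quadratic form of PRML Eq. (6.63); a hedged sketch (parameter values are arbitrary):

```python
import numpy as np

def kernel_663(x, x_prime, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    """k(x, x') = theta0 * exp(-theta1/2 * ||x - x'||^2) + theta2 + theta3 * x^T x'."""
    sq_dist = np.sum((x - x_prime) ** 2)
    return theta0 * np.exp(-0.5 * theta1 * sq_dist) + theta2 + theta3 * np.dot(x, x_prime)

x, xp = np.array([0.2, 1.0]), np.array([0.3, 0.8])
print(kernel_663(x, xp, theta0=1.0, theta1=4.0, theta2=0.1, theta3=0.5))
```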
Chapter 6.4. Gaussian process
Making prediction with gaussian process
To generate a prediction, we estimate
$p(t_{N+1} \mid \mathbf{t}_N)$; note that the inputs $X$ also belong in the conditioning term, but I am going to omit them here.
The term above can be computed as
$$p(t_{N+1} \mid \mathbf{t}_N) = \frac{p(\mathbf{t}_{N+1})}{p(\mathbf{t}_N)}$$
$p(\mathbf{t}_{N+1})$ : joint probability of the observed pairs and the newly observed $X$
$p(\mathbf{t}_N)$ : probability of the observed pairs
Here, $C_N$ is the covariance matrix of $p(\mathbf{t}_N)$ and $\mathbf{k}$ has elements $k(X_n, X_{N+1})$ / note that $\mathbf{k}$ is an $(N \times 1)$ column vector!
The final predictive distribution is the Gaussian
$$p(t_{N+1} \mid \mathbf{t}) = N\bigl(\mathbf{k}^T C_N^{-1}\mathbf{t},\; c - \mathbf{k}^T C_N^{-1}\mathbf{k}\bigr), \qquad p(\mathbf{t}) = N(\mathbf{t} \mid 0, C)$$
Note that a covariance matrix must be positive semi-definite, so the eigenvalues of $C$ should satisfy $\lambda_i \ge 0$ (and since $\beta^{-1} > 0$ is added to the diagonal, eigenvalues of $K$ equal to zero are fine!).
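Putting the pieces together, a minimal NumPy sketch of GP regression prediction (my own illustration; the RBF kernel and the noise precision beta are arbitrary choices):

```python
import numpy as np

def rbf(X1, X2, ell=0.2):
    sq = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-sq / (2 * ell**2))

rng = np.random.default_rng(0)
beta = 25.0                                          # noise precision
X = np.sort(rng.uniform(0, 1, 20))
t = np.sin(2 * np.pi * X) + rng.standard_normal(20) / np.sqrt(beta)

C_N = rbf(X, X) + np.eye(len(X)) / beta              # C_N = K + beta^{-1} I

x_new = np.array([0.37])
k = rbf(X, x_new)[:, 0]                              # k_n = k(X_n, x_new)
c = rbf(x_new, x_new)[0, 0] + 1.0 / beta

mean = k @ np.linalg.solve(C_N, t)                   # k^T C_N^{-1} t
var = c - k @ np.linalg.solve(C_N, k)                # c - k^T C_N^{-1} k
print(mean, var)
```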
Chapter 6.4. Gaussian process
Making prediction with gaussian process
It is very interesting that the predictive distribution is affected by how the inputs X are distributed!
a. Where x is densely distributed, the confidence interval is relatively narrow!
b. Where x is sparsely distributed, the confidence interval is relatively wide!
This phenomenon reflects our intuition amazingly!!!
Chapter 6.4. Gaussian process
Interpretation of ARD
A large kernel parameter for a feature indicates that the feature has a significant influence on the output value.
That means that differences in that feature's value matter for the prediction!
Thus, ARD can be used as a view of feature importance!
The left plot indicates
𝑥1 : the most significant feature!
𝑥2 : relatively less significant, and there is noise.
𝑥3 : almost no effect!
The equation above is a generalized version.
Chapter 6.4. Gaussian process
Estimation of 𝜽 in a kernel.
Naïve method : Using MLE for 𝑝(𝒕|𝜃)
Note that the maximum of this likelihood cannot be expressed in closed form.
Furthermore, we cannot guarantee that the objective is convex.
Automatic relevance determination
By using an additional parameter per input feature, we can estimate which features are significant for the model's prediction.
Let's see how the distribution of the sampled functions changes
as we perturb the value of $\eta$.
$\eta_1 = \eta_2 = 1$ vs. $\eta_1 = 1$ & $\eta_2 = 0.01$
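A hedged sketch of an ARD kernel with one precision $\eta_i$ per input dimension (the form follows what I recall from PRML Section 6.4.4; parameter values are arbitrary):

```python
import numpy as np

def ard_kernel(x, x_prime, eta, theta0=1.0, theta2=0.0, theta3=0.0):
    """ARD kernel: a separate precision eta_i per input dimension.

    k(x, x') = theta0 * exp(-0.5 * sum_i eta_i * (x_i - x'_i)^2) + theta2 + theta3 * x^T x'
    A small eta_i makes the kernel (and hence the prediction) insensitive to feature i.
    """
    eta = np.asarray(eta, dtype=float)
    quad = np.sum(eta * (x - x_prime) ** 2)
    return theta0 * np.exp(-0.5 * quad) + theta2 + theta3 * np.dot(x, x_prime)

x, xp = np.array([0.5, 1.0]), np.array([0.0, -1.0])
print(ard_kernel(x, xp, eta=[1.0, 1.0]))     # both features influence the kernel
print(ard_kernel(x, xp, eta=[1.0, 0.01]))    # second feature nearly ignored
```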
Chapter 6.4. Gaussian process
Classification
For classification we need to produce a value in (0, 1). Things go in a similar manner!
First, we do not use the parameter vector $W$ directly. (It used to be $p(X) = \frac{1}{1 + e^{-W^T X}}$, but we are not estimating this anymore!)
Instead, we directly generate a distribution over $\mathbf{a}_{N+1} = (a_1, a_2, \ldots, a_{N+1})$; * the subscript $N+1$ means the vector that includes the new input!
Also, we no longer have a noise term. (For regression the equation was $t_n = y_n + \epsilon_n$; this epsilon does not exist anymore!!)
Regression case / Classification case
Here, $p(t_{N+1} = 1 \mid a_{N+1}) = \sigma(a_{N+1})$.
The integral here is relatively hard.
Thus, we use an approximation.
But first, we have to estimate $p(a_{N+1} \mid \mathbf{t}_N)$.
Chapter 6.4. Gaussian process
Laplace approximation with classification
For classification, a similar method is applied.
For regression, the predictive probability was given as
$$p(t_{N+1} \mid \mathbf{t}) = N\bigl(\mathbf{k}^T C_N^{-1}\mathbf{t},\; c - \mathbf{k}^T C_N^{-1}\mathbf{k}\bigr)$$
Now, we should compute $p(\mathbf{a}_N \mid \mathbf{t}_N)$. This is our quantity of interest!
It is hard to derive the exact form of this distribution, so we use the Laplace approximation!
Terms that do not depend on $\mathbf{a}_N$ can be thrown away! The remaining term is… ($\ln p(\mathbf{a}_N \mid \mathbf{t}_N) = \Psi(\mathbf{a}_N)$ up to an additive constant)
Chapter 6.4. Gaussian process
Laplace approximation with classification
The entire process is really complicated! So let's take a deep breath and check what we are doing!
We are trying to compute $p(a_{N+1} \mid \mathbf{t}_N) = \int p(a_{N+1} \mid \mathbf{a}_N)\,p(\mathbf{a}_N \mid \mathbf{t}_N)\,d\mathbf{a}_N$
But this integral is very hard, so we approximate the inner pdf by a Gaussian using the Laplace approximation!
Here, we know $p(a_{N+1} \mid \mathbf{a}_N)$, but we don't know $p(\mathbf{a}_N \mid \mathbf{t}_N)$. So we approximate the latter by a Gaussian!
To use the Laplace approximation we need the mode, so we find the maximum using the first-order derivative!
Note that $\sigma_N$ depends on $\mathbf{a}_N$, so this equation has no closed-form solution… Again we proceed iteratively (Newton-Raphson).
Here, $W_N$ denotes the diagonal matrix whose entries are $\sigma(a_n)(1 - \sigma(a_n))$.
Note that the entries of $W_N$ lie in $(0, 0.25]$ and $C_N^{-1}$ is positive (semi-)definite, so $A = -\nabla\nabla\Psi(\mathbf{a}_N)$ is positive definite!
Thus $-\ln p(\mathbf{a}_N \mid \mathbf{t}_N)$ is convex, so Newton-Raphson can reach the unique minimum, i.e. the posterior mode.
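A minimal sketch (my own, following the iteratively reweighted Newton update that I recall as PRML Eq. (6.83)) of finding the mode of $p(\mathbf{a}_N \mid \mathbf{t}_N)$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def find_mode(C_N, t_N, n_iter=20):
    """Newton-Raphson for the mode of p(a_N | t_N) in GP classification.

    Update: a_new = C_N (I + W_N C_N)^{-1} (t_N - sigma_N + W_N a_N),
    where W_N = diag(sigma_N * (1 - sigma_N)).
    """
    N = len(t_N)
    a = np.zeros(N)
    for _ in range(n_iter):
        sigma = sigmoid(a)
        W = np.diag(sigma * (1.0 - sigma))
        a = C_N @ np.linalg.solve(np.eye(N) + W @ C_N, t_N - sigma + W @ a)
    return a

# Tiny toy example: Gram matrix from three 1-D inputs and binary targets
X = np.array([0.0, 0.5, 1.0])
C_N = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.5) + 1e-6 * np.eye(3)
t_N = np.array([1.0, 1.0, 0.0])
print(find_mode(C_N, t_N))
```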
Chapter 6.4. Gaussian process
Laplace approximation with classification
So, we can reach the mode $\mathbf{a}_N^\star$ by the Newton-Raphson method!
Approximation!
$p(\mathbf{a}_N \mid \mathbf{t}_N) \approx$
Our goal : $p(a_{N+1} \mid \mathbf{t}_N)$
From this, we can derive that
$$A = \mathbf{k}^T C_N^{-1}, \quad b = 0, \quad \mu = C_N(\mathbf{t}_N - \boldsymbol{\sigma}_N), \quad \Lambda^{-1} = H^{-1}, \quad L^{-1} = c - \mathbf{k}^T C_N^{-1}\mathbf{k}$$
Putting all these values in, we get
Finally, we can compute a Gaussian
distribution with mean and variance of
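For reference (the formulas are not visible in the extracted slide; the following is how I recall PRML Eqs. (6.87)-(6.88)), the resulting predictive Gaussian for $a_{N+1}$ is:

```latex
% Predictive Gaussian for a_{N+1} given t_N (quoted from memory, PRML 6.87-6.88)
\mathbb{E}[a_{N+1} \mid \mathbf{t}_N] = \mathbf{k}^T(\mathbf{t}_N - \boldsymbol{\sigma}_N)
\qquad
\operatorname{var}[a_{N+1} \mid \mathbf{t}_N] = c - \mathbf{k}^T\bigl(W_N^{-1} + C_N\bigr)^{-1}\mathbf{k}
```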
Chapter 6.4. Gaussian process
Finding 𝜽 values in kernel function
Approximation!
Density function is given as
Note that we still have parameters $\theta$ in the kernel function.
In order to estimate $\theta$, we use the MLE method.
They calculate the gradient sequentially. (Is it possible…?)
$\Psi(\mathbf{a}_N^\star)$ can be expressed by
$$\Psi(\mathbf{a}_N^\star) = -\frac{1}{2}\mathbf{a}^{\star T} C_N^{-1}\mathbf{a}^\star - \frac{1}{2}\ln|C_N| + \mathbf{t}_N^T\mathbf{a}_N^\star$$
By using the following terms…
In fact,
I cannot understand how they are being
combined!! Did anyone understand??