1. Chapter 3
Pattern Recognition and Machine Learning, Christopher M. Bishop
Reviewer: Sunwoo Kim
Department of Applied Statistics, Yonsei University
2. Chapter 3.1. Basic Linear Regression
1. Common linear regression case: $y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D$
2. Extending to basis functions: $y(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_{M-1}\phi_{M-1}(x)$ (a sketch follows this list)
3. Notable fact: there exists a relationship that…
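To make the basis-function form concrete, here is a minimal NumPy sketch of building a design matrix; the Gaussian bases, centers, and width `s` are illustrative choices, not fixed by the slides.

```python
import numpy as np

def gaussian_design(x, centers, s=0.2):
    """Design matrix with a bias column and Gaussian basis functions:
    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with phi_0(x) = 1."""
    x = np.asarray(x).reshape(-1, 1)                   # (N, 1)
    phi = np.exp(-(x - centers) ** 2 / (2 * s ** 2))   # (N, M-1)
    return np.hstack([np.ones((len(x), 1)), phi])      # prepend phi_0(x) = 1

centers = np.linspace(0, 1, 8)
Phi = gaussian_design(np.random.rand(20), centers)
print(Phi.shape)  # (20, 9): one bias column + 8 Gaussian basis columns
```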
3. Chapter 3.1. Basic Linear Regression
We can place a normal distribution around the regression line: $p(t \mid x, w, \beta) = \mathcal{N}\bigl(t \mid y(x, w), \beta^{-1}\bigr)$.
Thus, we can treat the optimization problem as an MLE task; maximizing this likelihood is equivalent to least squares.
The derivation was covered in undergraduate regression analysis.
4. Chapter 3.1. Basic Linear Regression
Understanding from a geometrical perspective
By definition…
Projection matrix: $H = A\,(A^T A)^{-1} A^T$; $HB$ projects $B$ onto the column space of $A$.
Our estimated value: $\hat{y} = \Phi\,(\Phi^T \Phi)^{-1}\Phi^T \mathbf{t}$; here $H\mathbf{t}$ projects $\mathbf{t}$ onto the column space of $\Phi$.
(Figure: the green vector $\mathbf{t}$ is the target value; the blue vector $y$ is the estimated value, i.e. the projection $\hat{y}$.)
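A small NumPy sketch of this projection view (toy data; names are illustrative): solve the normal equations, then check that the hat matrix applied to $\mathbf{t}$ reproduces the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))                    # design matrix
t = rng.normal(size=20)                           # target vector

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # normal equations

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T      # projection (hat) matrix
assert np.allclose(H @ t, Phi @ w_ml)             # Ht = fitted values y
```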
Sequential update of linear regression: applying stochastic gradient descent to the sum-of-squares error gives
$$w^{(\tau+1)} = w^{(\tau)} + \eta\,\bigl(t_n - w^{(\tau)T}\phi(x_n)\bigr)\,\phi(x_n).$$
A familiar form! Just like gradient descent.
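A minimal sketch of this sequential (LMS) update on toy data; the step size `eta` and the data are illustrative.

```python
import numpy as np

def lms_step(w, phi_n, t_n, eta=0.05):
    return w + eta * (t_n - w @ phi_n) * phi_n    # one stochastic-gradient step

rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w = np.zeros(3)
for phi_n, t_n in zip(Phi, t):                    # one pass, sample by sample
    w = lms_step(w, phi_n, t_n)
print(w)  # drifts toward [1.0, -2.0, 0.5]
```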
5. Chapter 3.1. Basic Linear Regression
Regularization
Prevents overfitting; the quadratic penalty is also called weight decay.
Most common is L2 regularization (ridge):
$$L = \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - w^T \phi(x_n)\bigr)^2 + \frac{\lambda}{2}\, w^T w$$
The L1 counterpart (lasso) replaces the quadratic penalty:
$$L = \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - w^T \phi(x_n)\bigr)^2 + \frac{\lambda}{2}\sum_{j}\lvert w_j \rvert$$
(Figures: loss contours for each penalty, marking the minimum with and without the penalty.)
Theoretically, L1 regularization (lasso) tends to shrink weights harder, driving some exactly to zero, which yields a sparse solution.
But the absolute-value penalty makes first- and second-order derivatives unavailable at zero, so there is no closed form; we use numerical optimization for lasso.
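For reference, a minimal ridge sketch, since only the L2 case has a closed form (the function name is illustrative):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

For lasso one would call an iterative solver instead, e.g. the coordinate descent used inside scikit-learn's `Lasso`.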
6. Chapter 3.1. Basic Linear Regression
Multiple outputs
This is a very interesting part.
If we fit a linear regression with multiple outputs, how can we estimate the values?
e.g. with $x_1, x_2, \ldots, x_p$, we predict house price and house year at the same time!
The solution $W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T$ indicates that even when we predict multiple outputs, we use the same design matrix and only change the target, now a matrix $T$ with one column per output.
Geometrically, this means we project each column vector of $T$ onto $\Phi$'s column space.
We get the same result if we calculate the outputs separately, since the columns of $T$ decouple in the solution, as verified in the sketch below.
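A quick NumPy check of this decoupling claim (toy data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 4))        # shared design matrix
T = rng.normal(size=(50, 2))          # two output columns (e.g. price, year)

# Joint fit: W has one column of weights per output
W_joint = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Separate fits, one output at a time
W_sep = np.column_stack([
    np.linalg.solve(Phi.T @ Phi, Phi.T @ T[:, k]) for k in range(T.shape[1])
])
assert np.allclose(W_joint, W_sep)    # identical results
```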
7. Chapter 3.3. Bayesian linear regression
Prior & posterior of regression
Now we assume a probability distribution over the weights (parameters).
Let's consider the simple conjugate prior for a normal pdf: we assume the parameter $w$ follows a normal distribution!
To make the entire process as simple as we can, we assume this simpler conjugate prior of the normal distribution, so that prior, likelihood, and posterior are all normal (normal / normal / normal). The posterior mean is then a weighted average of the prior mean and the MLE, as shown next.
The posterior is $p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_N, S_N)$ with
$$m_N = S_N\bigl(S_0^{-1} m_0 + \beta\, \Phi^T \mathbf{t}\bigr), \qquad S_N^{-1} = S_0^{-1} + \beta\, \Phi^T \Phi.$$
Note that $\mathrm{Var}(w_{ML}) = \beta^{-1}(\Phi^T \Phi)^{-1}$ and $w_{ML} = (\Phi^T \Phi)^{-1}\Phi^T \mathbf{t}$, so the data term can be rewritten as
$$\beta\, \Phi^T \mathbf{t} = \beta\,(\Phi^T \Phi)\,(\Phi^T \Phi)^{-1}\Phi^T \mathbf{t} = \beta\,(\Phi^T \Phi)\, w_{ML}.$$
In $m_N$, the term $S_0^{-1} m_0$ is the weighted prior mean and $\beta\,(\Phi^T \Phi)\, w_{ML}$ is the weighted MLE mean.
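A minimal sketch of the posterior computation, assuming the zero-mean isotropic prior $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$ used in the following slides (names are illustrative):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```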
8. Chapter 3.3. Bayesian linear regression
Intrinsic regularization of Bayesian regression
We know that likelihood × prior is proportional to the posterior.
Let's reconsider the posterior from this point of view:
$$\ln p(w \mid \mathbf{t}) = \ln \exp\Bigl(-\frac{\beta}{2}\sum_{n=1}^{N}\bigl(t_n - w^T \phi(x_n)\bigr)^2\Bigr) + \ln \exp\Bigl(-\frac{\alpha}{2}\, w^T w\Bigr) + C, \quad \text{where } C \text{ is a constant.}$$
$$\therefore\; \ln p(w \mid \mathbf{t}) \approx -\frac{\beta}{2}\sum_{n=1}^{N}\bigl(t_n - w^T \phi(x_n)\bigr)^2 - \frac{\alpha}{2}\, w^T w.$$
Even though we did not intend to include regularization, the prior itself acts as a regularizer! Maximizing this log posterior is exactly L2-regularized least squares with $\lambda = \alpha/\beta$, as checked below.
The figure shows the sequential updating of the posterior (which serves as the prior for the next observation).
We can see the variance of the distribution gradually shrink as data accumulate.
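A quick numerical check of this equivalence (toy data; the values of $\alpha$ and $\beta$ are illustrative): the posterior mean matches the ridge solution with $\lambda = \alpha/\beta$.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
alpha, beta = 2.0, 25.0

# Posterior mean (MAP) under the zero-mean isotropic prior
M = Phi.shape[1]
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Ridge solution with lambda = alpha / beta
w_ridge = np.linalg.solve((alpha / beta) * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
assert np.allclose(m_N, w_ridge)   # the prior is the regularizer
```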
9. Chapter 3.3. Bayesian linear regression
Predictive distribution of Bayesian linear regression
To get the predicted value, we don't need the parameter distribution itself; we only need certain estimated quantities, like the Bayes estimator. The derivation of the following equation will be covered in Chapter 8:
$$p(t \mid x, \mathbf{t}) = \mathcal{N}\bigl(t \mid m_N^T \phi(x),\; \sigma_N^2(x)\bigr), \qquad \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\, \phi(x).$$
The important thing is that as $N \to \infty$, the posterior variance $\phi(x)^T S_N \phi(x)$ converges to zero, and only the noise-variance term $1/\beta$ is left.
(Figure panels: fitted line / generated samples.)
10. Chapter 3.3. Bayesian linear regression
Predictive distribution of Bayesian linear regression
Since we have studied linear regression entirely from the frequentist perspective so far, this process can feel really tricky.
Thus, let's implement the entire process in Python, as sketched below.
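Here is one way the whole pipeline might look: a minimal sketch assuming Gaussian bases and Bishop's noisy-sine toy data; all names and hyperparameters are illustrative.

```python
import numpy as np

def design(x, centers, s=0.1):
    """Bias column plus Gaussian basis functions."""
    x = np.asarray(x).reshape(-1, 1)
    phi = np.exp(-(x - centers) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

alpha, beta = 2.0, 25.0                        # prior and noise precision
centers = np.linspace(0, 1, 9)

# Synthetic data: noisy sine
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)

# Posterior N(w | m_N, S_N)
Phi = design(x, centers)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive mean and variance on a grid
x_new = np.linspace(0, 1, 100)
Phi_new = design(x_new, centers)
mean = Phi_new @ m_N
var = 1 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)

# Plausible fitted curves, sampled from the posterior over w
w_samples = rng.multivariate_normal(m_N, S_N, size=5)
curves = Phi_new @ w_samples.T                 # (100 grid points, 5 curves)
```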
11. Chapter 3.3. Bayesian linear regression
Equivalent kernel and its insight
Let's talk about the kernel. First, the predicted value of Bayesian regression can be written as
$$y(x, m_N) = m_N^T \phi(x) = \sum_{n=1}^{N} k(x, x_n)\, t_n, \qquad k(x, x') = \beta\, \phi(x)^T S_N\, \phi(x').$$
This function $k$ is called the smoother matrix or the equivalent kernel.
What does this kernel indicate?
It gives an important intuition about linear regression from the perspective of a "weighted average of neighbors".
You can see that the kernel acts as a similarity measure, and it multiplies the observed target values $t_n$.
What does that mean? It shows that the estimate is a weighted mean of the observed target values: the kernel, as a similarity measure, gives more weight to observations whose inputs are more similar to the query point.
The following equations yield similar intuitions, as checked in the sketch below.
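Continuing the sketch above (reusing `Phi`, `Phi_new`, `S_N`, `beta`, `t`, and `m_N`), the kernel view can be verified numerically:

```python
import numpy as np

# Equivalent-kernel view: prediction at x is a weighted sum of training targets.
K = beta * Phi_new @ S_N @ Phi.T             # K[i, n] = k(x_new_i, x_n)
y_kernel = K @ t                             # weighted average of observed targets
assert np.allclose(y_kernel, Phi_new @ m_N)  # equals the predictive mean
```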
12. Chapter 3.4. Bayesian model comparison
13. Chapter 3.5. The evidence approximation
Fully Bayesian treatment
The true predictive distribution is given by the following equation:
$$p(t \mid \mathbf{t}) = \iiint p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\; dw\, d\alpha\, d\beta.$$
This integral is analytically intractable! Thus, we take another approach: if the distribution $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around $(\hat{\alpha}, \hat{\beta})$, we can replace the integration over $\alpha$ and $\beta$ by plugging in the estimated values. That is,
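written out, the approximation replaces the outer integral with a point estimate (a reconstruction of the step the slide describes, with $\hat{\alpha}, \hat{\beta}$ denoting the mode of $p(\alpha, \beta \mid \mathbf{t})$):
$$p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid w, \hat{\beta})\, p(w \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, dw.$$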
14. Chapter 3.5. The evidence approximation
Evaluation of the evidence function
What we are trying to do is estimate the nuisance parameters $\alpha$ & $\beta$.
The evidence can be written as likelihood × prior integrated over $w$:
$$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid w, \beta)\, p(w \mid \alpha)\, dw,$$
where both factors were covered in the previous sections.
Now, let's rewrite $E(w)$ as follows. Why are we rewriting the equation?
1. We can perform the integral much more easily.
2. We can do model comparison.
3. We can estimate the nuisance parameters.
15. Chapter 3.5. The evidence approximation
Re-writing evidence function
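A sketch of the rewrite this slide refers to, reconstructed from the surrounding text (here $A = \alpha I + \beta \Phi^T \Phi$ and $m_N = \beta A^{-1} \Phi^T \mathbf{t}$): completing the square in $w$ gives
$$E(w) = \frac{\beta}{2}\lVert \mathbf{t} - \Phi w \rVert^2 + \frac{\alpha}{2}\, w^T w = E(m_N) + \frac{1}{2}(w - m_N)^T A\, (w - m_N),$$
so the Gaussian integral over $w$ can be done in closed form, yielding the log evidence
$$\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(m_N) - \frac{1}{2}\ln\lvert A \rvert - \frac{N}{2}\ln(2\pi).$$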
16. Chapter 3.5. The evidence approximation
Evidence function for the model comparison
Which model is best for the data?
= the model that yields the highest evidence value, $\max\, p(\mathbf{t} \mid \alpha, \beta)$!
This difficult integration was computed easily using the rewritten equation!
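A minimal sketch of evidence-based model comparison; the polynomial basis and the fixed values of $\alpha$ and $\beta$ are illustrative choices.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) for Bayesian linear regression."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - np.linalg.slogdet(A)[1] / 2 - N / 2 * np.log(2 * np.pi))

# Compare polynomial models of increasing degree on noisy-sine data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
for degree in range(10):
    Phi = np.vander(x, degree + 1, increasing=True)  # 1, x, ..., x^degree
    print(degree, round(log_evidence(Phi, t, alpha=5e-3, beta=25.0), 2))
```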
17. Chapter 3.5. The evidence approximation
Nuisance parameter estimation of $\alpha$ & $\beta$
Why is the derivative of $\ln\lvert A \rvert$ tractable? Because a determinant is equal to the product of its eigenvalues! (We covered this in multivariate analysis!)
Here $\alpha$ is the prior precision (prior variance $\alpha^{-1}$) and $\beta$ is the likelihood precision (noise variance $\beta^{-1}$, similar to $\sigma^2$).
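The resulting fixed-point updates, a reconstruction from the text (with $\lambda_i$ the eigenvalues of $\beta \Phi^T \Phi$ and $\gamma$ the effective number of well-determined parameters):
$$\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}, \qquad \alpha = \frac{\gamma}{m_N^T m_N}, \qquad \frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \bigl(t_n - m_N^T \phi(x_n)\bigr)^2.$$
Iterating these updates together with the posterior recomputation gives the evidence-maximizing $\hat{\alpha}$ and $\hat{\beta}$.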