1. Chapter 2
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
2. Chapter 2. Probability Distributions
“Frequentist (or classical)” vs “Bayesian”
The former assumes the parameter is an unknown fixed constant that we try to estimate.
The latter assumes the parameter is a random variable with its own distribution, which we try to estimate.
The second issue is parametric vs. non-parametric.
Some of you might have taken non-parametric statistics; even if you haven't, it really doesn't matter.
For the parametric approach, we assume a specific form for the distribution, e.g. $X \sim N(\mu_X, \Sigma_X)$.
On the other hand, for the non-parametric approach, we do not assume a specific form!
That does not mean we do not use distributions! We still use distributions, but we approximate them directly from the data!
In this chapter, we are going to cover various distributions.
Furthermore, we are going to take a look at the prior & posterior distributions of Bayesian statistics!
Okay now let’s get it!
Introduction to Chapter 2
3. Chapter 2.1. Binary variables
Consider a coin toss! The response variable takes the form of a binary variable!
That is, $x \in \{0, 1\}$.
The probability is modeled as $p(x = 1\,|\,\mu) = \mu$.
In Mathematical Statistics II, this was $p(X = 1\,|\,\theta) = \theta$. (They are the same!)
This is a well-known distribution, the Bernoulli distribution!
Bernoulli distribution
When we observe multiple trials, the likelihood function becomes
$$p(D\,|\,\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}.$$
For computational convenience, we take the log:
$$\ln p(D\,|\,\mu) = \sum_{n=1}^{N} \big\{ x_n \ln\mu + (1-x_n)\ln(1-\mu) \big\}.$$
We can compute the MLE from this likelihood function: $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$. (It's not very hard!)
Note that this MLE depends on the data only through the sum of the observed values, which is a "sufficient statistic!" (This was also covered in Mathematical Statistics II!)
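As a minimal numeric sketch (the coin-flip data here are made up), the MLE is just the sample mean:

```python
import numpy as np

# Hypothetical coin-toss data: 1 = heads, 0 = tails.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# The log-likelihood depends on the data only through sum(x) (the
# sufficient statistic), and maximizing it gives the sample mean.
mu_ml = x.sum() / len(x)
print(mu_ml)  # 0.7
```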
4. Chapter 2.1. Binary variables
The definition is… (From Prof. Kang's Mathematical Statistics II lecture note!)
Review. Sufficient statistics
In short, it is a statistic that makes the conditional distribution of the data, given the statistic, independent of the unknown parameter.
Intuitively, it absorbs the parameter's information and makes the relevant distributions easy to work with!
This plays an important role in finding the MVUE (minimum-variance unbiased estimator).
Its importance in this book will be introduced soon.
5. Chapter 2.1. Binary variables
Binomial distribution
It extends the Bernoulli distribution to N trials.
That is, "How many times is $X = 1$ observed in $N$ trials?"
Note that the trials are independent, so the mean and variance extend simply from the basic Bernoulli: $\mathbb{E}[m] = N\mu$, $\mathrm{var}[m] = N\mu(1-\mu)$.
Beta distribution
Note that the above treatment was a "frequentist approach."
Let's consider the Bayesian approach.
Here, we consider μ to be a random variable with its own distribution!
Then, which form should the distribution of μ take?
Probably the beta distribution! (Mathematical Statistics I)
The beta distribution's support is [0, 1]. Thus, it is well suited for modeling a probability parameter!
6. Chapter 2.1. Binary variables
Beta distribution
Famous as the prior distribution for the binomial family!
Note that $0 \le \mu \le 1$ is satisfied!
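For reference, the beta density and its first two moments are
$$\mathrm{Beta}(\mu\,|\,a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1},\qquad \mathbb{E}[\mu] = \frac{a}{a+b},\qquad \mathrm{var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}.$$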
Conjugate prior
If the posterior distribution of a model has the same functional form as its prior, then
that prior is called a "conjugate prior!"
This is important since we can obtain the posterior in a much simpler way!
7. Chapter 2.1. Binary variables
Bayesian posterior & Predictive distribution
$$\underbrace{p(\mu\,|\,D)}_{\text{posterior}} \;\propto\; \underbrace{p(D\,|\,\mu)}_{\text{likelihood}}\;\underbrace{p(\mu)}_{\text{prior}}$$
Here, the denominator does not depend on the parameter, so we can ignore it for now. (It is just the probability normalizer!)
Here, the likelihood is a product of Bernoullis, and the prior is a beta.
Thus, the posterior becomes
$$p(\mu\,|\,m, l, a, b) \;\propto\; \mu^{m+a-1}(1-\mu)^{l+b-1},\qquad \text{i.e. } \mathrm{Beta}(\mu\,|\,m+a,\; l+b),$$
where $m$ is the number of successes and $l = N - m$ the number of failures.
The prior distribution can encode information from related data or our prior belief!
The posterior distribution changes as it absorbs the information carried by the observed samples!
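A minimal sketch of this conjugate update (the hyperparameters and data below are made up):

```python
import numpy as np

# Hypothetical Beta(a, b) prior and observed Bernoulli trials.
a, b = 2.0, 2.0
x = np.array([1, 1, 0, 1, 0, 1, 1])

m = int(x.sum())   # successes
l = len(x) - m     # failures

# Conjugacy: Beta prior x Bernoulli likelihood -> Beta posterior.
a_post, b_post = a + m, b + l
print(a_post, b_post)              # Beta(7, 4)
print(a_post / (a_post + b_post))  # posterior mean ~= 0.64
```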
8. Chapter 2.1. Binary variables
Predictive distribution
Note that μ is a random variable under the Bayesian treatment.
Thus, in order to generate a prediction, we have to marginalize out μ: $p(x=1\,|\,D) = \int_0^1 p(x=1\,|\,\mu)\,p(\mu\,|\,D)\,d\mu = \mathbb{E}[\mu\,|\,D]$.
Here, we can get some important intuition.
Likelihood: $m$ successes and $l = N - m$ failures. Prior: parameters $a$ and $b$. Then
$$p(x=1\,|\,D) = \frac{m+a}{m+a+l+b} = \frac{m+l}{m+a+l+b}\cdot\underbrace{\frac{m}{m+l}}_{\text{MLE}} \;+\; \frac{a+b}{m+a+l+b}\cdot\underbrace{\frac{a}{a+b}}_{\text{prior mean}}.$$
This means the posterior mean is a weighted average of the MLE and the prior mean!
As the number of data points increases, the prior's weight decreases and the MLE's weight increases!
Furthermore, the posterior variance shrinks as more data are observed, and on average it is ≤ the prior variance! This can be shown in general using the laws of total expectation and variance, as spelled out below.
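Concretely, writing θ for the parameter and D for the data:
$$\mathbb{E}_{\theta}[\theta] = \mathbb{E}_{D}\big[\mathbb{E}_{\theta}[\theta\,|\,D]\big],\qquad \mathrm{var}_{\theta}[\theta] = \mathbb{E}_{D}\big[\mathrm{var}_{\theta}[\theta\,|\,D]\big] + \mathrm{var}_{D}\big[\mathbb{E}_{\theta}[\theta\,|\,D]\big],$$
so $\mathbb{E}_{D}\big[\mathrm{var}_{\theta}[\theta\,|\,D]\big] \le \mathrm{var}_{\theta}[\theta]$: averaged over datasets, the posterior variance never exceeds the prior variance.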
(Figure: prior vs. posterior densities of μ.)
9. Chapter 2.2. Multinomial Variables
Multiclass classification
Consider MNIST, which has multiple output classes!
In the previous section, we only considered two classes, with probabilities μ and 1 − μ.
Now there are multiple classes, so we use 1-of-K coding: $\mathbf{x}$ is a vector with exactly one element equal to 1.
The likelihood for one observation is $p(\mathbf{x}\,|\,\boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$.
For multiple trials, the likelihood becomes $p(D\,|\,\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$, where $m_k = \sum_n x_{nk}$.
To estimate the parameters we again use the MLE, but we have the constraint $\sum_k \mu_k = 1$.
Here, the Lagrange multiplier $\lambda = -N$ is obtained via Lagrangian optimization, which yields $\mu_k^{\mathrm{ML}} = m_k / N$; the derivation is written out below.
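The missing derivation step, spelled out:
$$\mathcal{L} = \sum_{k=1}^{K} m_k \ln\mu_k + \lambda\Big(\sum_{k=1}^{K}\mu_k - 1\Big),\qquad \frac{\partial\mathcal{L}}{\partial\mu_k} = \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \mu_k = -\frac{m_k}{\lambda}.$$
Substituting into the constraint $\sum_k \mu_k = 1$ gives $\lambda = -\sum_k m_k = -N$, hence $\mu_k^{\mathrm{ML}} = m_k / N$.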
Just like the binomial case, this can be extended to the multinomial distribution over the counts $m_1, \dots, m_K$!
10. Chapter 2.2. Multinomial Variables
Dirichlet distribution
Now, again, we need a prior distribution for the multinomial!
Recall the form of the beta distribution: a product of the parameters raised to powers.
So we can say the Dirichlet distribution is the extension of the beta to the multi-class case.
Here $0 \le \mu_k \le 1$ and $\sum_k \mu_k = 1$, where $\boldsymbol{\alpha}$ denotes $(\alpha_1, \alpha_2, \dots, \alpha_K)^T$.
The entire process is just an extension of the previous binomial-beta pair!
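For reference, the Dirichlet density and the resulting conjugate update are
$$\mathrm{Dir}(\boldsymbol{\mu}\,|\,\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K}\mu_k^{\alpha_k - 1},\qquad \alpha_0 = \sum_{k=1}^{K}\alpha_k,$$
and the posterior is $p(\boldsymbol{\mu}\,|\,D, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu}\,|\,\boldsymbol{\alpha} + \mathbf{m})$, where $\mathbf{m} = (m_1, \dots, m_K)^T$ are the observed counts.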
11. Chapter 2.3. Gaussian Distribution
Introduction to Gaussian distribution
For statisticians, the Gaussian (normal) distribution is one of the most important distributions!
It has some nice properties, especially in connection with the central limit theorem!
Review. C.L.T.
Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2 < \infty$.
Then $Y_n = \dfrac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}$ has the limiting distribution $N(0, 1)$.
Review. Delta method
Consider a function $u$ that is differentiable at $\mu$ with $u'(\mu) \ne 0$. Then, approximately,
$$u(\bar{X}_n) \;\sim\; N\!\Big(u(\mu),\; \big[u'(\mu)\big]^2 \frac{\sigma^2}{n}\Big).$$
This can be easily proven by Taylor series expansion!
Univariate: $N(x\,|\,\mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\dfrac{(x-\mu)^2}{2\sigma^2}\Big)$.
Multivariate: $N(\mathbf{x}\,|\,\boldsymbol{\mu}, \Sigma) = \dfrac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\Big)$.
12. Chapter 2.3. Gaussian Distribution
Analytical properties of the Gaussian distribution
Look at the quadratic form in the exponent!
$$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
This quantity is known as the "Mahalanobis distance!"
Since $\Sigma$ is a real symmetric matrix, we can perform a spectral (eigenvalue) decomposition on it: $\Sigma = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$.
Here, we can choose the eigenvectors $\mathbf{u}_i$ to form an orthonormal set; then they satisfy $\mathbf{u}_i^T \mathbf{u}_j = I_{ij}$.
Note that the inverse matrix can be expressed through the inverses of the eigenvalues: $\Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T$.
Using this, we can rewrite $\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}$, where $y_i = \mathbf{u}_i^T(\mathbf{x} - \boldsymbol{\mu})$.
This result gives the following geometric intuition: the density contours are ellipsoids centered at μ, with axes along the eigenvectors $\mathbf{u}_i$ and axis lengths proportional to $\sqrt{\lambda_i}$. A numeric check follows.
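A small numpy sketch (with made-up numbers) verifying that the two expressions for $\Delta^2$ agree:

```python
import numpy as np

# Hypothetical 2-D covariance matrix and a query point.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([0.0, 0.0])
x  = np.array([1.0, 2.0])

# Spectral decomposition: Sigma = sum_i lam_i * u_i u_i^T.
lam, U = np.linalg.eigh(Sigma)  # columns of U are orthonormal eigenvectors

# Mahalanobis distance, directly and via y_i = u_i^T (x - mu).
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
y = U.T @ (x - mu)
d2_eigen = np.sum(y**2 / lam)
print(np.isclose(d2_direct, d2_eigen))  # True
```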
13. Chapter 2.3. Gaussian Distribution
Standardization of normal distribution
We have computed $Z = \dfrac{X - \mu}{\sigma}$ numerous times!
The same holds in the multivariate case!
Recall the transformation above: $y_i = \mathbf{u}_i^T(\mathbf{x} - \boldsymbol{\mu})$, i.e. $\mathbf{y} = U(\mathbf{x} - \boldsymbol{\mu})$ with orthogonal $U$.
This means that the Jacobian $|\mathbf{J}|$ is equal to 1!
This transformation implies some important ideas!
I. The transformation $X \to Y$ makes the components independent!
II. The overall density is expressed as a product of independent normal densities!
III. The result is still a probability density function, since it integrates to 1.
IV. Geometrically, the transformation shifts and rotates the distribution!
The slide also shows the moments of the normal distribution ($\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$, $\mathrm{cov}[\mathbf{x}] = \Sigma$), but let's skip them since we all know them.
14. Chapter 2.3. Gaussian Distribution
Conditional normal distribution
Let's think about the Gaussian distribution over a partitioned vector!
The overall distribution is $X \sim N(\boldsymbol{\mu}, \Sigma)$. Let's partition it into $X_a, X_b$. Then…
First, what we want is $p(X_a\,|\,X_b) = \dfrac{p(X_a, X_b)}{p(X_b)}$. Since $p(X_a\,|\,X_b)$ depends on $X_a$, we have to find the functional form of $p(X_a, X_b)$ in $X_a$!
Let’s re-write the exponent!
The result is still an exponential of a quadratic form in $X_a$.
Thus, we can infer that the conditional distribution is also Gaussian!
The full calculation is not necessary here (we covered it in multivariate analysis!); the result is quoted below.
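For the record, the standard result is
$$p(X_a\,|\,X_b) = N\big(X_a\,\big|\,\mu_{a|b},\; \Sigma_{a|b}\big),\qquad \mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\qquad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}.$$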
Let's remember this result, since it will be used continuously!
15. Chapter 2.3. Gaussian Distribution
Marginal normal distribution
Again, let's assume $p(X) = p(X_a, X_b) = N(X\,|\,\boldsymbol{\mu}, \Sigma)$.
Then, what is the marginal distribution $p(X_a) = \int p(X_a, X_b)\, dX_b$?
Here, the calculation is a bit tricky. Just keep in mind that the marginal is again Gaussian: $p(X_a) = N(X_a\,|\,\boldsymbol{\mu}_a, \Sigma_{aa})$!
Summary
16. Chapter 2.3. Gaussian Distribution
Bayes' theorem for Gaussian variables
These equations are very useful; we use them to easily compute posterior and predictive distributions!
They are derived by completing the square in the joint distribution (the result matters more than the derivation!); the key formulas are quoted below.
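The equations in question, for the linear-Gaussian model $p(x) = N(x\,|\,\mu, \Lambda^{-1})$ and $p(y\,|\,x) = N(y\,|\,Ax + b, L^{-1})$, are
$$p(y) = N\big(y\,\big|\,A\mu + b,\; L^{-1} + A\Lambda^{-1}A^T\big),$$
$$p(x\,|\,y) = N\big(x\,\big|\,\Sigma\{A^T L (y - b) + \Lambda\mu\},\; \Sigma\big),\qquad \Sigma = (\Lambda + A^T L A)^{-1}.$$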
17. Chapter 2.3. Gaussian Distribution
Parameter estimation
Note that μ & Σ are unknown parameters! Thus, we have to estimate them by MLE!
By taking derivatives… we've done it so~ many~ times~ in Mathematical Statistics II, so let's skip the procedure: $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_n \mathbf{x}_n$ and $\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_n (\mathbf{x}_n - \mu_{\mathrm{ML}})(\mathbf{x}_n - \mu_{\mathrm{ML}})^T$.
Note that this estimator of the covariance is biased: $\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\Sigma$.
Sequential estimation
This gives a pretty interesting intuition! Take a look at the following update.
Consider that data $x_1, x_2, \dots, x_{N-1}$ were observed, and we have just observed $x_N$.
Then the mean estimate moves a bit towards $x_N$, by a step of size $\frac{1}{N}$:
$$\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\big(x_N - \mu_{\mathrm{ML}}^{(N-1)}\big).$$
This gives a clear intuition for sequential numerical approaches.
However, such a clean sequential update cannot always be derived for general MLE problems; we need a more general framework.
18. Chapter 2.3. Gaussian Distribution
Robbins-Monro algorithm
There are various methods for finding a root $\theta^\star$ of a regression function $f(\theta) = \mathbb{E}[z\,|\,\theta]$! Let's see the Robbins-Monro method!
There are some assumptions:
a. The conditional variance is finite: $\mathbb{E}\big[(z - f)^2 \,\big|\, \theta\big] < \infty$.
b. $f(\theta) > 0$ for $\theta > \theta^\star$.
c. $f(\theta) < 0$ for $\theta < \theta^\star$.
The update equation is $\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\big(\theta^{(N-1)}\big)$. (Here $z(\theta)$ is an observed value at the current estimate of θ!)
Here $\{a_N\}$ is a sequence of positive numbers that satisfies $\lim_{N\to\infty} a_N = 0$, $\sum_N a_N = \infty$, and $\sum_N a_N^2 < \infty$.
This can be applied to MLE! (Since sometimes the equation "derivative = 0" is hard to solve directly.)
For example, for the Gaussian mean, the update becomes $\mu^{(N)} = \mu^{(N-1)} + a_{N-1}\big(x_N - \mu^{(N-1)}\big)$, recovering the sequential update above; a runnable sketch follows.
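A minimal sketch of Robbins-Monro for the Gaussian mean (the true mean, noise level, and step schedule below are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 3.0, 1.0  # hypothetical data-generating parameters

# Robbins-Monro for the root of f(mu) = E[x - mu], i.e. mu* = mu_true.
# Steps a_n = 1/n satisfy: a_n -> 0, sum a_n = inf, sum a_n^2 < inf,
# and reproduce the 1/N sequential-mean update from the previous slide.
mu = 0.0
for n in range(1, 10_001):
    x_n = rng.normal(mu_true, sigma)   # new observation
    mu = mu + (1.0 / n) * (x_n - mu)   # stochastic root-finding step
print(mu)  # close to 3.0
```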
19. Chapter 2.3. Gaussian Distribution
Bayesian inference for the Gaussian mean (Normal - Normal)
Here, we are trying to specify the prior and find the posterior for a Gaussian distribution!
There are two parameters, μ & Σ, and we want distributions over them.
First, let's think about the mean parameter μ, assuming the variance is known.
The likelihood is an exponential of a quadratic form in μ, and this implies the prior for μ can also be Gaussian!
We have set: prior: Gaussian; likelihood: Gaussian. And the resulting posterior is also Gaussian!
* I skipped the detailed calculation; the result is quoted below.
Note that the mean of the posterior is a weighted average of the prior mean and the likelihood (ML) mean!
Here, as N → ∞, the posterior precision goes to infinity, and the posterior mean converges to the MLE!
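The skipped calculation gives, for prior $N(\mu\,|\,\mu_0, \sigma_0^2)$ and known variance $\sigma^2$:
$$p(\mu\,|\,D) = N(\mu\,|\,\mu_N, \sigma_N^2),\qquad \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}},\qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}.$$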
(Figure: how the posterior of μ changes as data accumulate.)
20. Chapter 2.3. Gaussian Distribution
Sequential approach
Consider that we have observed $x_1, \dots, x_{N-1}$, and we have just observed $x_N$.
The posterior can then be re-expressed as
$$p(\mu\,|\,D) \;\propto\; \Big[\underbrace{p(\mu)\prod_{n=1}^{N-1} p(x_n\,|\,\mu)}_{\text{prior (previous posterior)}}\Big]\; \underbrace{p(x_N\,|\,\mu)}_{\text{new likelihood}}.$$
Inference of the variance (gamma prior → gamma posterior)
We assumed the variance was known; now we move on to an unknown variance (with known mean)!
Here, let $\lambda = \frac{1}{\sigma^2}$, the precision! Then the likelihood function becomes
$$p(D\,|\,\lambda) \;\propto\; \lambda^{N/2} \exp\!\Big(-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\Big).$$
We are trying to set up a conjugate prior, so we use the gamma distribution $\mathrm{Gam}(\lambda\,|\,a_0, b_0)$ as the prior!
The final posterior becomes $\mathrm{Gam}(\lambda\,|\,a_N, b_N)$ with $a_N = a_0 + \frac{N}{2}$ and $b_N = b_0 + \frac{1}{2}\sum_n (x_n - \mu)^2$.
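A minimal sketch of this update (the data, known mean, and prior hyperparameters below are made up):

```python
import numpy as np

# Hypothetical data with known mean and a Gam(a0, b0) prior on the precision.
x = np.array([4.8, 5.1, 5.3, 4.9, 5.0])
mu = 5.0           # known mean
a0, b0 = 1.0, 1.0  # prior hyperparameters

# Conjugate update: Gam(a0, b0) -> Gam(aN, bN).
N = len(x)
aN = a0 + N / 2
bN = b0 + 0.5 * np.sum((x - mu) ** 2)
print(aN, bN)      # posterior mean of lambda is aN / bN
```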
21. Chapter 2.3. Gaussian Distribution
Multiple unknown parameters
Now, let's consider the case where μ and the precision λ are both unknown!
It turns out to be natural to assume that $p(\mu\,|\,\lambda)$ follows a Gaussian distribution whose precision is a linear function of λ.
Thus, we can rewrite the joint prior as
$$p(\mu, \lambda) = p(\mu\,|\,\lambda)\,p(\lambda) = N\big(\mu\,\big|\,\mu_0, (\beta\lambda)^{-1}\big)\,\mathrm{Gam}(\lambda\,|\,a, b),$$
where $\mu_0 = \dfrac{c}{\beta}$, $a = 1 + \dfrac{\beta}{2}$, $b = d - \dfrac{c^2}{2\beta}$.
This distribution is known as the normal-gamma distribution!
Remember, we are constructing conjugate priors!
(Figure: contours of the normal-gamma distribution.)
Wishart distribution
For the multivariate Gaussian, we want a prior over the precision matrix Λ, for known mean!
Here, the conjugate prior is the Wishart distribution,
$$\mathcal{W}(\Lambda\,|\,W, \nu) \;\propto\; |\Lambda|^{(\nu - D - 1)/2} \exp\!\Big(-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\Big),$$
where $W$ is a $D \times D$ scale matrix and ν is the number of degrees of freedom.
If both the mean and the precision are unknown, then the conjugate prior is the normal-Wishart distribution!
22. Chapter 2.3. Gaussian Distribution
Student’s t-distribution
We all know the t-distribution so well: it is defined by
$$t \;=\; \frac{Z}{\sqrt{V / k}},$$
where Z is a standard normal, V is chi-square with k degrees of freedom, and Z and V are independent!
Here, we have $N(x\,|\,\mu, \tau^{-1})$ together with a gamma prior $\mathrm{Gam}(\tau\,|\,a, b)$. Can we build the t-distribution from these?
We want the distribution of x averaged over all precision values, so let's marginalize out the precision: $p(x\,|\,\mu, a, b) = \int_0^\infty N(x\,|\,\mu, \tau^{-1})\,\mathrm{Gam}(\tau\,|\,a, b)\,d\tau$.
This becomes Student's t-distribution $\mathrm{St}(x\,|\,\mu, \lambda, \nu)$, where ν denotes the degrees of freedom.
For ν = 1 it is the Cauchy distribution; as ν → ∞ it approaches $N(x\,|\,\mu, \lambda^{-1})$.
This means we obtain the t-distribution as an infinite mixture of Gaussians over the precision values!
Note that the t-distribution has heavier tails, which is connected to "robustness."
That is, it is relatively less sensitive to outliers.
Uni: $\mathrm{St}(x\,|\,\mu, \lambda, \nu) = \dfrac{\Gamma\!\big(\frac{\nu+1}{2}\big)}{\Gamma\!\big(\frac{\nu}{2}\big)} \Big(\dfrac{\lambda}{\pi\nu}\Big)^{1/2} \Big[1 + \dfrac{\lambda(x-\mu)^2}{\nu}\Big]^{-(\nu+1)/2}$
Multi: $\mathrm{St}(\mathbf{x}\,|\,\boldsymbol{\mu}, \Lambda, \nu) = \dfrac{\Gamma\!\big(\frac{D+\nu}{2}\big)}{\Gamma\!\big(\frac{\nu}{2}\big)} \dfrac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}} \Big[1 + \dfrac{\Delta^2}{\nu}\Big]^{-(D+\nu)/2}$, with $\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^T\Lambda(\mathbf{x}-\boldsymbol{\mu})$.
23. Chapter 2.3. Gaussian Distribution
Mixtures of Gaussians
Until now, we considered unimodal models.
What will the shape of a distribution with several modes be?
We can build one as a linear combination of Gaussian distributions: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, N(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \Sigma_k)$!
Here, the mixing coefficients must satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.
Here, each sub-distribution is called a 'component'.
Then, to which component should we assign a specific data point $\mathbf{x}_n$? → Let's use Bayes' theorem! The posterior probability ("responsibility") of component k is
$$\gamma_k(\mathbf{x}) \;=\; p(k\,|\,\mathbf{x}) \;=\; \frac{\pi_k\, N(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j} \pi_j\, N(\mathbf{x}\,|\,\boldsymbol{\mu}_j, \Sigma_j)},$$
as the sketch below computes.
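A minimal 1-D sketch (all mixture parameters and the query point here are made up):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # Univariate Gaussian density.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical two-component mixture parameters.
pi_ = np.array([0.4, 0.6])
mu_ = np.array([-1.0, 2.0])
var = np.array([1.0, 0.5])

x_n = 1.2  # query point

# Responsibilities: gamma_k = pi_k N(x|mu_k) / sum_j pi_j N(x|mu_j)
numer = pi_ * gauss_pdf(x_n, mu_, var)
gamma = numer / numer.sum()
print(gamma)  # x_n is assigned mostly to the second component
```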
Estimation of π, μ, Σ will be covered in Chapter 9, via the EM algorithm!
24. Chapter 2.4. Exponential Family
In mathematical statistics…
We have learned that exponential-family distributions are complete!
(Completeness: $\mathbb{E}[u(X)] = 0$ for all parameter values if and only if $u(X) = 0$ almost surely.)
Definition
The exponential family of distributions over X, given parameter η, is defined as the set of distributions of the form
$$p(\mathbf{x}\,|\,\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\!\big(\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\big).$$
We can regard $g(\boldsymbol{\eta})$ as a normalization constant that makes the integral equal to 1.
Ex 1. The Bernoulli distribution can be brought to this shape, with
$$\mu = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)},$$
the logistic sigmoid; the identification is spelled out below.
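Spelled out:
$$p(x\,|\,\mu) = \mu^x (1-\mu)^{1-x} = (1-\mu)\exp\!\Big(x \ln\frac{\mu}{1-\mu}\Big),$$
so with $\eta = \ln\frac{\mu}{1-\mu}$ we identify $u(x) = x$, $h(x) = 1$, and $g(\eta) = \sigma(-\eta) = 1 - \mu$, which inverts to the logistic sigmoid $\mu = \sigma(\eta)$.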
25. Chapter 2.4. Exponential Family
Multinomial distribution
Let's extend the previous example to the multinomial distribution!
Here $\eta_k = \ln \mu_k$.
In fact, we didn't yet take into account the constraint $\sum_k \mu_k = 1$.
Let's rewrite the equation using
$$\mu_M = 1 - \sum_{k=1}^{M-1} \mu_k,$$
which leads to the softmax parameterization $\mu_k = \dfrac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1} \exp(\eta_j)}$.
26. Chapter 2.4. Exponential Family
Gaussian distribution
Everything is the same for the Gaussian; it is also a member of the exponential family!
Maximum likelihood and sufficient statistics
For the exponential family, it is easy to obtain the moments!
Differentiating the normalization condition $g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\big(\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\big)\, d\mathbf{x} = 1$ with respect to η gives
$$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})].$$
Similarly, we can get higher-order moments by taking further derivatives of this term!
27. Chapter 2.4. Exponential Family
Usage of sufficient statistics
Let's compute the likelihood function for the exponential family!
Setting the gradient of the log-likelihood to zero, we get $-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$.
By inverting this relation (in principle, via $g^{-1}(\cdot)$), we can obtain the MLE $\boldsymbol{\eta}_{\mathrm{ML}}$!
This result indicates that the solution for $\boldsymbol{\eta}_{\mathrm{ML}}$ depends on the data only through $\sum_n \mathbf{u}(\mathbf{x}_n)$!
This is the sufficient statistic!!
Conjugate prior
Recall the conjugate prior: the posterior has the same functional form as the prior distribution.
Note that a conjugate prior always exists for the exponential family!
Prior: $p(\boldsymbol{\eta}\,|\,\boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu} \exp\!\big(\nu\, \boldsymbol{\eta}^T \boldsymbol{\chi}\big)$.
Likelihood: $p(D\,|\,\boldsymbol{\eta}) \propto g(\boldsymbol{\eta})^{N} \exp\!\big(\boldsymbol{\eta}^T \sum_n \mathbf{u}(\mathbf{x}_n)\big)$.
Posterior: $p(\boldsymbol{\eta}\,|\,D, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\!\Big(\boldsymbol{\eta}^T \big(\sum_n \mathbf{u}(\mathbf{x}_n) + \nu\boldsymbol{\chi}\big)\Big)$.
Note that the prior & posterior have the same functional form!!
28. Chapter 2.4. Exponential Family
Noninformative priors
In Bayesian statistics, we can set a reasonable prior by using outside information (knowledge)!
However, in some cases, we might not have enough (or accurate) information.
Here, we want the prior to have as little influence as possible → a noninformative prior.
(This example is from Prof. Kang's lecture note.)
Consider a binomial likelihood with a beta prior & posterior.
If we do not have any prior information, it seems reasonable to set α = β = 1.
Then the prior becomes the uniform distribution!
But there still remains an influence of the prior, pulling the posterior mean towards $\frac{1}{2}$!
Here, we could instead set α = β = 0, but then the prior is no longer a pdf.
Like this, if a prior itself is not a pdf but its posterior is a proper pdf, such a prior is called an improper prior.
Note that:
I. A noninformative prior is not always improper.
II. Likewise, an improper prior is not always noninformative!
The two properties are not equivalent (no "if and only if")!
29. Chapter 2.4. Exponential Family
Noninformative priors
Difficulty: transformation of variables
With a noninformative (constant) prior, there are some difficulties regarding variable transformation.
Simply, think of this: take a constant prior density $h(\lambda) = c$, where c is a constant, and change variables via $\lambda = \eta^2$.
Then, as a plain function, it is obvious that $h(\lambda) = h(\eta^2)$ is still constant.
However, a pdf transforms differently: we must include the Jacobian, so $p_\eta(\eta) = p_\lambda(\lambda)\left|\frac{d\lambda}{d\eta}\right| = p_\lambda(\eta^2)\, 2\eta \propto \eta$.
The transformed pdf now depends on η; it is not constant anymore…
Note that this issue does not arise with the likelihood function, since the likelihood treats the parameter as a given quantity rather than as a random variable with a density!
Examples of noninformative priors
Translation invariance (location parameter): $\hat{x} = x + c$, $\hat{\mu} = \mu + c$.
Thus, we need a prior that assigns equal probability to $A \le \mu \le B$ and $A - c \le \mu \le B - c$, i.e. a constant prior over μ.
An example is the Gaussian prior $N(\mu\,|\,\mu_0, \sigma_0^2)$ with $\sigma_0^2 \to \infty$.
Scale invariance (scale parameter): $\hat{x} = cx$, $\hat{\sigma} = c\sigma$.
Thus, we need a prior that assigns equal probability to $A \le \sigma \le B$ and $\frac{A}{c} \le \sigma \le \frac{B}{c}$, i.e. $p(\sigma) \propto \frac{1}{\sigma}$.
For the Gaussian, expressed over the precision λ, an example is the gamma prior $\mathrm{Gam}(\lambda\,|\,0, 0)$.
30. Chapter 2.5. Nonparametric Methods
Nonparametric approach
Until now, we assumed a specific parametric form for the probability distribution.
Now we study the nonparametric density approach: we try to estimate the density of the data directly.
Histogram
We have seen histograms everywhere, from 'intro to stat' to 'nonparametric statistics'!
We count the number $n_i$ of data points that fall into the $i$-th bin of width $\Delta_i$, and estimate $p_i = \dfrac{n_i}{N \Delta_i}$.
Commonly, we set all $\Delta_i$ to be the same (constant)!
- We do not need the data anymore once the histogram has been computed.
- Useful for quick visualization.
- For multi-dimensional data, M bins per dimension with D-dimensional data gives $M^D$ bins!
- To find the probability at a specific position, we only look at the data nearest to it (the data in its bin).
- Furthermore, we need an adequate value for the bin width Δ.
31. Chapter 2.5. Nonparametric Methods
Kernel density method
Let's generalize this idea. There is an unknown probability density $p(\mathbf{x})$.
Consider a specific region ℛ in the data space. The probability that a data point falls in ℛ is $P = \int_{\mathcal{R}} p(\mathbf{x})\, d\mathbf{x}$.
Then, under the i.i.d. assumption, each data point independently falls into ℛ with probability P. For N data points, the total count K follows the binomial distribution $\mathrm{Bin}(K\,|\,N, P)$.
I. From this, we infer that $\mathbb{E}\!\big[\tfrac{K}{N}\big] = P$ and $\mathrm{var}\!\big[\tfrac{K}{N}\big] = \dfrac{P(1-P)}{N}$.
II. As N → ∞, this distribution becomes very sharp (variance approximately zero), thus $K \cong NP$.
III. If ℛ is sufficiently small, $p(\mathbf{x})$ is roughly constant over it, so the integral gives $P \cong p(\mathbf{x})\, V$, where V is the volume of ℛ.
IV. Joining the results of II & III, we get $p(\mathbf{x}) \cong \dfrac{K}{N V}$.
Kernel estimation assumes:
1. The region is very small!
2. A large number of data points!
Here, we can fix either K or V (N, the number of data points, is given!).
In the kernel method we fix the volume V and count K; in the nearest-neighbour method we instead fix K and let V vary!
32. Chapter 2.5. Nonparametric Methods
Kernel density method
Before getting into the details of the kernel method, think of a kernel function, e.g. the Parzen window: $k(\mathbf{u}) = 1$ if $|u_i| \le \frac{1}{2}$ for all $i = 1, \dots, D$, and $0$ otherwise.
That means, for a hypercube centered at a specific point $\mathbf{x}$, we are counting the number of data points that fall inside it.
The resulting estimate is
$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\Big(\frac{\mathbf{x} - \mathbf{x}_n}{h}\Big),$$
where $h^D$ is the volume of the hypercube!
However, this basic kernel gives a stair-shaped (discontinuous) density.
Thus, we can use a Gaussian kernel to get a smooth shape:
$$p(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\Big(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2h^2}\Big).$$
(Figure: see how the estimated density changes as h changes…)
In fact, we can choose any kernel that satisfies $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\, d\mathbf{u} = 1$! Do you have any idea?? A minimal sketch follows.
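A minimal Gaussian-KDE sketch (the sample, grid, and bandwidth below are made up):

```python
import numpy as np

# Hypothetical 1-D sample and evaluation grid.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
grid = np.linspace(-5, 5, 200)
h = 0.3  # bandwidth

# Gaussian kernel density estimate: p(x) = (1/N) sum_n N(x | x_n, h^2)
diffs = (grid[:, None] - data[None, :]) / h
p_hat = np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
print(np.trapz(p_hat, grid))  # ~1, so it is a valid density
```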
33. Chapter 2.5. Nonparametric Methods
Nearest-neighbor methods
The hyperparameter h depends on the data, and it is pretty hard to find an adequate value of h.
We can overcome this issue by using the nearest-neighbour method: fix K, grow the volume V around $\mathbf{x}$ until it contains exactly K points, and use $p(\mathbf{x}) \cong \frac{K}{NV}$!