Chapter 2
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 2. Probability Distribution
2
“Frequentist (or classical)” vs “Bayesian”
The former assumes the parameter is an unknown fixed constant, and we try to estimate it.
The latter assumes the parameter is a random variable with its own distribution, and we try to estimate that distribution.
The second issue is parametric vs. non-parametric.
Some of you may have taken non-parametric statistics; even if you haven't, it doesn't matter here.
For parametric methods, we assume a specific form of the distribution. (𝑋 ~ 𝑁(𝜇𝑋, Σ𝑥))
On the other hand, for non-parametric methods, we do not assume a specific form!
That does not mean we do not use distributions! We still use them, but we try to find the distribution approximately from the data!
In this chapter, we are going to cover various distributions.
Furthermore, we are going to take a look at the prior & posterior distributions of Bayesian statistics!
Okay now let’s get it!
Introduction to Chapter 2
Chapter 2.1. Binary variables
3
Consider a coin toss! The response variable takes the form of a binary variable!
That is, 𝑥 ∈ {0, 1}
The probability is given by 𝑝(𝑥 = 1 | 𝜇) = 𝜇
In mathematical statistics II, this was written 𝑝(𝑋 = 1 | 𝜃) = 𝜃. (They are the same!)
This is a well-known distribution, a Bernoulli distribution!
Bernoulli distribution
When we observe multiple trials, the likelihood function becomes 𝑝(𝒟 | 𝜇) = ∏𝑛 𝜇^𝑥𝑛 (1 − 𝜇)^(1−𝑥𝑛).
For computational convenience, we work with the 𝒍𝒐𝒈-likelihood.
We can compute the MLE from this likelihood function: 𝜇ML = (1/𝑁) Σ𝑛 𝑥𝑛. (It's not very hard!)
Note that this MLE depends on the data only through the sum of the observed values, which is a "sufficient statistic!" (This was also covered in mathematical statistics II!)
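As a minimal sketch (assuming NumPy; the data and names such as bernoulli_log_likelihood are purely illustrative), the log-likelihood and its maximizer, the sample mean, can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)      # N Bernoulli(mu = 0.3) observations

def bernoulli_log_likelihood(mu, x):
    # ln p(D | mu) = sum_n [ x_n ln mu + (1 - x_n) ln(1 - mu) ]
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_mle = x.mean()                       # MLE: depends on the data only via sum(x)
print(mu_mle, bernoulli_log_likelihood(mu_mle, x))
```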
Chapter 2.1. Binary variables
4
Definition is… (From Prof. Kang's Mathematical Statistics II lecture note!)
Review. Sufficient statistics
In short, a sufficient statistic is a statistic that makes the conditional distribution of the data, given the statistic, independent of the unknown parameter.
Intuitively, it absorbs the parameter's information and makes the relevant distribution easy to work with!
This plays an important role in finding the MVUE (minimum variance unbiased estimator).
Its importance in this book will be introduced soon.
Chapter 2.1. Binary variables
5
Binomial distribution
It extends the Bernoulli distribution to N trials.
That is, "how many times is 𝑋 = 1 observed in N trials?"
Note that the trials are "independent", so the mean and variance extend simply from the basic Bernoulli.
Beta distribution
Note that above treatment was a “frequentist approach”.
Let’s consider Bayes approach.
Here, let’s condiser 𝜇 is a random variable and has its own distribution!
Then, which form should this 𝜇 be?
Probably, beta distribution! (Mathematical statistics I)
Beta distribution’s domain is [0, 1]. Thus, it is great to approximate probability distribution!
Chapter 2.1. Binary variables
6
Beta distribution
Famous as the prior distribution for the binomial family!
Note that it is supported on 0 < 𝜇 < 1!
Conjugate prior
If the posterior distribution of a certain model has the same functional form as its prior, then
that prior is called a "conjugate prior!"
This is important since we can then obtain the posterior in a much simpler way!
Chapter 2.1. Binary variables
7
Bayesian posterior & Predictive distribution
Posterior ∝ likelihood function × prior
Here, the denominator does not depend on the parameter, so we can ignore it for now (it is actually the normalizing constant!)
Here, the likelihood is a product of Bernoullis, and the prior is a beta.
Thus, the posterior becomes Beta(𝜇 | 𝑚 + 𝑎, 𝑙 + 𝑏), where 𝑚 and 𝑙 are the numbers of ones and zeros observed.
The prior distribution can encode information from related data or our prior belief!
The posterior distribution changes as it absorbs the information carried by the observed samples!
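A small sketch of this conjugate update (NumPy; the prior values a0, b0 and the data are illustrative): observing m ones and l zeros turns a Beta(a, b) prior into a Beta(a + m, b + l) posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50)   # observed coin flips

a0, b0 = 2.0, 2.0                   # beta prior parameters
m = int(x.sum())                    # number of ones (successes)
l = len(x) - m                      # number of zeros (failures)

a_post, b_post = a0 + m, b0 + l     # posterior is again a beta distribution
print(a_post, b_post, a_post / (a_post + b_post))   # parameters and posterior mean
```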
Chapter 2.1. Binary variables
8
Predictive distribution
Note that 𝜇 is a random variable under the Bayesian treatment.
Thus, in order to generate prediction, we have to marginalize out 𝜇.
Here, we can get some important intuition.
Likelihood: 𝑚 successes and 𝑙 = 𝑁 − 𝑚 failures.
Prior: beta with parameters 𝑎 and 𝑏.
Posterior mean of 𝜇:
(𝑚 + 𝑎) / (𝑚 + 𝑎 + 𝑙 + 𝑏) = [(𝑚 + 𝑙) / (𝑚 + 𝑎 + 𝑙 + 𝑏)] ∗ [𝑚 / (𝑚 + 𝑙)] + [(𝑎 + 𝑏) / (𝑚 + 𝑎 + 𝑙 + 𝑏)] ∗ [𝑎 / (𝑎 + 𝑏)]
Here 𝑚 / (𝑚 + 𝑙) is the MLE and 𝑎 / (𝑎 + 𝑏) is the prior mean.
This means the posterior mean is a weighted average of the MLE and the prior mean!
As the number of data points increases, the prior's weight decreases and the MLE's weight increases!
Furthermore, the posterior variance shrinks as more data are observed, and on average it is ≤ the prior variance! This can be shown in general using the laws of total expectation and variance.
(Figure: prior vs. posterior density of 𝜇.)
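A quick numerical check of the identity above (the counts and prior parameters are made up for illustration): the posterior mean equals the stated weighted average of the MLE and the prior mean.

```python
m, l = 30, 20          # observed successes and failures
a, b = 3.0, 7.0        # beta prior parameters

mle = m / (m + l)
prior_mean = a / (a + b)
w_data = (m + l) / (m + a + l + b)     # weight on the MLE, grows with the data size
w_prior = (a + b) / (m + a + l + b)    # weight on the prior mean, shrinks with the data size

posterior_mean = (m + a) / (m + a + l + b)
assert abs(posterior_mean - (w_data * mle + w_prior * prior_mean)) < 1e-12
print(posterior_mean)
```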
Chapter 2.2. Multinomial Variables
9
Multiclass classification
Consider MNIST, which has multiple output classes!
In the previous section, we only thought about two classes, with probabilities 𝜇 and 1 − 𝜇.
Now there are K classes, so
the likelihood for a single observation is defined as 𝑝(𝒙 | 𝝁) = ∏𝑘 𝜇𝑘^𝑥𝑘.
For multiple trials, the likelihood becomes ∏𝑘 𝜇𝑘^𝑚𝑘, where 𝑚𝑘 is the count of class k.
To estimate the parameters, we again use the MLE, but we have the constraint Σ𝑘 𝜇𝑘 = 1.
Here, the Lagrange multiplier 𝝀 = −𝑵 is obtained via
Lagrangian optimization, giving 𝜇𝑘 = 𝑚𝑘 / 𝑁!
Just like the binomial case,
this can be extended to the multinomial distribution!
Chapter 2.2. Multinomial Variables
10
Dirichlet distribution
Now, again, we have to think of a prior distribution for the multinomial distribution!
Recall the form of the beta distribution.
We can say the Dirichlet distribution is the extension of the beta to the multi-class version.
It requires 0 ≤ 𝜇𝑘 ≤ 1 and Σ𝑘 𝜇𝑘 = 1, where 𝜶 denotes (𝛼1, 𝛼2, … , 𝛼𝐾)ᵀ.
The entire process is just an extension of
the previous binomial–beta pair!
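A minimal sketch of the Dirichlet–multinomial update (NumPy; the counts are illustrative): with a Dir(𝜶) prior and class counts m_k, the posterior is Dir(𝜶 + 𝒎).

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])    # symmetric Dirichlet prior over K = 3 classes
counts = np.array([12, 5, 3])        # m_k: observed count for each class

alpha_post = alpha + counts          # posterior is again a Dirichlet distribution
print(alpha_post, alpha_post / alpha_post.sum())   # parameters and posterior mean of mu
```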
Chapter 2.3. Gaussian Distribution
11
Introduction to Gaussian distribution
For statisticians, the Gaussian (normal) distribution is one of the most important distributions!
It has some nice properties, especially the central limit theorem!
Review. C.L.T.
Let 𝑋1, 𝑋2, … , 𝑋𝑛 be a random sample from a distribution with mean 𝜇 and variance 𝜎² < ∞.
Then 𝑌𝑛 = √𝑛 (𝑋̄𝑛 − 𝜇) / 𝜎 has limiting distribution 𝑁(0, 1).
Review. Delta Method.
Consider a continuously differentiable function 𝑢(·). Then 𝑢(𝑋̄𝑛) approximately follows
𝑢(𝑋̄𝑛) ~ 𝑁( 𝑢(𝜇), 𝑢′(𝜇)² ∗ 𝜎²/𝑛 ). This can be easily proven by a Taylor series expansion!
Univariate: 𝑁(𝑥 | 𝜇, 𝜎²) = (2𝜋𝜎²)^(−1/2) exp{ −(𝑥 − 𝜇)² / (2𝜎²) }
Multivariate: 𝑁(𝒙 | 𝝁, 𝚺) = (2𝜋)^(−𝐷/2) |𝚺|^(−1/2) exp{ −½ (𝒙 − 𝝁)ᵀ𝚺⁻¹(𝒙 − 𝝁) }
Chapter 2.3. Gaussian Distribution
12
Analytical properties of gaussian distribution
Look at this quadratic form part!
∆² = (𝑿 − 𝝁)ᵀ 𝚺⁻¹ (𝑿 − 𝝁)
This quantity ∆ is known as the 'Mahalanobis distance!'
Since Σ is a real symmetric matrix, we can perform a spectral (eigenvalue) decomposition on it: 𝚺 = Σ𝑖 𝜆𝑖 𝒖𝑖𝒖𝑖ᵀ.
Here, we can choose the eigenvectors 𝒖𝑖 to form an orthonormal set, satisfying 𝒖𝑖ᵀ𝒖𝑗 = 𝛿𝑖𝑗.
Note that the inverse matrix can then be expressed using
the inverses of the eigenvalues: 𝚺⁻¹ = Σ𝑖 (1/𝜆𝑖) 𝒖𝑖𝒖𝑖ᵀ.
Using this, we can rewrite ∆² as Σ𝑖 𝑦𝑖²/𝜆𝑖, where 𝑦𝑖 = 𝒖𝑖ᵀ(𝒙 − 𝝁).
This result gives the following geometrical intuition: surfaces of constant density are ellipsoids whose axes are the eigenvectors, with lengths governed by the eigenvalues!
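A small NumPy check (with an illustrative covariance matrix) that the eigen-decomposition form Σᵢ yᵢ²/λᵢ reproduces the Mahalanobis distance:

```python
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([2.0, 0.5])

d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # (x-mu)^T Sigma^-1 (x-mu)

lam, U = np.linalg.eigh(Sigma)      # columns of U are orthonormal eigenvectors
y = U.T @ (x - mu)                  # y_i = u_i^T (x - mu)
d2_eig = np.sum(y**2 / lam)         # sum_i y_i^2 / lambda_i

assert np.isclose(d2_direct, d2_eig)
print(d2_direct)
```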
Chapter 2.3. Gaussian Distribution
13
Standardization of normal distribution
We have done Z = (𝑋 − 𝜇)/𝜎 numerous times!
The same holds in the multivariate case!
Recall the transformation 𝒚 = 𝑼(𝒙 − 𝝁), where the rows of 𝑼 are the eigenvectors 𝒖𝑖ᵀ.
Since 𝑼 is orthogonal, the Jacobian determinant |𝑱| is equal to 1!
This transformation implies some important idea!
I. Transformation 𝑋 → 𝑌 makes each variable independent!
II. Overall probability is expressed as a product of each independent normal distribution!
III. This is still a probability density function, since it integrates to 1.
IV. Geometrically, this means the distribution is shifted and rotated!
Below are the moments of the normal distribution, but let's skip them since we all know them.
Chapter 2.3. Gaussian Distribution
14
Conditional normal distribution
Let’s think of gaussian distribution of joint subset!
Overall distribution is 𝑋~𝑁(𝒙, 𝚺). Let’s partition this to 𝑋𝑎, 𝑋𝑏. Then…
First, what we want to achieve is 𝑝(𝑋𝑎|𝑋𝑏). This is a
𝑝 𝑋𝑎,𝑋𝑏
𝑝(𝑋𝑏)
. Since 𝑝(𝑋𝑎|X𝑏) depends on 𝑋𝑎, we have to find the form of 𝑝(𝑋𝑎, 𝑋𝑏)!
Let’s re-write the exponent!
Still function is an exponent form of 𝑋𝑎.
Thus, we can infer the conditional distribution also becomes Gaussian!
Overall calculation is not necessary, and we covered it in multivariate analysis!
Let’s remember this, since it will be continouesly used !
Chapter 2.3. Gaussian Distribution
15
Marginal normal distribution
Again, let’s assume 𝑝 𝑋 = 𝑝 𝑋𝑎, 𝑋𝑏 → 𝑁(𝜇, Σ).
Then, what is a distribution of 𝑝(𝑋𝑎)??
Here, calculation is a bit tricky. Just keep in mind, we can find it again has normal distribution!!
Summary
Chapter 2.3. Gaussian Distribution
16
Bayes theorem for Gaussian distribution
These equations are very useful, and we use them to compute posterior and predictive distributions easily!
They are derived by completing the square in the joint exponent (the result matters more than the derivation!)
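A numerical sketch of these linear-Gaussian results (illustrative matrices): given p(x) = N(x | 𝜇, Λ⁻¹) and p(y | x) = N(y | Ax + b, L⁻¹), the marginal is N(y | A𝜇 + b, L⁻¹ + AΛ⁻¹Aᵀ) and the posterior is N(x | S[AᵀL(y − b) + Λ𝜇], S) with S = (Λ + AᵀLA)⁻¹.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Lambda = np.eye(2)                    # prior precision of x
A = np.array([[1.0, 0.5]])            # y = A x + b + Gaussian noise
b = np.array([0.2])
L = np.array([[4.0]])                 # noise precision of y given x
y = np.array([1.0])                   # observed y

marg_mean = A @ mu + b                                           # mean of p(y)
marg_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lambda) @ A.T    # covariance of p(y)

S = np.linalg.inv(Lambda + A.T @ L @ A)                   # posterior covariance of x | y
post_mean = S @ (A.T @ L @ (y - b) + Lambda @ mu)         # posterior mean of x | y
print(marg_mean, marg_cov, post_mean)
```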
Chapter 2.3. Gaussian Distribution
17
Parameter estimation
Note that μ & Σ are the unknown parameters! Thus, we have to estimate them by MLE!!
By derivative… we’ve done it so~ many~ times~ in mathematical statistics II, let’s skip the procedure!
Note that the ML estimator of the covariance
is a biased estimator!
Sequential estimation
This gives pretty interesting intuition! Take a look at following functional value!
Consider data 𝑋1, 𝑋2, … , 𝑋𝑁−1 were observed, and we just observed 𝑋𝑁.
Then, the mean estimate moves a little toward 𝑋𝑁:
𝜇ML^(𝑁) = 𝜇ML^(𝑁−1) + (1/𝑁)(𝑥𝑁 − 𝜇ML^(𝑁−1)),
i.e., by a fraction 1/𝑁 of the gap 𝑥𝑁 − 𝜇ML^(𝑁−1)!
This gives a clear intuition for sequential numerical approaches.
However, such a simple closed-form sequential update is not always available for general MLE problems.
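A tiny sketch (NumPy, synthetic data) verifying that repeatedly applying 𝜇^(N) = 𝜇^(N−1) + (x_N − 𝜇^(N−1))/N reproduces the batch sample mean exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=1000)

mu = 0.0
for n, x_n in enumerate(x, start=1):
    mu = mu + (x_n - mu) / n        # move a fraction 1/N toward the newest observation

assert np.isclose(mu, x.mean())     # identical to the batch MLE of the mean
print(mu)
```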
Chapter 2.3. Gaussian Distribution
18
Robbins and Monro algorithm
There are various methods for finding the root of a regression function 𝑓(𝜃) = 𝐸[𝑧 | 𝜃]! Let's see the Robbins & Monro method!
There are some assumptions:
a. The conditional variance is finite: 𝐸[(𝑧 − 𝑓)² | 𝜃] < ∞
b. 𝑓(𝜃) > 0 for 𝜃 > 𝜃★
c. 𝑓(𝜃) < 0 for 𝜃 < 𝜃★
The update equation is 𝜃^(𝑁) = 𝜃^(𝑁−1) − 𝑎𝑁−1 𝑧(𝜃^(𝑁−1)) (here 𝑧(𝜃) is a value observed at the current estimate!),
where 𝑎𝑁 is a sequence of positive numbers satisfying 𝑎𝑁 → 0, Σ 𝑎𝑁 = ∞, and Σ 𝑎𝑁² < ∞.
This can be applied to MLE! (Since sometimes solving "derivative = 0" in closed form is hard.)
For example, for the Gaussian mean, with 𝑧 = −𝜕/𝜕𝜇 ln 𝑝(𝑥 | 𝜇), the update becomes 𝜇^(𝑁) = 𝜇^(𝑁−1) + 𝑎𝑁−1 (𝑥𝑁 − 𝜇^(𝑁−1))/𝜎².
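A rough sketch of a Robbins–Monro style sequential MLE for the Gaussian mean (assumptions: the variance 𝜎² is known, the step sizes a_N = 𝜎²/N are one valid choice, and the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 4.0
x = rng.normal(10.0, np.sqrt(sigma2), size=5000)

theta = 0.0                                   # initial estimate of the mean
for n, x_n in enumerate(x, start=1):
    a_n = sigma2 / n                          # a_n -> 0, sum a_n = inf, sum a_n^2 < inf
    score = (x_n - theta) / sigma2            # d/d theta ln p(x_n | theta)
    theta = theta + a_n * score               # stochastic step toward the root of E[score]

print(theta)                                  # close to the true mean 10.0
```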
Chapter 2.3. Gaussian Distribution
19
Bayesian inference for the Gaussian mean (Normal - Normal)
Here, we are trying to find the prior and posterior for the parameters of a Gaussian distribution!
The parameters are 𝜇 & Σ, and we place distributions over them.
First, let's think of the mean parameter 𝜇.
The likelihood is an exponential of a quadratic form in 𝜇, and this implies the prior for 𝝁 can also be Gaussian!
We have set
Prior : Gaussian
Likelihood : Gaussian
And outcome posterior also
becomes Gaussian!
* I skipped the detailed calculation
Note that mean of posterior is a
weighted average of prior mean
and likelihood mean!
Here, as 𝑁 → ∞, the posterior precision goes to
infinity (the posterior collapses onto the MLE)!
(Figure: changes of the posterior of 𝝁 as data are observed.)
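A small sketch of this normal–normal update with known variance (standard closed-form result; prior parameters and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                                  # known data variance
x = rng.normal(2.0, np.sqrt(sigma2), size=20)

mu0, sigma0_2 = 0.0, 10.0                     # Gaussian prior N(mu | mu0, sigma0^2)
N, xbar = len(x), x.mean()

mu_N = (sigma2 * mu0 + N * sigma0_2 * xbar) / (N * sigma0_2 + sigma2)   # posterior mean
sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)                          # posterior variance
print(mu_N, sigma_N2)   # the mean shrinks toward xbar and the variance shrinks as N grows
```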
Chapter 2.3. Gaussian Distribution
20
Sequential approach
Consider we have observed 𝑥1, … , 𝑥𝑁−1, and we just observed 𝑥𝑁.
Previous posterior can be re-expressed as…
(The previous posterior plays the role of the prior, and 𝑥𝑁 contributes the new likelihood factor.)
Inference of the variance (Gamma prior – Gamma posterior pair)
We assumed the variance was known; now we move on to unknown variance (with known mean)!
Here, let 𝜆 = 1/𝜎², the precision! Then the likelihood function becomes…
We are trying to set a "conjugate prior". We use the gamma distribution as the prior!
The final posterior becomes 𝐺𝑎𝑚𝑚𝑎(𝑎𝑁, 𝑏𝑁), with 𝑎𝑁 = 𝑎0 + 𝑁/2 and 𝑏𝑁 = 𝑏0 + ½ Σ𝑛 (𝑥𝑛 − 𝜇)².
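A minimal sketch of this Gamma update for the precision with known mean (the prior values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 0.0                                 # known mean
x = rng.normal(mu, 2.0, size=100)        # true precision lambda = 1 / 4

a0, b0 = 1.0, 1.0                        # Gamma prior on the precision
N = len(x)
a_N = a0 + N / 2.0
b_N = b0 + 0.5 * np.sum((x - mu) ** 2)

print(a_N / b_N)                         # posterior mean of lambda, roughly 0.25
```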
Chapter 2.3. Gaussian Distribution
21
Multi parameters
Now, let’s think of case where we don’t know 𝜇 & Σ together!
Normally, it is reasonable to assume that 𝑝(𝜇 | 𝜆) follows a Gaussian distribution whose precision is a linear function of 𝜆.
Thus, we can rewrite the joint prior as 𝑝(𝜇, 𝜆) = 𝑝(𝜇 | 𝜆) 𝑝(𝜆) = 𝑁(𝜇 | 𝜇0, (𝛽𝜆)⁻¹) 𝐺𝑎𝑚𝑚𝑎(𝜆 | 𝑎, 𝑏),
where 𝜇0 = 𝑐/𝛽, 𝑎 = 1 + 𝛽/2, 𝑏 = 𝑑 − 𝑐²/(2𝛽).
This distribution is known as normal-gamma distribution!
Remember we are managing conjugate prior!
Normal-gamma
Wishart distribution
For the multivariate Gaussian,
we want a distribution over the precision matrix, for known mean!
Here, the conjugate prior is the Wishart distribution.
Here, 𝑊 is a scale matrix of size 𝐷 × 𝐷.
If both the mean and the precision are unknown,
then the conjugate prior is given by the
normal–Wishart distribution!
Chapter 2.3. Gaussian Distribution
22
Student’s t-distribution
We all know the t-distribution well: it arises as
𝑡 ~ 𝑍 / √(𝑉/𝑘), where Z is a standard normal variable, V is a chi-square variable with k degrees of freedom, and Z and V are independent!
Here, we have 𝑁(𝑥 | 𝜇, 𝜏⁻¹) together with a gamma prior 𝐺𝑎𝑚(𝜏 | 𝑎, 𝑏). Can we make a t-distribution with these?
We are trying to find the distribution of x under all possible precision values. So, let's marginalize out the precision 𝜏.
Here 𝝂 denotes the degrees of freedom.
𝝂 = 𝟏 gives the Cauchy distribution; as 𝝂 → ∞, it approaches the Gaussian 𝑵(𝒙 | 𝝁).
This becomes Student's t-distribution.
This means the t-distribution is an infinite mixture of Gaussians
with different precisions!
Note that the t-distribution has heavier tails, which is connected to "robustness":
that is, it is relatively less sensitive to outliers.
(Formulas: univariate and multivariate Student's t densities.)
Chapter 2.3. Gaussian Distribution
23
Mixtures of Gaussians
Till now, we considered unimodal models.
What will be the shape of a distribution with several modes?
We can use a linear combination of Gaussian distributions!
Here, the mixing coefficients 𝜋𝑘 should satisfy 0 ≤ 𝜋𝑘 ≤ 1 and Σ𝑘 𝜋𝑘 = 1.
Here, each sub-distribution is called a 'component'.
Then, how can we estimate "which component should we assign 𝑋𝑛 to?" → Let's use Bayes' theorem!
To which component should we assign a specific data point?
Estimation of 𝝅, 𝝁, 𝚺 will be covered in Chapter 9,
via the EM algorithm!
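A small sketch (1-D, made-up parameters) of assigning a point to a component with Bayes' theorem, i.e. computing the responsibilities γ_k ∝ π_k N(x | μ_k, σ_k²):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients (sum to 1)
mu = np.array([-2.0, 0.0, 3.0])     # component means
var = np.array([1.0, 0.5, 2.0])     # component variances

x = 0.4
joint = pi * gauss(x, mu, var)      # pi_k * N(x | mu_k, sigma_k^2)
resp = joint / joint.sum()          # posterior probability of each component
print(resp, resp.argmax())          # assign x to the most responsible component
```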
Chapter 2.4. Exponential Family
24
In mathematical statistics..
We have learned that (full-rank) exponential-family distributions are complete!
(Completeness: 𝐸[𝑢(𝑋)] = 0 for every parameter value if and only if 𝑢(𝑋) = 0 almost surely.)
Definition
The exponential family of distributions over 𝑋, given parameter 𝜂, is defined to be the set of distributions of the form 𝑝(𝒙 | 𝜼) = ℎ(𝒙) 𝑔(𝜼) exp{𝜼ᵀ𝒖(𝒙)}.
We can consider 𝑔(𝜂) to be a normalization constant that makes the distribution integrate to 1.
Ex 1. The Bernoulli distribution can be brought into this shape, with natural parameter 𝜼 = ln(𝝁/(1 − 𝝁)), so that
𝝁 = 𝝈(𝜼) = 𝟏 / (𝟏 + 𝐞𝐱𝐩(−𝜼)).
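A tiny numeric check (the value of 𝜇 is arbitrary) that the Bernoulli pmf can be rewritten in this exponential-family shape, with η = ln(μ/(1 − μ)) and g(η) = σ(−η) = 1 − μ:

```python
import numpy as np

mu = 0.3
eta = np.log(mu / (1 - mu))             # natural parameter
mu_back = 1.0 / (1.0 + np.exp(-eta))    # sigmoid recovers mu

for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)          # standard Bernoulli pmf
    exp_family = (1 - mu) * np.exp(eta * x)     # h(x) g(eta) exp(eta * x)
    assert np.isclose(direct, exp_family)
print(mu_back)
```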
Chapter 2.4. Exponential Family
25
Multinomial distribution
Let’s extent previous example to the multinomial distribution!
Where 𝜂𝑘 = ln 𝜇𝑘
In fact, we didn’t consider the constraint
Let’s re-write the equation by…
𝟏 −
𝒌=𝟏
𝑴−𝟏
𝝁𝒌 = 𝝁𝑴
Chapter 2.4. Exponential Family
26
Gaussian distribution
Everything is the same for the Gaussian; it is also a member of the exponential family!
Maximum likelihood and sufficient statistics
For the exponential family, it is easy to obtain the moments!
Differentiating the normalization condition with respect to 𝜂 gives
−𝜕/𝜕𝜂 ln 𝑔(𝜂) = 𝐸[𝒖(𝒙)].
Similarly, we can get higher-order moments
by taking further derivatives of this term!
Chapter 2.4. Exponential Family
27
Usage of sufficient statistics
Let’s calculate the likelihood function of exponential family!
Setting the gradient of the log-likelihood to zero, we get −∇ ln 𝑔(𝜂ML) = (1/𝑁) Σ𝑛 𝒖(𝒙𝑛).
In principle this relation can be inverted (via 𝑔 and its inverse 𝑔⁻¹(·)) to obtain the MLE!
This result indicates that the solution for 𝜼𝑴𝑳 depends on the data only through the sample mean of 𝒖(𝒙𝒏)!
That sum is a sufficient statistic!!
Conjugate prior
Recall the conjugate prior: the posterior has the same functional form as the prior distribution.
Note that a conjugate prior always exists for the exponential family!
Prior
Likelihood
Posterior
Note that the prior & posterior have
the same functional form!!
Chapter 2.4. Exponential Family
28
Noninformative priors
In Bayesian inference, we can set a reasonable prior using other information (knowledge)!
However, in some cases, we might not have enough (or accurate) information.
Here, we want the prior to have as little influence as possible! → Noninformative prior
(This example is from prof. Kang’s lecture note.)
Consider binomial likelihood and beta prior & posterior.
If we do not have any prior information, it is reasonable to set 𝛼 = 𝛽 = 1.
Then, it becomes uniform distribution!
But the prior still exerts an influence, pulling the posterior mean toward
𝟏/𝟐 (the prior mean)!
Here, we could set 𝛼 = 𝛽 = 0, but then the prior is not a pdf anymore.
Like this, if a prior itself is not a pdf but its posterior is a proper pdf, such a prior is called an improper prior.
Note that,
I. Noninformative prior is not always improper.
II. Likewise, improper prior is not always noninformative!
The two notions are not equivalent (it is not an "if and only if")!
Chapter 2.4. Exponential Family
29
Noninformative priors
Translation invariance
With noninformative priors, there are some difficulties regarding variable transformation.
Simply, think of this:
take a constant prior density ℎ(𝜆) = 𝑐, and reparameterize with 𝜆 = 𝜂².
The function value is unchanged, ℎ(𝜂²) = 𝑐,
but the density is not: the change of variables brings in a Jacobian factor, thus…
the transformed pdf now depends on 𝜂 and is not constant anymore…
Note that this issue does not arise with the likelihood function, since the likelihood is not a density over the parameter, so no Jacobian factor appears!
Examples of noninformative prior
Location parameter: shifting the data, 𝒙̂ = 𝒙 + 𝒄, shifts the mean, 𝝁̂ = 𝝁 + 𝒄.
Thus, we need a prior that gives equal probability to
𝐴 ≤ 𝜇 ≤ 𝐵 and 𝐴 − 𝑐 ≤ 𝜇 ≤ 𝐵 − 𝑐, i.e., a flat prior over 𝜇.
An example is a Gaussian prior
with 𝝈𝟎² → ∞.
Scale invariance
Scale parameter: rescaling the data, 𝒙̂ = 𝒄𝒙, rescales the standard deviation, 𝝈̂ = 𝒄𝝈.
Thus, we need a prior that gives equal probability to
𝐴 ≤ 𝜎 ≤ 𝐵 and 𝐴/𝑐 ≤ 𝜎 ≤ 𝐵/𝑐, i.e., 𝑝(𝜎) ∝ 1/𝜎.
In terms of the precision, an example is the limiting
𝒈𝒂𝒎(𝝀 | 𝟎, 𝟎) prior.
Chapter 2.5. Nonparametric Methods
30
Nonparametric approach
Till now, we assumed a specific parametric form for the probability distribution.
Now, we turn to nonparametric density estimation: we try to estimate the density directly from the data.
Histogram
We have learned histogram from ‘intro to stat’ to ‘nonparametric statistics’!
We count the number 𝑛𝑖 of data points that fall into the 𝑖-th
bin, giving the density estimate 𝑝𝑖 = 𝑛𝑖 / (𝑁 Δ𝑖)!
Commonly, we set all Δ𝑖 to be same! (Constant)
- We do not need the data anymore once the density has been computed.
- Useful for quick visualization.
- For multi-dimensional data, M bins per dimension with D dimensions gives 𝑀^𝐷
bins, which quickly becomes infeasible!
- The lesson: to estimate the density at a particular point, we should look at the
data points lying in some local neighbourhood of that point.
- Furthermore, we need an adequate value for the smoothing parameter, the bin width Δ.
Chapter 2.5. Nonparametric Methods
31
Kernel density method
Let’s generalize this idea. There is an unknown probability density 𝑝(𝑥).
Consider a specific region in data space, which is ℛ. Probability of data to be in ℛ is
Then, under i.i.d. assumption, each data has probability of P to fall into region ℛ! For N data, this can be expressed as binomial distribution,
I. From this, we can infer that 𝐸[𝐾/𝑁] = 𝑃 and 𝑉[𝐾/𝑁] = 𝑃(1 − 𝑃)/𝑁.
II. As 𝑁 → ∞, this distribution becomes very sharp (variance approximately zero), thus 𝐾 ≅ 𝑁𝑃.
III. If ℛ is sufficiently small, the integral can be approximated by a box, thus 𝑃 ≅ 𝑝(𝒙)𝑉, where V is the volume of ℛ.
IV. Combining the results of 𝐼𝐼 & 𝐼𝐼𝐼, we get 𝑝(𝒙) ≅ 𝐾 / (𝑁𝑉).
Kernel estimation assumes:
1. the region is very small;
2. a large number of data points.
Here, we can either fix K or V (N is the number of data points, which is given!).
Fixing V and counting K gives the kernel method; fixing K and finding V gives the nearest-neighbour method!
Chapter 2.5. Nonparametric Methods
32
Kernel density method
Before getting into the kernel method in detail, consider the kernel function 𝑘(𝒖) of a unit hypercube (the Parzen window).
That means, for a specific point 𝒙, we count the number of data points that fall into the hypercube centred at 𝒙.
ℎ^𝐷 is the volume of the hypercube with side ℎ!
However, this basic kernel gives a discontinuous, stair-shaped density estimate.
Thus, we can use a Gaussian kernel to get a smooth shape!
Using a Gaussian kernel…
(see how the density changes as h changes…)
We can choose any kernel
that satisfies the right conditions: 𝑘(𝒖) ≥ 0 and ∫ 𝑘(𝒖) d𝒖 = 1!
Do you have any idea?
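A minimal Gaussian kernel density estimator in one dimension (NumPy; the bandwidth h = 0.3 and the synthetic data are illustrative), implementing p(x) = (1/N) Σₙ N(x | xₙ, h²):

```python
import numpy as np

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])

def kde_gaussian(grid, data, h):
    # p(x) = (1/N) sum_n Gaussian(x | x_n, h^2); smaller h gives a spikier estimate
    diffs = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-5, 5, 201)
density = kde_gaussian(grid, data, h=0.3)
print(np.trapz(density, grid))      # should be close to 1
```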
Chapter 2.5. Nonparametric Methods
33
Nearest-neighbor methods
The best value of the hyperparameter ℎ depends on the data, and it is pretty hard to find an adequate value of ℎ.
We can overcome this issue by using the nearest-neighbour method, which fixes K and lets the volume V adapt to the local density of the data: 𝑝(𝒙) ≅ 𝐾/(𝑁𝑉)!
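A rough sketch of the K-nearest-neighbour density estimate p(x) ≅ K/(N·V) in one dimension (K = 20 and the data are illustrative; V is the length of the smallest interval around x containing K points):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, size=500)

def knn_density(x, data, K):
    # p(x) ~= K / (N * V), with V = 2 * (distance to the K-th nearest neighbour)
    dists = np.sort(np.abs(data - x))
    V = 2.0 * dists[K - 1]
    return K / (len(data) * V)

print(knn_density(0.0, data, K=20))   # roughly the standard normal density at 0 (about 0.40)
```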