Machine Learning for Data Mining
Maximum A Posteriori (MAP)
Andres Mendez-Vazquez
July 22, 2015
1 / 66
Outline
1 Introduction
A first solution
Example
Properties of the MAP
2 Example of Application of MAP and EM
Example
Linear Regression
The Gaussian Noise
A Hierarchical-Bayes View of the Laplacian Prior
Sparse Regression via EM
Jeffrey’s Prior
2 / 66
Introduction
We go back to the Bayesian Rule
p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)} \qquad (1)
We now seek the value of Θ, called Θ_MAP,
that maximizes the posterior p(Θ|X).
3 / 66
Development of the solution
We look for Θ_MAP
\Theta_{MAP} = \arg\max_{\Theta} p(\Theta \mid X)
= \arg\max_{\Theta} \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}
= \arg\max_{\Theta} p(X \mid \Theta)\, p(\Theta)
= \arg\max_{\Theta} \prod_{x_i \in X} p(x_i \mid \Theta)\, p(\Theta) \quad \text{(assuming i.i.d. samples)}
p(X) can be removed because it does not depend on Θ, so it does not change the argmax.
5 / 66
We can make this easier
Use logarithms
\Theta_{MAP} = \arg\max_{\Theta} \left[ \sum_{x_i \in X} \log p(x_i \mid \Theta) + \log p(\Theta) \right] \qquad (2)
6 / 66
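As a quick illustration (not part of the original slides), the sketch below maximizes the bracketed objective of Equation (2) over a grid of parameter values; the Gaussian likelihood and Gaussian prior used here are placeholder assumptions, and NumPy/SciPy are assumed to be available.

import numpy as np
from scipy.stats import norm

# Hypothetical setup (not from the slides): estimate the mean Theta of a
# Gaussian with known unit variance, under a N(0, 2^2) prior on Theta.
rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=1.0, size=50)              # observed samples x_i

theta_grid = np.linspace(-5.0, 5.0, 10001)               # candidate Theta values
log_lik = norm.logpdf(X[:, None], loc=theta_grid, scale=1.0).sum(axis=0)
log_prior = norm.logpdf(theta_grid, loc=0.0, scale=2.0)

theta_map = theta_grid[np.argmax(log_lik + log_prior)]   # Equation (2)
print(theta_map)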
What Does the MAP Estimate Get
Something Notable
The MAP estimate allows us to inject into the estimation calculation our
prior beliefs regarding the parameter values in Θ.
For example
Let us conduct N independent trials of the following Bernoulli experiment
with parameter q:
We ask each individual we run into in the hallway whether they
will vote PRI or PAN in the next presidential election.
Each individual votes PRI with probability q.
The value of each xi is either PRI or PAN.
8 / 66
First the ML estimate
Samples
X = \{x_i : x_i \in \{\text{PAN}, \text{PRI}\},\ i = 1, \ldots, N\} \qquad (3)
The log likelihood function
\log P(X \mid q) = \sum_{i=1}^{N} \log p(x_i \mid q)
= \sum_{i:\, x_i = \text{PRI}} \log q + \sum_{i:\, x_i = \text{PAN}} \log(1 - q)
= n_{PRI} \log q + (N - n_{PRI}) \log(1 - q)
Where n_{PRI} is the number of individuals who are planning to vote PRI
this fall.
9 / 66
We use our classic tricks
By setting
L = \log P(X \mid q) \qquad (4)
We have that
\frac{\partial L}{\partial q} = 0 \qquad (5)
Thus
\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1 - q} = 0 \qquad (6)
10 / 66
Final Solution of ML
We get
q_{ML} = \frac{n_{PRI}}{N} \qquad (7)
Thus
If N = 20 and 12 of the respondents are going to vote PRI, we get q_{ML} = 0.6.
11 / 66
Building the MAP estimate
Obviously we need a prior belief distribution
We have the following constraints:
The prior for q must be zero outside the [0,1] interval.
Within the [0,1] interval, we are free to specify our beliefs in any way
we wish.
In most cases, we would want to choose a distribution for the prior
beliefs that peaks somewhere in the [0, 1] interval.
We assume the following
The state of Colima has traditionally voted PRI in presidential elections.
However, on account of the prevailing economic conditions, the voters are
more likely to vote PAN in the election in question.
12 / 66
What prior distribution can we use?
We could use a Beta distribution, parametrized by two values α and β
P(q) = \frac{1}{B(\alpha, \beta)}\, q^{\alpha - 1} (1 - q)^{\beta - 1} \qquad (8)
Where
B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} is the beta function, and Γ is the
generalization of the factorial to the real numbers.
Properties
When α, β > 1, the beta distribution has its mode (maximum value) at
\frac{\alpha - 1}{\alpha + \beta - 2} \qquad (9)
13 / 66
We then do the following
We do the following
We can choose α = β so that the beta prior peaks at 0.5.
As a further expression of our belief
We make the choice α = β = 5.
Why? Look at the variance of the beta distribution
\mathrm{Var}(q) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \qquad (10)
14 / 66
Thus, we have the following nice properties
We have a variance with α = β = 5
Var(q) ≈ 0.023
Thus, the standard deviation
sd ≈ 0.15, which is a nice dispersion around the peak!!!
15 / 66
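A quick numerical check of these prior choices; this is a sketch assuming NumPy and SciPy, not part of the original slides.

import numpy as np
from scipy.stats import beta as beta_dist

# Check the Beta(5, 5) prior used on these slides.
a, b = 5, 5
mode = (a - 1) / (a + b - 2)                      # Equation (9): 0.5
var = a * b / ((a + b) ** 2 * (a + b + 1))        # Equation (10): ~0.023
print(mode, var, np.sqrt(var))                    # sd ~ 0.15
print(beta_dist(a, b).var())                      # same value via SciPy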
Now, our MAP estimate for q...
We have then
q_{MAP} = \arg\max_{q} \left[ \sum_{x_i \in X} \log p(x_i \mid q) + \log p(q) \right] \qquad (11)
Plugging in the log likelihood from before
q_{MAP} = \arg\max_{q} \left[ n_{PRI} \log q + (N - n_{PRI}) \log(1 - q) + \log p(q) \right] \qquad (12)
Where
\log p(q) = \log \left[ \frac{1}{B(\alpha, \beta)}\, q^{\alpha - 1} (1 - q)^{\beta - 1} \right] \qquad (13)
16 / 66
The log of p(q)
We have that
\log p(q) = (\alpha - 1) \log q + (\beta - 1) \log(1 - q) - \log B(\alpha, \beta) \qquad (14)
Now, taking the derivative with respect to q and setting it to zero, we get
\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1 - q} - \frac{\beta - 1}{1 - q} + \frac{\alpha - 1}{q} = 0 \qquad (15)
Thus
q_{MAP} = \frac{n_{PRI} + \alpha - 1}{N + \alpha + \beta - 2} \qquad (16)
17 / 66
Now
With N = 20, n_{PRI} = 12, and α = β = 5
q_{MAP} = \frac{12 + 5 - 1}{20 + 5 + 5 - 2} = \frac{16}{28} \approx 0.571
18 / 66
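The two estimates for the voting example can be reproduced with a few lines; a sketch, not part of the original slides.

# Voting example from the slides: N = 20 respondents, 12 of them say PRI,
# with a Beta(5, 5) prior on q.
N, n_pri = 20, 12
a, b = 5, 5

q_ml = n_pri / N                              # Equation (7):  0.600
q_map = (n_pri + a - 1) / (N + a + b - 2)     # Equation (16): ~0.571
print(q_ml, q_map)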
Properties
First
MAP estimation “pulls” the estimate toward the prior.
Second
The more focused our prior belief, the larger the pull toward the prior.
Example
If α = β is set to a large value,
the MAP estimate is pulled closer to the prior mode of 0.5.
20 / 66
Properties
Third
In the expression we derived for qMAP, the parameters α and β play a
“smoothing” role vis-a-vis the measurement nPRI .
Fourth
Since we referred to q as the parameter to be estimated, we can refer to α
and β as the hyper-parameters in the estimation calculations.
21 / 66
Beyond simple derivation
In the previous technique
We took the logarithm of the likelihood × the prior to obtain a function that
can be differentiated in order to obtain each of the parameters to be estimated.
What if we cannot differentiate?
For example, when the objective contains terms like |θi|, which are not differentiable at zero.
We can try the following
Combine EM + MAP to estimate the sought parameters.
22 / 66
Example
This application comes from
“Adaptive Sparseness for Supervised Learning” by Mário A.T. Figueiredo
In
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE
INTELLIGENCE, VOL. 25, NO. 9, SEPTEMBER 2003
24 / 66
Introduction: Linear Regression with Gaussian Prior
We consider regression functions that are linear with respect to the
parameter vector β
f(x, \beta) = \sum_{i=1}^{k} \beta_i h_i(x) = \beta^T h(x)
Where
h(x) = [h_1(x), \ldots, h_k(x)]^T is a vector of k fixed functions of the input,
often called features.
26 / 66
Actually, it can be...
Linear Regression
Linear regression, in which h(x) = [1, x_1, \ldots, x_d]^T; in this case, k = d + 1.
Non-Linear Regression
Here, you have fixed basis functions, where
h(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_k(x)]^T with \phi_1(x) = 1.
Kernel Regression
Here, h(x) = [1, K(x, x_1), K(x, x_2), \ldots, K(x, x_n)]^T, where K(x, x_i) is
some kernel function.
27 / 66
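As an illustration of these choices of h(x), here is a small sketch (not from the slides; the Gaussian kernel and the helper names are my own assumptions):

import numpy as np

def h_linear(x):
    """Linear regression features: h(x) = [1, x_1, ..., x_d]^T, so k = d + 1."""
    return np.concatenate(([1.0], x))

def h_kernel(x, X_train, gamma=1.0):
    """Kernel regression features: h(x) = [1, K(x, x_1), ..., K(x, x_n)]^T,
    here with a Gaussian (RBF) kernel as an illustrative choice."""
    k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    return np.concatenate(([1.0], k))

# The design matrix H stacks one row h(x_i)^T per training point.
X_train = np.random.default_rng(1).normal(size=(5, 3))
H = np.vstack([h_linear(x) for x in X_train])      # shape (5, 4)
print(H.shape)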
Gaussian Noise
We assume that the training set is contaminated by additive white
Gaussian noise
y_i = f(x_i, \beta) + \omega_i \qquad (17)
for i = 1, \ldots, N, where [\omega_1, \ldots, \omega_N] is a set of independent zero-mean
Gaussian samples with variance σ².
Thus, for y = [y_1, \ldots, y_N]^T, we have the following likelihood
p(y \mid \beta) = \mathcal{N}(y \mid H\beta, \sigma^2 I) \qquad (18)
where H is the matrix whose ith row is h(x_i)^T.
Under a zero-mean Gaussian prior on β with identity covariance, the posterior p(β|y)
is still Gaussian and its mode is given by
\hat{\beta} = \left( \sigma^2 I + H^T H \right)^{-1} H^T y \qquad (19)
Remark: This is Ridge regression.
29 / 66
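A sketch of Equation (19) on synthetic data (not from the slides); it uses a linear solve rather than an explicit matrix inverse:

import numpy as np

# Posterior mode / ridge solution of Equation (19) on synthetic data.
rng = np.random.default_rng(2)
H = rng.normal(size=(50, 4))                       # design matrix
beta_true = np.array([1.0, 0.0, -2.0, 0.5])
sigma2 = 0.25
y = H @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=50)

beta_hat = np.linalg.solve(sigma2 * np.eye(4) + H.T @ H, H.T @ y)
print(beta_hat)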
Regression with a Laplacian Prior
In order to favor sparse estimates
p(\beta \mid \alpha) = \prod_{i=1}^{k} \frac{\alpha}{2} \exp\{-\alpha |\beta_i|\} = \left(\frac{\alpha}{2}\right)^k \exp\{-\alpha \|\beta\|_1\} \qquad (20)
Thus, the MAP estimate of β looks like
\hat{\beta} = \arg\min_{\beta} \|H\beta - y\|_2^2 + 2\sigma^2 \alpha \|\beta\|_1 \qquad (21)
30 / 66
Remark
This criterion is known as the LASSO
The l1 norm induces sparsity in the weight terms.
How?
For example, \|[1, 0]^T\|_2 = \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_2 = 1.
On the other hand, \|[1, 0]^T\|_1 = 1 < \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_1 = \sqrt{2}.
Thus, among vectors of the same l2 length, the l1 norm is smallest for vectors that
concentrate their weight on few coordinates, so penalizing it favors sparse solutions.
31 / 66
An example
What if H is an orthogonal matrix?
In this case H^T H = I
Thus
\hat{\beta} = \arg\min_{\beta} \|H\beta - y\|_2^2 + 2\sigma^2 \alpha \|\beta\|_1
= \arg\min_{\beta} (H\beta - y)^T (H\beta - y) + 2\sigma^2 \alpha \sum_{i=1}^{k} |\beta_i|
= \arg\min_{\beta} \beta^T H^T H \beta - 2\beta^T H^T y + y^T y + 2\sigma^2 \alpha \sum_{i=1}^{k} |\beta_i|
= \arg\min_{\beta} \beta^T \beta - 2\beta^T H^T y + y^T y + 2\sigma^2 \alpha \sum_{i=1}^{k} |\beta_i|
32 / 66
We can solve this last part as follows
We can group the terms for each β_i
\beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2 \alpha |\beta_i| + y_i^2 \qquad (22)
If we minimize each group separately, we obtain the full solution
\hat{\beta}_i = \arg\min_{\beta_i} \beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2 \alpha |\beta_i| \qquad (23)
We have two cases
β_i > 0
β_i < 0
33 / 66
If β_i > 0
We differentiate with respect to β_i and set the result to zero
\frac{\partial}{\partial \beta_i} \left[ \beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2 \alpha \beta_i \right] = 2\beta_i - 2 (H^T y)_i + 2\sigma^2 \alpha = 0
We have then
\beta_i = (H^T y)_i - \sigma^2 \alpha \qquad (24)
34 / 66
If β_i < 0
We differentiate with respect to β_i and set the result to zero
\frac{\partial}{\partial \beta_i} \left[ \beta_i^2 - 2\beta_i (H^T y)_i - 2\sigma^2 \alpha \beta_i \right] = 2\beta_i - 2 (H^T y)_i - 2\sigma^2 \alpha = 0
We have then
\beta_i = (H^T y)_i + \sigma^2 \alpha \qquad (25)
35 / 66
The value of (H^T y)_i
We have that:
If β_i > 0, then (H^T y)_i > \sigma^2 \alpha.
If β_i < 0, then (H^T y)_i < -\sigma^2 \alpha.
36 / 66
We can put all this together
A compact version
\hat{\beta}_i = \mathrm{sgn}\left( (H^T y)_i \right) \left( \left| (H^T y)_i \right| - \sigma^2 \alpha \right)_+ \qquad (26)
With
(a)_+ = a if a ≥ 0, and (a)_+ = 0 if a < 0,
and “sgn” is the sign function.
This rule is known as
The soft threshold!!!
37 / 66
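A minimal sketch of the soft-threshold rule of Equation (26) (assuming NumPy; not part of the original slides):

import numpy as np

def soft_threshold(z, t):
    """Equation (26): sgn(z) * (|z| - t)_+, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# For an orthogonal H, the MAP/LASSO solution is obtained coordinate-wise:
# beta_i = soft_threshold((H^T y)_i, sigma^2 * alpha).
z = np.array([-2.5, -0.3, 0.1, 1.7])
print(soft_threshold(z, 1.0))     # -> -1.5, 0, 0, 0.7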
Example
[Figure: the soft-threshold rule, shown as a dotted line, plotted over the range −3 to 3.]
38 / 66
Now
Given that each β_i has a zero-mean Gaussian prior
p(\beta_i \mid \tau_i) = \mathcal{N}(\beta_i \mid 0, \tau_i) \qquad (27)
Where τ_i has the following exponential hyperprior
p(\tau_i \mid \gamma) = \frac{\gamma}{2} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\} \quad \text{for } \tau_i \geq 0 \qquad (28)
The integral below is tough to work out, so we take the result on faith
p(\beta_i \mid \gamma) = \int_0^{\infty} p(\beta_i \mid \tau_i)\, p(\tau_i \mid \gamma)\, d\tau_i = \frac{\sqrt{\gamma}}{2} \exp\left\{ -\sqrt{\gamma}\, |\beta_i| \right\} \qquad (29)
40 / 66
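Equation (29) can be checked numerically; the sketch below (assuming SciPy, not part of the original slides) integrates the Gaussian-times-exponential mixture and compares it with the Laplacian density:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Numerical check of Equation (29) at arbitrary test values.
gamma_, beta_i = 4.0, 0.7

integrand = lambda tau: norm.pdf(beta_i, 0.0, np.sqrt(tau)) * (gamma_ / 2.0) * np.exp(-gamma_ * tau / 2.0)
mixture, _ = quad(integrand, 0.0, np.inf)
laplacian = (np.sqrt(gamma_) / 2.0) * np.exp(-np.sqrt(gamma_) * abs(beta_i))
print(mixture, laplacian)          # the two values agree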
Example
[Figure: the double exponential (Laplacian) density.]
41 / 66
This is equivalent to the use of the L1-norm for regularization
[Figure: L1 (left) is better for sparsity promotion than L2 (right).]
42 / 66
The EM trick
How do we do this?
This is done by regarding τ = [τ_1, ..., τ_k] as the hidden/missing data.
Then, if we could observe τ, the complete log-posterior \log p(\beta, \sigma^2 \mid y, \tau)
could be easily calculated, since
p(\beta, \sigma^2 \mid y, \tau) \propto p(y \mid \beta, \sigma^2)\, p(\beta \mid \tau)\, p(\sigma^2) \qquad (30)
Where
p(y \mid \beta, \sigma^2) = \mathcal{N}(y \mid H\beta, \sigma^2 I)
p(\beta \mid \tau) = \mathcal{N}(\beta \mid 0, \mathrm{diag}(\tau_1, \ldots, \tau_k))
44 / 66
What about p(σ²)?
We select
p(σ²) as a constant (a flat prior).
However
We could adopt a conjugate inverse Gamma prior for σ², but for a large
number of samples the influence of this prior on the estimate of σ² is very small.
In the constant case
We can use the MAP idea; however, we have the hidden variables τ, so we
resort to the EM algorithm.
45 / 66
E-step
Computes the expected value of the complete log-posterior
Q(\beta, \sigma^2 \mid \beta_{(t)}, \sigma^2_{(t)}) = \int \log p(\beta, \sigma^2 \mid y, \tau)\, p(\tau \mid \beta_{(t)}, \sigma^2_{(t)}, y)\, d\tau \qquad (31)
46 / 66
M-step
Updates the parameter estimates by maximizing the Q-function
(\beta_{(t+1)}, \sigma^2_{(t+1)}) = \arg\max_{\beta, \sigma^2} Q(\beta, \sigma^2 \mid \beta_{(t)}, \sigma^2_{(t)}) \qquad (32)
47 / 66
Remark
First
The EM algorithm converges to a local maximum of the a posteriori
probability density function
p(\beta, \sigma^2 \mid y) \propto p(y \mid \beta, \sigma^2)\, p(\beta \mid \gamma) \qquad (33)
Without using the marginal prior p(β|γ), which is not Gaussian
Instead, we use the conditional Gaussian prior p(β|τ) together with the hyperprior p(τ|γ).
48 / 66
The Final Model
We have
p(y \mid \beta, \sigma^2) = \mathcal{N}(y \mid H\beta, \sigma^2 I)
p(\sigma^2) \propto \text{constant}
p(\beta \mid \tau) = \prod_{i=1}^{k} \mathcal{N}(\beta_i \mid 0, \tau_i) = \mathcal{N}\left( \beta \mid 0, \Upsilon(\tau)^{-1} \right)
p(\tau \mid \gamma) = \left( \frac{\gamma}{2} \right)^k \prod_{i=1}^{k} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\}
With \Upsilon(\tau) = \mathrm{diag}\left( \tau_1^{-1}, \ldots, \tau_k^{-1} \right), the diagonal matrix with the inverse
variances of all the β_i's.
49 / 66
Now, we find the Q function
First
\log p(\beta, \sigma^2 \mid y, \tau) \propto \log p(y \mid \beta, \sigma^2) + \log p(\beta \mid \tau)
\propto -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T \Upsilon(\tau) \beta
How can we get this?
Remember
\mathcal{N}(y \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu) \right\} \qquad (34)
Volunteers?
Please, to the blackboard.
50 / 66
Thus
Second
Did you notice that the term \beta^T \Upsilon(\tau) \beta is linear with respect to \Upsilon(\tau), and that
the other terms do not depend on τ?
Thus, the E-step reduces to the computation of the expectation of Υ(τ)
V_{(t)} = E\left[ \Upsilon(\tau) \mid y, \beta_{(t)}, \sigma^2_{(t)} \right]
= \mathrm{diag}\left( E\left[ \tau_1^{-1} \mid y, \beta_{(t)}, \sigma^2_{(t)} \right], \ldots, E\left[ \tau_k^{-1} \mid y, \beta_{(t)}, \sigma^2_{(t)} \right] \right)
51 / 66
Now
What do we need to calculate each of these expectations?
p\left( \tau_i \mid y, \beta_{(t)}, \sigma^2_{(t)} \right) = p\left( \tau_i \mid y, \beta_{i,(t)}, \sigma^2_{(t)} \right) \qquad (35)
Then
p\left( \tau_i \mid y, \beta_{i,(t)}, \sigma^2_{(t)} \right) = \frac{p\left( \tau_i, \beta_{i,(t)} \mid y, \sigma^2_{(t)} \right)}{p\left( \beta_{i,(t)} \mid y, \sigma^2_{(t)} \right)}
\propto p\left( \beta_{i,(t)} \mid \tau_i, y, \sigma^2_{(t)} \right) p\left( \tau_i \mid y, \sigma^2_{(t)} \right)
= p\left( \beta_{i,(t)} \mid \tau_i \right) p(\tau_i)
52 / 66
Here is the interesting part
We have the following probability density
p\left( \tau_i \mid y, \beta_{i,(t)}, \sigma^2_{(t)} \right) = \frac{\mathcal{N}\left( \beta_{i,(t)} \mid 0, \tau_i \right) \frac{\gamma}{2} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\}}{\int_0^{\infty} \mathcal{N}\left( \beta_{i,(t)} \mid 0, \tau_i \right) \frac{\gamma}{2} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\} d\tau_i} \qquad (36)
Then
E\left[ \tau_i^{-1} \mid y, \beta_{i,(t)}, \sigma^2_{(t)} \right] = \frac{\int_0^{\infty} \frac{1}{\tau_i} \mathcal{N}\left( \beta_{i,(t)} \mid 0, \tau_i \right) \frac{\gamma}{2} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\} d\tau_i}{\int_0^{\infty} \mathcal{N}\left( \beta_{i,(t)} \mid 0, \tau_i \right) \frac{\gamma}{2} \exp\left\{ -\frac{\gamma}{2} \tau_i \right\} d\tau_i} \qquad (37)
Now
I leave it to you to prove the value of this expectation (it can come up in the test).
53 / 66
Here is the interesting part
Thus
E\left[ \tau_i^{-1} \mid y, \beta_{i,(t)}, \sigma^2_{(t)} \right] = \frac{\sqrt{\gamma}}{\left| \beta_{i,(t)} \right|} \qquad (38)
Finally
V_{(t)} = \sqrt{\gamma}\, \mathrm{diag}\left( \left| \beta_{1,(t)} \right|^{-1}, \ldots, \left| \beta_{k,(t)} \right|^{-1} \right) \qquad (39)
54 / 66
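Equation (38) can also be verified numerically; a sketch (assuming SciPy, not part of the original slides) that evaluates the ratio of integrals in Equation (37):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Numerical check of Equation (38) at arbitrary test values.
gamma_, beta_i = 4.0, 0.7
w = lambda tau: norm.pdf(beta_i, 0.0, np.sqrt(tau)) * (gamma_ / 2.0) * np.exp(-gamma_ * tau / 2.0)

num, _ = quad(lambda tau: w(tau) / tau, 0.0, np.inf)
den, _ = quad(w, 0.0, np.inf)
print(num / den, np.sqrt(gamma_) / abs(beta_i))    # both ~2.857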
The Final Q function
Something Notable
Q\left( \beta, \sigma^2 \mid \beta_{(t)}, \sigma^2_{(t)} \right) = \int \log p(\beta, \sigma^2 \mid y, \tau)\, p\left( \tau \mid \beta_{(t)}, \sigma^2_{(t)}, y \right) d\tau
= \int \left[ -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T \Upsilon(\tau) \beta \right] p\left( \tau \mid \beta_{(t)}, \sigma^2_{(t)}, y \right) d\tau
= -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T \left[ \int \Upsilon(\tau)\, p\left( \tau \mid \beta_{(t)}, \sigma^2_{(t)}, y \right) d\tau \right] \beta
= -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)} \beta
55 / 66
Finally, the M-step
First
\sigma^2_{(t+1)} = \arg\max_{\sigma^2} \left[ -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} \right] = \frac{\|y - H\beta\|_2^2}{n}
Second
\beta_{(t+1)} = \arg\max_{\beta} \left[ -\frac{\|y - H\beta\|_2^2}{\sigma^2_{(t+1)}} - \beta^T V_{(t)} \beta \right] = \left( \sigma^2_{(t+1)} V_{(t)} + H^T H \right)^{-1} H^T y
This also I leave to you
It can come up in the test.
56 / 66
However
We need to deal in some way with the γ term
It controls the degree of sparseness!!!
We can do so by assuming a Jeffrey's prior
J. Berger, Statistical Decision Theory and Bayesian Analysis. New York:
Springer-Verlag, 1980.
We use instead
p(\tau) \propto \frac{1}{\tau} \qquad (40)
58 / 66
Properties of the Jeffrey's Prior
Important
This prior expresses ignorance with respect to scale and is parameter free.
Why scale invariant?
Imagine we change the scale of τ to τ' = Kτ, where K is a constant
expressing that change.
Thus, we have that
p(\tau') = \frac{1}{\tau'} = \frac{1}{K\tau} \propto \frac{1}{\tau} \qquad (41)
59 / 66
However
Something Notable
This prior is known as an improper prior.
In addition
This prior does not lead to a Laplacian prior on β.
Nevertheless
This prior still induces sparseness and gives good performance for the estimates of β.
60 / 66
Introducing this prior into the equations
Matrix V(t) is now
V_{(t)} = \mathrm{diag}\left( \left| \beta_{1,(t)} \right|^{-2}, \ldots, \left| \beta_{k,(t)} \right|^{-2} \right) \qquad (42)
Quite interesting!!!
We do not have the free γ parameter anymore.
61 / 66
Here, we can see the new threshold
[Figure: the thresholding rules plotted over the range −3 to 3; the dotted line is the soft threshold, the dashed line the hard threshold, and the solid blue line the rule obtained with the Jeffrey's prior.]
62 / 66
Observations
The new rule is between
The soft threshold rule.
The hard threshold rule.
Something Notable
For large values of (H^T y)_i, the new rule approaches the hard threshold.
Once (H^T y)_i gets smaller
The estimate becomes progressively smaller, approaching the behavior of
the soft rule.
63 / 66
Finally, an implementation detail
Since several elements of β will go to zero
V_{(t)} = \mathrm{diag}\left( \left| \beta_{1,(t)} \right|^{-2}, \ldots, \left| \beta_{k,(t)} \right|^{-2} \right) will have several entries going to
very large numbers.
Something Notable
If we define U_{(t)} = \mathrm{diag}\left( \left| \beta_{1,(t)} \right|, \ldots, \left| \beta_{k,(t)} \right| \right),
Then, we have that
V_{(t)} = U_{(t)}^{-1} U_{(t)}^{-1} \qquad (43)
64 / 66
Thus
We have that
\beta_{(t+1)} = \left( \sigma^2_{(t+1)} V_{(t)} + H^T H \right)^{-1} H^T y
= \left( \sigma^2_{(t+1)} U_{(t)}^{-1} U_{(t)}^{-1} + H^T H \right)^{-1} H^T y
= \left( \sigma^2_{(t+1)} U_{(t)}^{-1} I U_{(t)}^{-1} + I H^T H I \right)^{-1} H^T y
= \left( \sigma^2_{(t+1)} U_{(t)}^{-1} I U_{(t)}^{-1} + U_{(t)}^{-1} U_{(t)} H^T H U_{(t)} U_{(t)}^{-1} \right)^{-1} H^T y
= U_{(t)} \left( \sigma^2_{(t+1)} I + U_{(t)} H^T H U_{(t)} \right)^{-1} U_{(t)} H^T y
65 / 66
Advantages!!!
Quite Important
We avoid inverting the elements of β(t), several of which go to zero.
We can also avoid computing the matrix inverse explicitly
We simply solve the corresponding linear system, whose effective dimension is only
the number of nonzero elements in U(t). Why?
Remember, you want to maximize
Q\left( \beta, \sigma^2 \mid \beta_{(t)}, \sigma^2_{(t)} \right) = -n \log \sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)} \beta
66 / 66
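Putting the pieces together, here is a sketch of the EM iteration with the Jeffrey's hyperprior on synthetic data (not part of the original slides); it uses the U(t)-based update with a linear solve, and for simplicity it keeps the full k-dimensional system rather than restricting to the nonzero entries of U(t):

import numpy as np

# Sketch of sparse regression via EM with the Jeffrey's hyperprior,
# using the update beta = U (sigma^2 I + U H^T H U)^{-1} U H^T y
# with U = diag(|beta_i|), so no entry of beta is ever inverted.
rng = np.random.default_rng(3)
n, k = 100, 10
H = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[[0, 3]] = [2.0, -1.5]                     # a sparse ground truth
y = H @ beta_true + 0.1 * rng.normal(size=n)

beta = np.linalg.lstsq(H, y, rcond=None)[0]         # initialization
for _ in range(50):
    sigma2 = np.sum((y - H @ beta) ** 2) / n        # M-step for sigma^2
    U = np.diag(np.abs(beta))                       # U(t) = diag(|beta_i|)
    A = sigma2 * np.eye(k) + U @ H.T @ H @ U
    beta = U @ np.linalg.solve(A, U @ (H.T @ y))    # M-step for beta

print(np.round(beta, 3))                            # irrelevant entries shrink toward 0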

Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Effects of rheological properties on mixing
Effects of rheological properties on mixingEffects of rheological properties on mixing
Effects of rheological properties on mixingviprabot1
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIkoyaldeepu123
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 

Recently uploaded (20)

Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Effects of rheological properties on mixing
Effects of rheological properties on mixingEffects of rheological properties on mixing
Effects of rheological properties on mixing
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AI
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 

08 Machine Learning Maximum Aposteriori

  • 17. First the ML estimate — Samples: $X = \{x_i \in \{PRI, PAN\},\; i = 1, \dots, N\}$. The log-likelihood is $\log P(X|q) = \sum_{i=1}^{N}\log p(x_i|q) = \sum_{i: x_i = PRI}\log q + \sum_{i: x_i = PAN}\log(1-q) = n_{PRI}\log q + (N - n_{PRI})\log(1-q)$, where $n_{PRI}$ is the number of individuals who plan to vote PRI this fall.
  • 22. We use our classic tricks — Setting $L = \log P(X|q)$ and solving $\frac{\partial L}{\partial q} = 0$ gives $\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1-q} = 0$.
  • 25. Final Solution of ML — We get $q_{PRI} = \frac{n_{PRI}}{N}$. Thus, if $N = 20$ and 12 respondents are going to vote PRI, we get $q_{PRI} = 0.6$.
  • 27. Building the MAP estimate — Obviously we need a prior belief distribution, with the following constraints: the prior for q must be zero outside the [0, 1] interval; within [0, 1] we are free to specify our beliefs in any way we wish; and in most cases we would want a prior that peaks somewhere in [0, 1]. We assume the following: the state of Colima has traditionally voted PRI in presidential elections, but on account of the prevailing economic conditions the voters are more likely to vote PAN in the election in question.
  • 32. What prior distribution can we use? — We could use a Beta distribution parametrized by two values α and β: $P(q) = \frac{1}{B(\alpha,\beta)}\, q^{\alpha-1}(1-q)^{\beta-1}$, where $B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the beta function and Γ generalizes the factorial to the real numbers. Properties: when α, β > 1, the beta distribution has its mode (maximum value) at $\frac{\alpha-1}{\alpha+\beta-2}$.
  • 35. We then do the following — We choose α = β so that the beta prior peaks at 0.5. As a further expression of our belief we set α = β = 5. Why? Look at the variance of the beta distribution: $\mathrm{Var}(q) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.
  • 38. Thus, we have the following nice properties — With α = β = 5, $\mathrm{Var}(q) \approx 0.023$, so the standard deviation is $sd \approx 0.15$, which is a nice dispersion around the peak point.
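A quick numerical check of those numbers (a minimal sketch; it assumes SciPy/NumPy are available):

```python
from scipy.stats import beta

a, b = 5, 5
prior = beta(a, b)

mode = (a - 1) / (a + b - 2)   # mode of Beta(5, 5)
print(mode)                    # 0.5
print(prior.var())             # ~0.023
print(prior.std())             # ~0.15
```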
  • 40. Now, our MAP estimate for q — We have $q_{MAP} = \arg\max_{q}\left[\sum_{x_i \in X}\log p(x_i|q) + \log p(q)\right]$. Plugging in the log-likelihood: $q_{MAP} = \arg\max_{q}\left[n_{PRI}\log q + (N - n_{PRI})\log(1-q) + \log p(q)\right]$, where $\log p(q) = \log\left[\frac{1}{B(\alpha,\beta)}\, q^{\alpha-1}(1-q)^{\beta-1}\right]$.
  • 43. The log of P(q) — We have $\log P(q) = (\alpha-1)\log q + (\beta-1)\log(1-q) - \log B(\alpha,\beta)$. Taking the derivative of the whole objective with respect to q and setting it to zero: $\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1-q} + \frac{\alpha-1}{q} - \frac{\beta-1}{1-q} = 0$. Thus $q_{MAP} = \frac{n_{PRI} + \alpha - 1}{N + \alpha + \beta - 2}$.
  • 46. Now — With N = 20, $n_{PRI} = 12$, and α = β = 5, we get $q_{MAP} = \frac{12 + 4}{20 + 8} \approx 0.571$.
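A small sketch contrasting the two estimators on this example (only the counts above are used; the helper names are ours):

```python
def bernoulli_ml(n_success, n_trials):
    """Maximum-likelihood estimate: n_success / N."""
    return n_success / n_trials

def bernoulli_map(n_success, n_trials, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior."""
    return (n_success + alpha - 1) / (n_trials + alpha + beta - 2)

print(bernoulli_ml(12, 20))          # 0.6
print(bernoulli_map(12, 20, 5, 5))   # ~0.571, pulled toward the prior peak at 0.5
```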
  • 48. Properties — First: MAP estimation “pulls” the estimate toward the prior. Second: the more focused our prior belief, the larger the pull toward the prior. Example: if α = β is set to a large value, the MAP estimate moves closer to the prior peak.
  • 51. Properties — Third: in the expression we derived for $q_{MAP}$, the parameters α and β play a “smoothing” role vis-à-vis the measurement $n_{PRI}$. Fourth: since we refer to q as the parameter to be estimated, we can refer to α and β as the hyper-parameters of the estimation.
  • 53. Beyond simple derivation — In the previous technique we took the logarithm of the likelihood × the prior to obtain a function that can be differentiated in order to solve for each of the parameters to be estimated. What if we cannot differentiate? For example, when the objective contains a term like $|\theta_i|$. We can try the following: combine EM + MAP to estimate the sought parameters.
  • 57. Example — This application comes from “Adaptive Sparseness for Supervised Learning” by Mário A. T. Figueiredo, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, September 2003.
  • 60. Introduction: Linear Regression with a Gaussian Prior — We consider regression functions that are linear with respect to the parameter vector β: $f(x,\beta) = \sum_{i=1}^{k}\beta_i h_i(x) = \beta^T h(x)$, where $h(x) = [h_1(x),\dots,h_k(x)]^T$ is a vector of k fixed functions of the input, often called features.
  • 62. Actually, it can be... — Linear regression, in which $h(x) = [1, x_1, \dots, x_d]^T$; in this case k = d + 1. Non-linear regression via fixed basis functions, where $h(x) = [\phi_1(x), \phi_2(x), \dots, \phi_k(x)]^T$ with $\phi_1(x) = 1$. Kernel regression, where $h(x) = [1, K(x, x_1), K(x, x_2), \dots, K(x, x_n)]^T$ and $K(x, x_i)$ is some kernel function.
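A minimal sketch of the three feature maps above (NumPy assumed; the RBF kernel and the function names are our own illustrative choices):

```python
import numpy as np

def h_linear(x):
    # h(x) = [1, x_1, ..., x_d]^T, so k = d + 1
    return np.concatenate(([1.0], x))

def h_basis(x, phis):
    # phis is a list of fixed basis functions, with phis[0] = lambda x: 1.0
    return np.array([phi(x) for phi in phis])

def h_kernel(x, X_train, K):
    # h(x) = [1, K(x, x_1), ..., K(x, x_n)]^T
    return np.concatenate(([1.0], [K(x, xi) for xi in X_train]))

rbf = lambda a, b, s=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))
x = np.array([0.5, -1.0])
print(h_linear(x))
print(h_basis(x, [lambda t: 1.0, lambda t: t[0], lambda t: t[0] ** 2]))
print(h_kernel(x, [np.zeros(2), np.ones(2)], rbf))
```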
  • 66. Gaussian Noise — We assume the training set is contaminated by additive white Gaussian noise, $y_i = f(x_i,\beta) + \omega_i$ for i = 1, ..., N, where $[\omega_1,\dots,\omega_N]$ are independent zero-mean Gaussian samples with variance $\sigma^2$. Thus, for $y = [y_1,\dots,y_N]^T$, the likelihood is $p(y|\beta) = \mathcal{N}(H\beta, \sigma^2 I)$, with H the design matrix whose rows are $h(x_i)^T$. With a zero-mean, unit-variance Gaussian prior on β, the posterior $p(\beta|y)$ is still Gaussian and its mode is $\hat{\beta} = \left(\sigma^2 I + H^T H\right)^{-1}H^T y$. Remark: this is ridge regression.
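A short sketch of that closed form (NumPy assumed; `sigma2` plays the role of σ², and the unit-variance Gaussian prior is the reading of the slide above):

```python
import numpy as np

def ridge_mode(H, y, sigma2):
    """Posterior mode (sigma^2 I + H^T H)^{-1} H^T y under a N(0, I) prior on beta."""
    k = H.shape[1]
    return np.linalg.solve(sigma2 * np.eye(k) + H.T @ H, H.T @ y)

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 5))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = H @ beta_true + 0.1 * rng.normal(size=50)
print(ridge_mode(H, y, sigma2=0.01))
```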
  • 69. Regression with a Laplacian Prior — In order to favor sparse estimates, $p(\beta|\alpha) = \prod_{i=1}^{k}\frac{\alpha}{2}\exp\{-\alpha|\beta_i|\} = \left(\frac{\alpha}{2}\right)^k\exp\{-\alpha\|\beta\|_1\}$. Thus, the MAP estimate of β looks like $\hat{\beta} = \arg\min_{\beta}\|H\beta - y\|_2^2 + 2\sigma^2\alpha\|\beta\|_1$.
  • 71. Remark — This criterion is known as the LASSO. The $l_1$ norm induces sparsity in the weights. How? For example, $\|[1,0]^T\|_2 = \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_2 = 1$, whereas $\|[1,0]^T\|_1 = 1 < \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_1 = \sqrt{2}$: at equal $l_2$ length, the $l_1$ norm is smallest on the axes, so the penalty prefers solutions with exact zeros.
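A two-line numerical check of that norm comparison (NumPy assumed):

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)
print(np.linalg.norm(a, 2), np.linalg.norm(b, 2))   # 1.0 1.0
print(np.linalg.norm(a, 1), np.linalg.norm(b, 1))   # 1.0 ~1.414
```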
  • 74. An example — What if H is an orthogonal matrix? In this case $H^T H = I$, so $\hat{\beta} = \arg\min_{\beta}\|H\beta - y\|_2^2 + 2\sigma^2\alpha\|\beta\|_1 = \arg\min_{\beta}\,\beta^T H^T H\beta - 2\beta^T H^T y + y^T y + 2\sigma^2\alpha\sum_{i=1}^{k}|\beta_i| = \arg\min_{\beta}\,\beta^T\beta - 2\beta^T H^T y + y^T y + 2\sigma^2\alpha\sum_{i=1}^{k}|\beta_i|$.
  • 79. We can solve this last part as follows — The objective decouples: for each $\beta_i$ we can group the terms $\beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2\alpha|\beta_i| + y_i^2$. If we minimize each group we obtain the solution $\hat{\beta}_i = \arg\min_{\beta_i}\,\beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2\alpha|\beta_i|$. We have two cases: $\beta_i > 0$ and $\beta_i < 0$.
  • 83. If $\beta_i > 0$ — Differentiating with respect to $\beta_i$: $\frac{\partial}{\partial\beta_i}\left[\beta_i^2 - 2\beta_i(H^T y)_i + 2\sigma^2\alpha\beta_i\right] = 2\beta_i - 2(H^T y)_i + 2\sigma^2\alpha = 0$, so $\hat{\beta}_i = (H^T y)_i - \sigma^2\alpha$.
  • 85. If $\beta_i < 0$ — Differentiating with respect to $\beta_i$: $\frac{\partial}{\partial\beta_i}\left[\beta_i^2 - 2\beta_i(H^T y)_i - 2\sigma^2\alpha\beta_i\right] = 2\beta_i - 2(H^T y)_i - 2\sigma^2\alpha = 0$, so $\hat{\beta}_i = (H^T y)_i + \sigma^2\alpha$.
  • 87. The value of $(H^T y)_i$ — We have that: if $\hat{\beta}_i > 0$ then $(H^T y)_i > \sigma^2\alpha$; if $\hat{\beta}_i < 0$ then $(H^T y)_i < -\sigma^2\alpha$.
  • 88. We can put all this together — A compact version: $\hat{\beta}_i = \mathrm{sgn}\left((H^T y)_i\right)\left(\left|(H^T y)_i\right| - \sigma^2\alpha\right)_+$, with $(a)_+ = a$ if $a \ge 0$ and $0$ if $a < 0$, and “sgn” the sign function. This rule is known as the soft threshold.
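A minimal sketch of this rule (NumPy assumed; the threshold value σ²α is passed in as a single parameter `t`):

```python
import numpy as np

def soft_threshold(z, t):
    """Component-wise soft threshold: sgn(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# For an orthogonal H, the LASSO solution is the soft threshold of H^T y.
H = np.eye(3)                            # trivially orthogonal design
y = np.array([2.0, 0.3, -1.5])
print(soft_threshold(H.T @ y, t=0.5))    # [ 1.5  0.  -1. ]
```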
  • 91. Example — [Figure: the dotted line shows the soft-threshold rule plotted against the identity over the interval [−3, 3].]
  • 93. Now — Given that each $\beta_i$ has a zero-mean Gaussian prior $p(\beta_i|\tau_i) = \mathcal{N}(\beta_i|0,\tau_i)$, where $\tau_i$ has the exponential hyperprior $p(\tau_i|\gamma) = \frac{\gamma}{2}\exp\{-\frac{\gamma}{2}\tau_i\}$ for $\tau_i \ge 0$. Integrating out $\tau_i$ is a tough computation, so we take the result on faith: $p(\beta_i|\gamma) = \int_0^{\infty} p(\beta_i|\tau_i)\,p(\tau_i|\gamma)\,d\tau_i = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}|\beta_i|\}$.
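A quick numerical sanity check of that integral (a sketch; it assumes SciPy is available and picks γ = 4 and βᵢ = 0.7 arbitrarily):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

gamma, b = 4.0, 0.7

# Integrate N(b | 0, tau) * (gamma/2) * exp(-gamma*tau/2) over tau in (0, inf)
mix, _ = quad(lambda tau: norm.pdf(b, scale=np.sqrt(tau)) * (gamma / 2) * np.exp(-gamma * tau / 2),
              0, np.inf)
laplace = (np.sqrt(gamma) / 2) * np.exp(-np.sqrt(gamma) * abs(b))
print(mix, laplace)   # the two values agree to numerical precision
```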
  • 97. This is equivalent to the use of the $l_1$ norm for regularization — [Figure: the $l_1$ ball (left) promotes sparsity better than the $l_2$ ball (right).]
  • 99. The EM trick — How do we do this? By regarding $\tau = [\tau_1,\dots,\tau_k]$ as the hidden/missing data. Then, if we could observe τ, the complete log-posterior $\log p(\beta,\sigma^2|y,\tau)$ could easily be calculated: $p(\beta,\sigma^2|y,\tau) \propto p(y|\beta,\sigma^2)\,p(\beta|\tau)\,p(\sigma^2)$, where $p(y|\beta,\sigma^2) = \mathcal{N}(H\beta,\sigma^2 I)$ and $p(\beta|\tau) = \mathcal{N}(0, \mathrm{diag}(\tau_1,\dots,\tau_k))$.
  • 102. What about $p(\sigma^2)$? — We select $p(\sigma^2)$ as a constant. We could instead adopt a conjugate inverse-Gamma prior for $\sigma^2$, but for a large number of samples the influence of the prior on the estimate of $\sigma^2$ is very small. In the constant case we can use the MAP idea; however, we have hidden variables, so we resort to EM.
  • 105. E-step — Computes the expected value of the complete log-posterior: $Q(\beta,\sigma^2|\beta_{(t)},\sigma^2_{(t)}) = \int \log p(\beta,\sigma^2|y,\tau)\,p(\tau|\beta_{(t)},\sigma^2_{(t)},y)\,d\tau$.
  • 106. M-step — Updates the parameter estimates by maximizing the Q-function: $(\beta_{(t+1)},\sigma^2_{(t+1)}) = \arg\max_{\beta,\sigma^2} Q(\beta,\sigma^2|\beta_{(t)},\sigma^2_{(t)})$.
  • 107. Remark — The EM algorithm converges to a local maximum of the a posteriori probability density function $p(\beta,\sigma^2|y) \propto p(y|\beta,\sigma^2)\,p(\beta|\gamma)$, without ever handling the marginal prior $p(\beta|\gamma)$, which is not Gaussian; instead we work with the conditional Gaussian prior $p(\beta|\tau)$.
  • 109. The Final Model — We have $p(y|\beta,\sigma^2) = \mathcal{N}(H\beta,\sigma^2 I)$; $p(\sigma^2) \propto$ constant; $p(\beta|\tau) = \prod_{i=1}^{k}\mathcal{N}(\beta_i|0,\tau_i) = \mathcal{N}\left(\beta|0, \Upsilon(\tau)^{-1}\right)$; and $p(\tau|\gamma) = \left(\frac{\gamma}{2}\right)^k\prod_{i=1}^{k}\exp\{-\frac{\gamma}{2}\tau_i\}$, where $\Upsilon(\tau) = \mathrm{diag}(\tau_1^{-1},\dots,\tau_k^{-1})$ is the diagonal matrix with the inverse variances of all the $\beta_i$'s.
  • 114. Now, we find the Q function — First, $\log p(\beta,\sigma^2|y,\tau) \propto \log p(y|\beta,\sigma^2) + \log p(\beta|\tau) \propto -n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T\Upsilon(\tau)\beta$. How can we get this? Remember that $\mathcal{N}(y|\mu,\Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\right\}$. Volunteers? Please, to the blackboard.
  • 117. Thus — Did you notice that the term $\beta^T\Upsilon(\tau)\beta$ is linear with respect to $\Upsilon(\tau)$, and that the other terms do not depend on τ? Thus, the E-step reduces to computing the expectation of $\Upsilon(\tau)$: $V_{(t)} = E\left[\Upsilon(\tau)|y,\beta_{(t)},\sigma^2_{(t)}\right] = \mathrm{diag}\left(E[\tau_1^{-1}|y,\beta_{(t)},\sigma^2_{(t)}],\dots,E[\tau_k^{-1}|y,\beta_{(t)},\sigma^2_{(t)}]\right)$.
  • 120. Now — What do we need to calculate each of these expectations? Note that $p(\tau_i|y,\beta_{(t)},\sigma^2_{(t)}) = p(\tau_i|\beta_{i,(t)})$, since $p(\tau_i|y,\beta_{i,(t)},\sigma^2_{(t)}) = \frac{p(\tau_i,\beta_{i,(t)}|y,\sigma^2_{(t)})}{p(\beta_{i,(t)}|y,\sigma^2_{(t)})} \propto p(\beta_{i,(t)}|\tau_i,y,\sigma^2_{(t)})\,p(\tau_i|y,\sigma^2_{(t)}) = p(\beta_{i,(t)}|\tau_i)\,p(\tau_i)$.
  • 124. Here is the interesting part — We have the probability density $p(\tau_i|y,\beta_{i,(t)},\sigma^2_{(t)}) = \frac{\mathcal{N}(\beta_{i,(t)}|0,\tau_i)\,\frac{\gamma}{2}\exp\{-\frac{\gamma}{2}\tau_i\}}{\int_0^{\infty}\mathcal{N}(\beta_{i,(t)}|0,\tau_i)\,\frac{\gamma}{2}\exp\{-\frac{\gamma}{2}\tau_i\}\,d\tau_i}$. Then $E\left[\tau_i^{-1}|y,\beta_{i,(t)},\sigma^2_{(t)}\right] = \frac{\int_0^{\infty}\frac{1}{\tau_i}\,\mathcal{N}(\beta_{i,(t)}|0,\tau_i)\,\frac{\gamma}{2}\exp\{-\frac{\gamma}{2}\tau_i\}\,d\tau_i}{\int_0^{\infty}\mathcal{N}(\beta_{i,(t)}|0,\tau_i)\,\frac{\gamma}{2}\exp\{-\frac{\gamma}{2}\tau_i\}\,d\tau_i}$. I leave the evaluation to you (it can come in the test).
  • 127. Here is the interesting part — Thus $E\left[\tau_i^{-1}|y,\beta_{i,(t)},\sigma^2_{(t)}\right] = \frac{\sqrt{\gamma}}{|\beta_{i,(t)}|}$, and finally $V_{(t)} = \sqrt{\gamma}\,\mathrm{diag}\left(|\beta_{1,(t)}|^{-1},\dots,|\beta_{k,(t)}|^{-1}\right)$.
  • 129. The Final Q function — Something notable: $Q(\beta,\sigma^2|\beta_{(t)},\sigma^2_{(t)}) = \int \log p(\beta,\sigma^2|y,\tau)\,p(\tau|\beta_{(t)},\sigma^2_{(t)},y)\,d\tau = -n\log\sigma^2 - \frac{\|y-H\beta\|_2^2}{\sigma^2} - \beta^T\left[\int \Upsilon(\tau)\,p(\tau|\beta_{(t)},\sigma^2_{(t)},y)\,d\tau\right]\beta = -n\log\sigma^2 - \frac{\|y-H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)}\beta$.
  • 133. Finally, the M-step — First, $\sigma^2_{(t+1)} = \arg\max_{\sigma^2}\left[-n\log\sigma^2 - \frac{\|y-H\beta\|_2^2}{\sigma^2}\right] = \frac{\|y-H\beta\|_2^2}{n}$. Second, $\beta_{(t+1)} = \arg\max_{\beta}\left[-\frac{\|y-H\beta\|_2^2}{\sigma^2_{(t+1)}} - \beta^T V_{(t)}\beta\right] = \left(\sigma^2_{(t+1)}V_{(t)} + H^T H\right)^{-1}H^T y$. This also I leave to you; it can come in the test.
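Putting the E- and M-steps together, here is a compact sketch of the whole iteration (our own NumPy code, not the paper's implementation; it is written in terms of the Laplacian parameter α, uses an E-step with $E[\tau_i^{-1}|\beta_i] \propto 1/|\beta_i|$, and adds a small `eps` to avoid division by zero):

```python
import numpy as np

def sparse_em(H, y, alpha, n_iter=50, eps=1e-8):
    """EM sketch for sparse regression with a Laplacian prior of parameter alpha."""
    beta = np.linalg.lstsq(H, y, rcond=None)[0]   # ordinary least-squares start
    sigma2 = np.mean((y - H @ beta) ** 2)
    for _ in range(n_iter):
        v = alpha / (np.abs(beta) + eps)          # E-step: expected inverse variances
        sigma2 = np.mean((y - H @ beta) ** 2)     # M-step, part 1: noise variance
        # M-step, part 2: ridge-like solve with the diagonal matrix V_(t)
        beta = np.linalg.solve(sigma2 * np.diag(v) + H.T @ H, H.T @ y)
    return beta, sigma2

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[[0, 3]] = [2.0, -1.5]
y = H @ beta_true + 0.1 * rng.normal(size=100)
beta_hat, _ = sparse_em(H, y, alpha=10.0)
print(np.round(beta_hat, 3))   # small coefficients are driven toward zero
```

For an orthogonal H this iteration settles on the soft-threshold solution derived earlier, which is why it can be read as an EM route to the same MAP estimate.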
  • 139. However — We still need to deal in some way with the γ term: it controls the degree of sparseness. We can do so by assuming a Jeffreys prior (J. Berger, Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag, 1980), using instead $p(\tau) \propto \frac{1}{\tau}$.
  • 142. Properties of the Jeffreys Prior — Important: this prior expresses ignorance with respect to scale and is parameter-free. Why scale invariant? Imagine we change the scale of τ to $\tau' = K\tau$, where K is a constant expressing that change. Then $p(\tau') = \frac{1}{\tau'} = \frac{1}{K\tau} \propto \frac{1}{\tau}$, so the form of the prior is unchanged.
  • 145. However — Something notable: this prior is an improper prior. In addition, it does not lead to a Laplacian prior on β. Nevertheless, it induces sparseness and good performance for the estimated β.
  • 148. Introducing this prior into the equations — The matrix $V_{(t)}$ is now $V_{(t)} = \mathrm{diag}\left(|\beta_{1,(t)}|^{-2},\dots,|\beta_{k,(t)}|^{-2}\right)$. Quite interesting: we no longer have the free γ parameter.
  • 150. Here, we can see the new threshold — [Figure over [−3, 3]: dotted line = soft threshold, dashed line = hard threshold, solid blue line = the rule induced by the Jeffreys prior.]
  • 151. Observations — The new rule lies between the soft threshold rule and the hard threshold rule. Something notable: for large values of $|(H^T y)_i|$ the new rule approaches the hard threshold, and as $|(H^T y)_i|$ gets smaller the estimate becomes progressively smaller, approaching the behavior of the soft rule.
  • 154. Finally, an implementation detail — Since several elements of β will go to zero, $V_{(t)} = \mathrm{diag}\left(|\beta_{1,(t)}|^{-2},\dots,|\beta_{k,(t)}|^{-2}\right)$ will have several entries blowing up to very large numbers. Something notable: if we define $U_{(t)} = \mathrm{diag}\left(|\beta_{1,(t)}|,\dots,|\beta_{k,(t)}|\right)$, then $V_{(t)} = U_{(t)}^{-1}U_{(t)}^{-1}$.
  • 157. Thus — We have that $\beta_{(t+1)} = \left(\sigma^2_{(t+1)}V_{(t)} + H^T H\right)^{-1}H^T y = \left(\sigma^2_{(t+1)}U_{(t)}^{-1}U_{(t)}^{-1} + H^T H\right)^{-1}H^T y = \left(U_{(t)}^{-1}\left[\sigma^2_{(t+1)}I + U_{(t)}H^T H U_{(t)}\right]U_{(t)}^{-1}\right)^{-1}H^T y = U_{(t)}\left(\sigma^2_{(t+1)}I + U_{(t)}H^T H U_{(t)}\right)^{-1}U_{(t)}H^T y$.
  • 162. Advantages — Quite important: we avoid inverting the elements of $\beta_{(t)}$. We can also avoid computing the matrix inverse explicitly: we simply solve the corresponding linear system, whose dimension is only the number of nonzero elements in $U_{(t)}$. Why does this suffice? Remember that we want to maximize $Q(\beta,\sigma^2|\beta_{(t)},\sigma^2_{(t)}) = -n\log\sigma^2 - \frac{\|y-H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)}\beta$.
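As a closing sketch, the numerically stable update from the last slides, written in NumPy (our own code; the pruning threshold `tol` is an assumption, not something the slides specify):

```python
import numpy as np

def jeffreys_update(H, y, beta, sigma2, tol=1e-6):
    """One beta-update U (sigma^2 I + U H^T H U)^{-1} U H^T y, restricted to the
    coordinates whose current |beta_i| is still above tol (the rest stay at zero)."""
    new_beta = np.zeros_like(beta)
    active = np.abs(beta) > tol                  # nonzero entries of U
    U = np.diag(np.abs(beta[active]))
    Ha = H[:, active]
    A = sigma2 * np.eye(U.shape[0]) + U @ Ha.T @ Ha @ U
    new_beta[active] = U @ np.linalg.solve(A, U @ (Ha.T @ y))
    return new_beta

rng = np.random.default_rng(2)
H = rng.normal(size=(80, 8))
beta = rng.normal(size=8)
beta[[2, 5]] = 0.0                               # pretend two coefficients already vanished
y = H @ beta + 0.05 * rng.normal(size=80)
print(jeffreys_update(H, y, beta, sigma2=0.01))
```

Because U has zero diagonal entries for the vanished coefficients, restricting the solve to the active block gives exactly the same result as the full formula while keeping the linear system small.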