Project 2. The EM Algorithm.
0. The expression of $\pi_{ik}$
For a distribution that belongs to the exponential family, we have:
$$P_\theta(x, z) = e^{\theta^T S(x,\pi) - \log C(\theta)}\, h(x, \pi) \tag{1}$$
Taking the logarithm of Eq.(1), we have:
$$\log P_\theta(x, z) = \theta^T S(x, \pi) - \log C(\theta) + \log h(x, \pi) \tag{2}$$
Taking the conditional expectation given the observed (fixed) data $x$ and the current parameter estimate $\theta^{(t)}$, we obtain $Q(\theta, \theta^{(t)})$ of the E step of the EM algorithm.
We can then maximize $Q(\theta, \theta^{(t)})$ over each parameter to obtain the optimal value of $\theta$ for the next iteration. This maximization can be achieved by differentiating $Q(\theta, \theta^{(t)})$ with respect to each $\theta_i$.
$$Q(\theta, \theta^{(t)}) = E_{\theta^{(t)}}[\log P_\theta(x, z) \mid X = x] = \theta^T E_{\theta^{(t)}}[S(x, \pi) \mid X = x] - \log C(\theta) + \log h(x, \pi) \tag{3}$$

$$\frac{\partial}{\partial \theta_i} Q(\theta, \theta^{(t)}) = E_{\theta^{(t)}}[S_i(X, \Psi) \mid X = x] - E_\theta[S_i(X, \Psi)] \tag{4}$$

Setting Eq.(4) to zero gives Eq.(5). Eq.(5) can be rewritten as Eq.(6), to give a clear prescription for the updates in the M step.

$$E_{\theta^{(t)}}[S_i(X, \Psi) \mid X = x] = E_\theta[S_i(X, \Psi)] \tag{5}$$

$$E_{\theta^{(t-1)}}[S_i(X, \Psi) \mid X = x] = E_{\theta^{(t)}}[S_i(X, \Psi)] \tag{6}$$
Here $S_i$ is a sufficient statistic, $\Psi$ is the latent parameter, and $\pi$ is its estimate. In each iteration, $\theta^{(t)}$ is solved from $\theta^{(t-1)}$ and the given data.
For the mixed normal model, we have
$$P(\theta, \Psi \mid X) = \prod_{i=1}^{N} P_\theta(x_i, \pi_i) = \exp\!\Bigg( \sum_{k=1}^{K} \beta_k \Big(\sum_{i=1}^{N} z_{ik}\Big) - \frac{1}{2} \sum_{k=1}^{K} \Big(\sum_{i=1}^{N} z_{ik}\, x_i^T \Sigma_k^{-1} x_i\Big) + \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik}\, x_i^T \Sigma_k^{-1} \mu_k \Bigg) \tag{7}$$

$$\beta_k = \log \pi_k + \frac{1}{2} \log |\Sigma_k^{-1}| - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k \tag{8}$$
Thus we can read off the sufficient statistics:
$$\sum_{i=1}^{N} \Psi_{ik}, \qquad \sum_{i=1}^{N} \Psi_{ik} X_i X_i^T, \qquad \sum_{i=1}^{N} \Psi_{ik} X_i^T \tag{9}$$
Plugging each of the above sufficient statistics into Eq.(6) should yield the desired updates. However, here I want to derive the updating equations directly. The log-likelihood can be written as
$$\log P(\theta, \Psi \mid X) = \sum_{i=1}^{N} \sum_{k=1}^{K} \Big(\beta_k - \frac{1}{2} x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big) z_{ik} \tag{10}$$
where $z_{ik} = I(x_i = C_k)$. Then we can calculate $Q(\theta, \theta^{(t)})$ as follows:
$$Q(\theta, \theta^{(t)}) = E_{\Psi|\theta^{(t)}}[\log P(\theta, \Psi \mid X)] = \sum_{i=1}^{N} \sum_{k=1}^{K} \Big(\beta_k - \frac{1}{2} x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big) E_{\Psi|\theta^{(t)}}[\delta(x_i = C_k)] \tag{11}$$
where
$$E_{\Psi|\theta^{(t)}}[\delta(x_i = C_k)] = P(x_i = C_k \mid \theta^{(t)}, X_i) = \frac{N(X_i \mid \mu_k, \Sigma_k)\, \pi_k^{(t)}}{\sum_{k'=1}^{K} N(X_i \mid \mu_{k'}, \Sigma_{k'})\, \pi_{k'}^{(t)}} = \pi_{ik} \tag{12}$$
Then we have
$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \Big(\beta_k - \frac{1}{2} x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big) \pi_{ik} \tag{13}$$
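Eq.(12) is the E step in computational form. Below is a minimal Python sketch (not the original project code; function and array names are my own) that computes the responsibilities $\pi_{ik}$ with scipy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """E step, Eq.(12): responsibilities pi_ik = P(x_i = C_k | theta^(t), x_i).

    X: (N, d) data; weights: length-K mixing coefficients;
    means: list of K mean vectors; covs: list of K covariance matrices.
    """
    N, K = X.shape[0], len(weights)
    resp = np.empty((N, K))
    for k in range(K):
        # Numerator of Eq.(12): N(x_i | mu_k, Sigma_k) * pi_k^(t)
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)  # normalize over k (the denominator)
    return resp
```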
1. Derivation of µk.
The $\mu_k^{(t)}$ can be obtained by solving

$$\frac{\partial Q(\theta, \theta^{(t)})}{\partial \mu_k} = 0 \tag{14}$$

This yields

$$\sum_{i=1}^{N} \sum_{k'=1}^{K} \frac{\partial \big[ (x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'})\, \pi_{ik'} \big]}{\partial \mu_k} = 0 \tag{15}$$

or

$$\sum_{i=1}^{N} \frac{\partial \big[ (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\, \pi_{ik} \big]}{\partial \mu_k} = 0 \tag{16}$$

By vector calculus, the above equation simplifies to

$$\sum_{i=1}^{N} (x_i - \mu_k)\, \pi_{ik} = 0 \tag{17}$$

which ultimately gives

$$\mu_k^{(t)} = \sum_{i=1}^{N} \pi_{ik}^{(t)} x_i \Big/ \sum_{i=1}^{N} \pi_{ik}^{(t)} \tag{18}$$
2. Derivation of Σk.
The $\Sigma_k^{(t)}$ can be obtained by solving

$$\frac{\partial Q(\theta, \theta^{(t)})}{\partial \Sigma_k} = 0 \tag{19}$$

which is equivalent to

$$\sum_{i=1}^{N} \sum_{k'=1}^{K} \frac{\partial \big[ \big( \log |\Sigma_{k'}^{-1}| - (x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'}) \big)\, \pi_{ik'} \big]}{\partial \Sigma_k} = 0 \tag{20}$$

or

$$\sum_{i=1}^{N} \frac{\partial \big[ \big( \log |\Sigma_k^{-1}| - (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \big)\, \pi_{ik} \big]}{\partial \Sigma_k} = 0 \tag{21}$$

This simplifies to

$$\sum_{i=1}^{N} \big[ \Sigma_k - (x_i - \mu_k)(x_i - \mu_k)^T \big]\, \pi_{ik} = 0 \tag{22}$$

which ultimately gives

$$\Sigma_k^{(t)} = \sum_{i=1}^{N} \pi_{ik}^{(t)} (x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})^T \Big/ \sum_{i=1}^{N} \pi_{ik}^{(t)} \tag{23}$$
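The two updates just derived are weighted sample means and covariances. Below is a minimal M-step sketch in Python implementing Eqs.(18) and (23), together with the standard mixing-weight update (which the report uses but does not derive); function and variable names are my own, continuing the e_step sketch above:

```python
def m_step(X, resp):
    """M step: Eqs.(18) and (23), plus the standard mixing-weight update.

    resp: (N, K) responsibilities pi_ik from the E step.
    """
    N, K = resp.shape
    Nk = resp.sum(axis=0)                     # sum_i pi_ik, one value per component
    weights = Nk / N                          # standard update: pi_k = (1/N) sum_i pi_ik
    means, covs = [], []
    for k in range(K):
        mu_k = resp[:, k] @ X / Nk[k]         # Eq.(18): weighted sample mean
        diff = X - mu_k
        cov_k = (resp[:, k][:, None] * diff).T @ diff / Nk[k]  # Eq.(23): weighted covariance
        means.append(mu_k)
        covs.append(cov_k)
    return weights, means, covs
```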
3. Statistical Analysis.
To study the performance of the EM method, we first need to generate different types of training (sample) data. These data can be drawn from different mixture models (different $\pi_i$, $\mu_i$, $\Sigma_i$), and the sample size can also vary. The design is described in Section 3.1.
To measure performance, two quantities are of interest. The first is the absolute deviation of the converged value produced by the EM algorithm, namely Dev($\hat\pi_1$) = $|\hat\pi_1 - \pi_1|$. (Here we consider a computation to have converged if $\hat\pi_1[t] = \hat\pi_1[t-9]$, i.e., ten iterations in a row yield the same estimated value.) The second is the total number of iterations needed to reach convergence. In this project, every computational job takes fewer than 500 iterations and finishes in less than 3 minutes. Since time efficiency is not a concern for the designed computational experiment, we focus mainly on the absolute deviation of the estimates.
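The convergence rule above can be coded directly. A minimal sketch, assuming the e_step and m_step sketches given earlier; the exact-equality test mirrors the rule stated above:

```python
def run_em(X, weights, means, covs, max_iter=500):
    """Iterate EM until the estimate of pi_1 is unchanged for ten iterations in a row."""
    history = []
    for t in range(max_iter):
        resp = e_step(X, weights, means, covs)       # E step, Eq.(12)
        weights, means, covs = m_step(X, resp)       # M step, Eqs.(18) and (23)
        history.append(weights[0])                   # track the estimate of pi_1
        if t >= 9 and len(set(history[-10:])) == 1:  # ten identical values in a row
            break
    return weights, means, covs, t + 1               # estimates and iteration count
```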
3.1 Experimental Design
To statistically analyze the performance of the EM method, I designed a full-factorial experiment with four main factors. The first factor (A) is the sample size; the second factor (B) is the weight $(\pi_1, \pi_2)$ for sampling from the first or the second Gaussian distribution; the third factor (C) specifies $\mu_1, \mu_2$ for the two Gaussian distributions; and the fourth factor (D) specifies $\Sigma_1, \Sigma_2$ for the two Gaussian distributions. Experiments with all combinations of A, B, C, and D are investigated, as shown in Table 2. From Run 1 to Run 16, each combination of the four factors is investigated, and for each run 6 replications are generated.
Factor                      −                                   +
A   Data sample size        100                                 50
B   π1, π2                  0.1, 0.9                            0.5, 0.5
C   µ1, µ2                  (1, 2)ᵀ, (2, 1)ᵀ                    (1, 5)ᵀ, (5, 1)ᵀ
D   Σ1, Σ2                  [3 3; 3 5], [3 3; 3 5]              [3 −3; −3 5], [3 −3; −3 5]

Table 1: Factors and levels of the four-factor full factorial experiment design.
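For reference, generating one training set under a given factor combination might look like the sketch below (function and variable names are mine; the "−" levels of Table 1 are shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, weights, means, covs):
    """Draw n points from the two-component Gaussian mixture."""
    ks = rng.choice(len(weights), size=n, p=weights)  # latent component labels z_i
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])

# '-' levels of Table 1: n = 100, (pi1, pi2) = (0.1, 0.9),
# mu1 = (1, 2), mu2 = (2, 1), Sigma1 = Sigma2 = [[3, 3], [3, 5]].
X = sample_mixture(100, [0.1, 0.9],
                   [np.array([1.0, 2.0]), np.array([2.0, 1.0])],
                   [np.array([[3.0, 3.0], [3.0, 5.0]])] * 2)
```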
With different combinations of the factors B, C, and D, we can first examine what the mixture distributions look like. Figure 1 shows the contour plots of the mixture distribution for Run 1 to Run 4, where $\pi_1 = 0.1$, $\pi_2 = 0.9$. Figure 2 shows the contour plots for Run 5 to Run 8, where $\pi_1 = 0.5$, $\pi_2 = 0.5$. Only the distribution in the third plot of Figure 2 shows two distinct modes, and we expect the EM algorithm to work well in that case. For the other cases, we need to run the code to reach an accurate conclusion.
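The contour plots in Figures 1 and 2 can be reproduced by evaluating the mixture density on a grid. A minimal matplotlib sketch (assumed, not the original plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def mixture_contour(weights, means, covs, lim=(0, 6)):
    """Contour plot of a two-component Gaussian mixture density."""
    xs = np.linspace(*lim, 200)
    xx, yy = np.meshgrid(xs, xs)
    grid = np.dstack([xx, yy])                       # (200, 200, 2) evaluation points
    pdf = sum(w * multivariate_normal.pdf(grid, m, c)
              for w, m, c in zip(weights, means, covs))
    plt.contour(xx, yy, pdf)
    plt.title("Contour Plot")
    plt.show()
```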
3.2 Experiment Results and Analysis
In each computation, the converged $\hat\pi_1$, $\hat\pi_2$, $\hat\mu_1$, $\hat\mu_2$, and $\hat\Sigma_1$, $\hat\Sigma_2$ are obtained. However, it is not necessary to check every one of them to study the performance of the EM algorithm: when the value of one parameter stabilizes, the values of all other parameters stabilize as well. In Table 2, I recorded the absolute deviation of the estimated $\hat\pi_1$ in each replicate, defined as $|\hat\pi_{1j} - \pi_1|$, $j = 1, \dots, 6$. Averaging the deviations over the 6 replicates gives the average absolute deviation, defined by $\mathrm{Dev}(\pi_1) = \frac{1}{6}\sum_{j=1}^{6} |\hat\pi_{1j} - \pi_1|$, where $j$ indexes the replicates in a run.
Looking at the Dev($\pi_1$) values for each run in Table 2, we can easily see that the estimates are sometimes unacceptable. For example, Runs 2, 5, 6, 8, 9, 10, 13, and 16 all have Dev($\pi_1$) larger than 0.1. Since the estimates in half of the runs are not acceptable, the validity of the EM method depends largely on which distribution the training data are drawn from, as well as on the size of the training data. In the following, we study the main effects of the four factors (related to the mixture distribution and the sample size) on the Dev($\pi_1$) obtained by the EM method.
From the Main Effects Plot in Figure 3, we find that factor C shows the largest main effect. Given data from mixture distributions with means $(1, 5)^T$, $(5, 1)^T$, the average Dev($\pi_1$) becomes much smaller than when the data are drawn from distributions with means $(1, 2)^T$, $(2, 1)^T$.
Run ABCD Rep1 Rep2 Rep3 Rep4 Rep5 Rep6 Dev(π1)
1 - - - - 0.04476 0.16758 0.02807 0.01891 0.07494 0.08944 0.07061
2 - - - + 0.04059 0.07746 0.07168 0.21808 0.04810 0.16695 0.10381
3 - - + - 0.01506 0.00005 0.00000 0.00014 0.00000 0.00149 0.00279
4 - - + + 0.02462 0.06744 0.06470 0.01486 0.00929 0.02744 0.03472
5 - + - - 0.18725 0.13970 0.20257 0.07702 0.08960 0.04289 0.12317
6 - + - + 0.32204 0.38365 0.05723 0.19782 0.14263 0.10631 0.20161
7 - + + - 0.00014 0.00172 0.00070 0.00001 0.00000 0.00094 0.00059
8 - + + + 0.04974 0.15832 0.01410 0.12716 0.12529 0.13536 0.10166
9 + - - - 0.56811 0.56536 0.18880 0.23188 0.04571 0.10274 0.28377
10 + - - + 0.20829 0.49863 0.33953 0.39940 0.01038 0.09023 0.25774
11 + - + - 0.00000 0.00007 0.00000 0.00000 0.00001 0.00050 0.00010
12 + - + + 0.00617 0.04590 0.03875 0.03190 0.04809 0.01020 0.03017
13 + + - - 0.15656 0.30134 0.03449 0.04952 0.23108 0.25004 0.17051
14 + + - + 0.26834 0.06553 0.04031 0.12802 0.03216 0.00339 0.08962
15 + + + - 0.00000 0.00118 0.00000 0.00013 0.00004 0.01903 0.00340
16 + + + + 0.13104 0.21445 0.15344 0.30326 0.26638 0.01643 0.18083
Table 2: The four-factor full factorial design and the experiment results. The values in the Rep1 to Rep6 columns are the absolute deviations ($|\hat\pi_{1j} - \pi_1|$, $j = 1, \dots, 6$) calculated in the different replicates. The last column gives the average of the absolute deviations in each run, denoted by Dev($\pi_1$).
Simply put, the farther apart the two mean vectors are, the more accurate (smaller average Dev($\pi_1$)) the EM estimates are. Similarly, for factor A, a larger sample size yields higher accuracy. And for factor D, distribution shapes (for each single normal distribution) that favor overlap of the two normal components make estimation harder, i.e., produce larger Dev($\pi_1$) from the EM algorithm (compare the 3rd and 4th plots of Figure 2).
Next, we can study the interactions of the factors. Figure 4 shows the interaction plots for the 6 pairs of factors. Interestingly, A (sample size) and B $(\pi_1, \pi_2)$ are antagonistic; all other pairs show a synergistic relationship. As calculated by the linear regression, the two-factor interaction effects are:
INT(A,B) = -0.042814,
INT(A,C) = -0.028462,
INT(A,D) = -0.018004,
INT(B,C) = 0.043716,
INT(B,D) = 0.025862,
INT(C,D) = 0.041974.
Since none of these interaction effects looks clearly significant, I next studied all effects systematically. Fitting the saturated regression $y_i = \beta_0 + \sum_{j=1}^{15} \beta_j x_{ij} + \varepsilon_i$ (where $x_{i1} = A_i$, $x_{i2} = B_i$, $x_{i3} = C_i$, $x_{i4} = D_i$, $x_{i5} = A_iB_i$, $x_{i6} = A_iC_i$, $x_{i7} = A_iD_i$, $x_{i8} = B_iC_i$, $x_{i9} = B_iD_i$, $x_{i10} = C_iD_i$, $x_{i11} = A_iB_iC_i$, $x_{i12} = A_iB_iD_i$, $x_{i13} = A_iC_iD_i$, $x_{i14} = B_iC_iD_i$, $x_{i15} = A_iB_iC_iD_i$) lets us calculate all 15 effects.
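In ±1 coding, each factorial effect equals twice the corresponding regression coefficient of the saturated model above. A minimal sketch of that calculation (helper names are mine; `y` is assumed to hold the 16 run-level Dev(π1) values in the run order of Table 2, where D varies fastest):

```python
import itertools
import numpy as np

# The 16-run full-factorial design in -1/+1 coding; product() varies the
# last factor (D) fastest, matching the run order of Table 2.
runs = np.array(list(itertools.product([-1, 1], repeat=4)))  # columns A, B, C, D

def all_effects(runs, y):
    """All 15 main and interaction effects: 2 * beta_j from the saturated regression."""
    labels = "ABCD"
    cols, names = [], []
    for r in range(1, 5):
        for idx in itertools.combinations(range(4), r):
            cols.append(np.prod(runs[:, list(idx)], axis=1))  # e.g. the A*B*C column
            names.append("".join(labels[i] for i in idx))
    Xmat = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    return dict(zip(names, 2 * beta[1:]))                     # effect = 2 * coefficient
```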
The main effects are calculated as:
ME(A)=0.047146,
ME(B)=0.010960,
ME(C)=-0.118324,
ME(D)=0.043156.
The three-factor interaction effects are calculated as:
INT(A,B,C)=0.065122,
INT(A,B,D)=-0.002734,
INT(A,C,D)=0.036630,
INT(B,C,D)=0.028266.
And the four-factor interaction effect is:
INT(A,B,C,D)= 0.022292.
The five largest effects, in order, are: ME(C) > INT(A,B,C) > ME(A) > INT(B,C) > ME(D).
We can also use Lenth's method to test which effects are significant. The Lenth plot with significance level $\alpha = 0.1$ is shown in Figure 5. From this figure we find that, at significance level 0.1, only factor C (the separation of the two mean vectors) significantly influences the value of Dev($\pi_1$). Figure 6 gives the Lenth plot with $\alpha = 0.3$. Here the conclusion is that there are two significant effects: factor C, and the interaction of A, B, and C (the interaction of the sample size, the $\pi_i$ values, and the $\mu_i$ values).
One remark: unlike factors A, C, and D, the way we assign weights to the normal components (the mixing coefficients, factor B) has relatively little effect on Dev($\pi_1$) by itself.
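Lenth's method compares each effect against margins built from a pseudo standard error (PSE). A minimal sketch of that computation (the critical multipliers `t_me` and `t_sme` depend on α and the number of effects and would come from Lenth's tables or simulation; they are left as inputs here):

```python
import numpy as np

def lenth_pse(effects):
    """Lenth's pseudo standard error for a vector of factorial effects."""
    effects = np.abs(np.asarray(effects))
    s0 = 1.5 * np.median(effects)
    # Re-estimate using only effects not flagged as potentially active.
    return 1.5 * np.median(effects[effects < 2.5 * s0])

def lenth_margins(effects, t_me, t_sme):
    """ME and SME cutoffs; |effect| > margin suggests a significant effect."""
    pse = lenth_pse(effects)
    return t_me * pse, t_sme * pse
```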
[Four contour-plot panels on a 0–6 × 0–6 grid: the first two use µ1 = (1, 2), µ2 = (2, 1) and the last two use µ1 = (1, 5), µ2 = (5, 1), each with σ1² = 3, σ2² = 5 and ρ1 = ±0.7745967.]
Figure 1: Contour plots of the mixture normal distributions with π1 = 0.1, π2 = 0.9.
[Four contour-plot panels on a 0–6 × 0–6 grid: the first two use µ1 = (1, 2), µ2 = (2, 1) and the last two use µ1 = (1, 5), µ2 = (5, 1), each with σ1² = 3, σ2² = 5 and ρ1 = ±0.7745967.]
Figure 2: Contour plots of the mixture normal distributions with π1 = 0.5, π2 = 0.5.
[Main effects plot for factors A, B, C, and D; the y-axis runs from 0.04 to 0.16.]
Figure 3: The main effects plot for the four main factors: sample size, latent parameter true values, mean true values, and covariance matrix true values. The y-axis is the average of Dev($\hat\pi_1$) with the given main factor fixed at each level.
[Six interaction-plot panels, one per factor pair: AB, AC, AD, BC, BD, and CD; each panel plots the mean of y at the −1 and +1 levels of the two factors.]
Figure 4: The interaction effects plots for the different combinations of main factors.
[Lenth plot at α = 0.1: the 15 effects A through ABCD against ME and SME reference lines; the y-axis runs from −0.10 to 0.10.]
Figure 5: Lenth's plot of all the main and interaction effects with α = 0.1.
[Lenth plot at α = 0.3: the 15 effects A through ABCD against ME and SME reference lines.]
Figure 6: Lenth's plot of all the main and interaction effects with α = 0.3.