Project 2: The EM Algorithm
0. The expression of π_{ik}
For a distribution that belongs to the exponential family, we have:

$$P_\theta(x, z) = e^{\theta^T S(x,\pi) - \log C(\theta)}\, h(x, \pi) \quad (1)$$

Taking the logarithm of Eq. (1), we have:

$$\log P_\theta(x, z) = \theta^T S(x, \pi) - \log C(\theta) + \log h(x, \pi) \quad (2)$$
Taking the conditional expectation given the sample (fixed) data x and the current estimate of the parameters θ^(t), we obtain the function Q(θ, θ^(t)) of the E step of the EM algorithm. We can then maximize Q(θ, θ^(t)) with respect to each parameter to obtain the optimal value of θ for the next step. This maximization can be achieved by differentiating Q(θ, θ^(t)) with respect to each θ_i.
$$Q(\theta, \theta^{(t)}) = E_{\theta^{(t)}}[\log P_\theta(X, \Psi) \mid X = x] = \theta^T E_{\theta^{(t)}}[S(X, \Psi) \mid X = x] - \log C(\theta) + \log h(x, \pi) \quad (3)$$

$$\frac{\partial}{\partial \theta_i} Q(\theta, \theta^{(t)}) = E_{\theta^{(t)}}[S_i(X, \Psi) \mid X = x] - E_\theta[S_i(X, \Psi)] \quad (4)$$
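The second term on the right-hand side of Eq. (4) comes from a standard exponential-family identity; assuming $C(\theta)$ is the normalizer $C(\theta) = \int e^{\theta^T S(x,\pi)} h(x,\pi)\,dx$, differentiating gives

$$\frac{\partial \log C(\theta)}{\partial \theta_i} = \frac{\int S_i(x,\pi)\, e^{\theta^T S(x,\pi)}\, h(x,\pi)\, dx}{\int e^{\theta^T S(x,\pi)}\, h(x,\pi)\, dx} = E_\theta[S_i(X, \Psi)],$$

which is exactly the expectation subtracted in Eq. (4).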
Setting this derivative to zero gives Eq. (5). Eq. (5) can be rewritten as Eq. (6) to give a clear prescription for the updates in the M step.

$$E_{\theta^{(t)}}[S_i(X, \Psi) \mid X = x] = E_\theta[S_i(X, \Psi)] \quad (5)$$

$$E_{\theta^{(t-1)}}[S_i(X, \Psi) \mid X = x] = E_{\theta^{(t)}}[S_i(X, \Psi)] \quad (6)$$

Here S_i is a sufficient statistic, Ψ is the latent variable, and π is its estimate. In each iteration, θ^(t) is solved from θ^(t−1) and the given data.
For the mixed normal model, we have

$$P(\theta, \Psi \mid X) = \prod_{i=1}^N P_\theta(x_i, \pi_i) = \exp\Big\{ \sum_{k=1}^K \beta_k \Big(\sum_{i=1}^N z_{ik}\Big) - \frac{1}{2} \sum_{k=1}^K \Big(\sum_{i=1}^N z_{ik}\, x_i^T \Sigma_k^{-1} x_i\Big) + \sum_{k=1}^K \sum_{i=1}^N z_{ik}\, x_i^T \Sigma_k^{-1} \mu_k \Big\} \quad (7)$$
$$\beta_k = \log \pi_k + \frac{1}{2} \log |\Sigma_k^{-1}| - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k \quad (8)$$
Thus we can read off the sufficient statistics:

$$\sum_{i=1}^N \Psi_{ik}, \qquad \sum_{i=1}^N \Psi_{ik} X_i X_i^T, \qquad \sum_{i=1}^N \Psi_{ik} X_i^T \quad (9)$$
Plugging each of the above sufficient statistics into Eq. (6) should yield the desired results. However, here I want to derive the updating equations directly. The log-likelihood can be written as

$$\log P(\theta, \Psi \mid X) = \sum_{i=1}^N \sum_{k=1}^K \Big(\beta_k - \frac{1}{2}\, x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big) z_{ik} \quad (10)$$
where z_{ik} = I(x_i ∈ C_k). Then we can calculate Q(θ, θ^(t)) as follows:

$$Q(\theta, \theta^{(t)}) = E_{\Psi \mid \theta^{(t)}}[\log P(\theta, \Psi \mid X)] = \sum_{i=1}^N \sum_{k=1}^K \Big(\beta_k - \frac{1}{2}\, x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big)\, E_{\Psi \mid \theta^{(t)}}[\delta(x_i \in C_k)] \quad (11)$$
where

$$E_{\Psi \mid \theta^{(t)}}[\delta(x_i \in C_k)] = P(x_i \in C_k \mid \theta^{(t)}, X_i) = \frac{N(X_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})\, \pi_k^{(t)}}{\sum_{k'=1}^K N(X_i \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})\, \pi_{k'}^{(t)}} = \pi_{ik} \quad (12)$$
Then we have

$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^N \sum_{k=1}^K \Big(\beta_k - \frac{1}{2}\, x_i^T \Sigma_k^{-1} x_i + x_i^T \Sigma_k^{-1} \mu_k\Big)\, \pi_{ik} \quad (13)$$
1. Derivation of µ_k.

The µ_k^(t) can be obtained by solving

$$\frac{\partial Q(\theta, \theta^{(t)})}{\partial \mu_k} = 0 \quad (14)$$
This yields

$$\sum_{i=1}^N \sum_{k'=1}^K \frac{\partial \big[(x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'})\, \pi_{ik'}\big]}{\partial \mu_k} = 0 \quad (15)$$

or

$$\sum_{i=1}^N \frac{\partial \big[(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\, \pi_{ik}\big]}{\partial \mu_k} = 0 \quad (16)$$
Using the vector-calculus identity ∂[(x − µ)ᵀA(x − µ)]/∂µ = −2A(x − µ) for symmetric A, and multiplying through by Σ_k, the above equation simplifies to

$$\sum_{i=1}^N (x_i - \mu_k)\, \pi_{ik} = 0 \quad (17)$$
which ultimately gives

$$\mu_k^{(t)} = \frac{\sum_{i=1}^N \pi_{ik}^{(t)}\, x_i}{\sum_{i=1}^N \pi_{ik}^{(t)}} \quad (18)$$
2. Derivation of Σ_k.

The Σ_k^(t) can be obtained by solving

$$\frac{\partial Q(\theta, \theta^{(t)})}{\partial \Sigma_k} = 0 \quad (19)$$
which is equivalent to

$$\sum_{i=1}^N \sum_{k'=1}^K \frac{\partial \big[\big(\log|\Sigma_{k'}^{-1}| - (x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'})\big)\, \pi_{ik'}\big]}{\partial \Sigma_k} = 0 \quad (20)$$

or

$$\sum_{i=1}^N \frac{\partial \big[\big(\log|\Sigma_k^{-1}| - (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\big)\, \pi_{ik}\big]}{\partial \Sigma_k} = 0 \quad (21)$$
Using ∂log|Σ_k^{-1}|/∂Σ_k^{-1} = Σ_k and ∂[(x − µ)ᵀΣ_k^{-1}(x − µ)]/∂Σ_k^{-1} = (x − µ)(x − µ)ᵀ (it is more convenient to differentiate with respect to Σ_k^{-1}), this can be simplified into

$$\sum_{i=1}^N \big[\Sigma_k - (x_i - \mu_k)(x_i - \mu_k)^T\big]\, \pi_{ik} = 0 \quad (22)$$
which ultimately gives

$$\Sigma_k^{(t)} = \frac{\sum_{i=1}^N \pi_{ik}^{(t)}\, (x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})^T}{\sum_{i=1}^N \pi_{ik}^{(t)}} \quad (23)$$
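The full E/M cycle of Eqs. (12), (18), and (23) can be sketched in code. The sketch below is a minimal one-dimensional illustration, not the project's actual bivariate implementation: the covariance update of Eq. (23) reduces to a scalar variance, and the mixing-weight update π_k = Σ_i π_ik / N follows from the first sufficient statistic in Eq. (9).

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.
    E step: responsibilities pi_ik (Eq. 12); M step: means (Eq. 18),
    variances (scalar analogue of Eq. 23), weights (first statistic of Eq. 9)."""
    mu = [min(xs), max(xs)]   # crude but adequate initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    loglik = []
    for _ in range(n_iter):
        # E step: responsibilities of each component for each point
        resp = []
        ll = 0.0
        for x in xs:
            dens = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        loglik.append(ll)
        # M step: closed-form updates from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            w[k] = nk / len(xs)
    return w, mu, var, loglik
```

On two well-separated clusters the fitted means land near the true component means, and the log-likelihood trace is non-decreasing across iterations, which is a useful sanity check on any EM implementation.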
3. Statistical Analysis.

To study the performance of the EM method, we first need to generate different types of training (sample) data. These data can be drawn from different mixture models (different π_i, µ_i, Σ_i), and the sample size can also vary. The design is described in Section 3.1.

To measure the performance, two quantities are of interest. The first is the absolute deviation of the converged value produced by the EM algorithm, namely Dev(π̂1) = |π̂1 − π1|. (Here we consider a computation to have converged if π̂1[t] = π̂1[t − 9], i.e., ten iterations in a row yield the same estimated value.) The second is the total number of iterations needed to reach convergence. In this project, every computational job takes fewer than 500 iterations and finishes in less than 3 minutes. Since time efficiency is not a problem for the designed computational experiment, we mainly focus on the absolute deviation of the estimates.
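The stopping rule above is easy to state in code. The sketch below is a hypothetical helper (not the project's actual script) that checks whether the last ten recorded estimates of π̂1 are identical, exactly as the criterion π̂1[t] = π̂1[t − 9] requires; in floating point one would usually compare within a tolerance rather than test exact equality.

```python
def has_converged(history, window=10):
    """True when the last `window` estimates of pi1 are all identical,
    i.e. pi1[t] == pi1[t-9] has held for ten iterations in a row."""
    if len(history) < window:
        return False
    tail = history[-window:]
    return all(v == tail[0] for v in tail)
```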
3.1 Experimental Design

To statistically analyze the performance of the EM method, I designed a full-factorial experiment with four main factors. The first factor (A) is the sample size; the second factor (B) is the weights (π1, π2) for sampling from the first or the second Gaussian distribution; the third factor (C) specifies µ1, µ2 for the two Gaussian distributions; and the fourth factor (D) specifies Σ1, Σ2 for the two Gaussian distributions. Computational experiments with all combinations of A, B, C, and D are investigated, as shown in Table 2. From Run 1 to Run 16, each combination of the four factors is investigated. For each run of the experiment, 6 replications are generated.
Factor | Meaning | − | +
A | data sample size | 100 | 50
B | π1, π2 | 0.1, 0.9 | 0.5, 0.5
C | µ1, µ2 | (1, 2)ᵀ, (2, 1)ᵀ | (1, 5)ᵀ, (5, 1)ᵀ
D | Σ1, Σ2 | both (3 3; 3 5) | both (3 −3; −3 5)

Table 1: Factors and levels of the four-factor full factorial experiment design.
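Training data for any run can be generated by first picking a component with probability (π1, π2) and then drawing from the chosen bivariate normal. Below is a minimal sketch (a hypothetical generator written for illustration, not the project's actual code), with the 2×2 Cholesky factor of the covariance computed by hand.

```python
import math
import random

def sample_mixture(n, pis, mus, sigmas, seed=0):
    """Draw n points from a two-component bivariate normal mixture.
    pis: mixing weights; mus: mean pairs; sigmas: 2x2 covariance matrices."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        k = 0 if rng.random() < pis[0] else 1
        (a, b), (_, c) = sigmas[k]
        # Cholesky factor of [[a, b], [b, c]]: L = [[l11, 0], [l21, l22]]
        l11 = math.sqrt(a)
        l21 = b / l11
        l22 = math.sqrt(c - l21 ** 2)
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mus[k][0] + l11 * z1,
                       mus[k][1] + l21 * z1 + l22 * z2))
    return points

# Run 1 settings: n = 100, (pi1, pi2) = (0.1, 0.9), the "-" levels of C and D
run1 = sample_mixture(100, (0.1, 0.9), ((1, 2), (2, 1)),
                      ([[3, 3], [3, 5]], [[3, 3], [3, 5]]))
```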
With different combinations of the factors B, C, and D, we can first examine what the mixture distributions look like. Figure 1 shows the contour plots of the mixture distribution for Run 1 to Run 4, where π1 = 0.1, π2 = 0.9. Figure 2 shows the contour plots of the distribution for Run 5 to Run 8, where π1 = 0.5, π2 = 0.5. It is easy to see that only the distribution in the third plot of Figure 2 shows two distinct modes. We expect the EM algorithm to function well in this case. For the other cases, we need to run the code in order to reach an accurate conclusion.
3.2 Experiment Results and Analysis

In each computation, the converged π̂1, π̂2, µ̂1, µ̂2, and Σ̂1, Σ̂2 are obtained. However, it is not necessary to check every one of them to study the performance of the EM algorithm. (When the value of one parameter stabilizes, the values of all other parameters stabilize.) In Table 2, I recorded the absolute deviation of the estimated π̂1 for each replicate, defined as |π̂1j − π1|, j = 1, ..., 6. Averaging the deviations over the 6 replicates, we obtain the average absolute deviation, defined by Dev(π1) = (1/6) Σ_{j=1}^{6} |π̂1j − π1|, where j is the index of each replicate in a run.

Looking at the Dev(π1) values for each run in Table 2, we can easily see that the estimates are sometimes unacceptable. For example, Run 2, Run 5, Run 6, Run 8, Run 9, Run 10, Run 13, and Run 16 all have Dev(π1) larger than 0.1. Since the estimates in half of the runs are not acceptable, the validity of the EM method depends largely on the distribution the training data are drawn from, as well as on the size of the training data. In the following, we study the main effects of the four factors (related to the mixture-model distribution and the sample size) on the Dev(π1) obtained by the EM method.
From the main effects plot in Figure 3, we find that factor C shows the largest main effect. Given data from mixture distributions with means (1, 5)ᵀ, (5, 1)ᵀ, the average Dev(π1) becomes much smaller than when the data are drawn from distributions with means (1, 2)ᵀ, (2, 1)ᵀ.
Run A B C D Rep1 Rep2 Rep3 Rep4 Rep5 Rep6 Dev(π1)
1 - - - - 0.04476 0.16758 0.02807 0.01891 0.07494 0.08944 0.07061
2 - - - + 0.04059 0.07746 0.07168 0.21808 0.04810 0.16695 0.10381
3 - - + - 0.01506 0.00005 0.00000 0.00014 0.00000 0.00149 0.00279
4 - - + + 0.02462 0.06744 0.06470 0.01486 0.00929 0.02744 0.03472
5 - + - - 0.18725 0.13970 0.20257 0.07702 0.08960 0.04289 0.12317
6 - + - + 0.32204 0.38365 0.05723 0.19782 0.14263 0.10631 0.20161
7 - + + - 0.00014 0.00172 0.00070 0.00001 0.00000 0.00094 0.00059
8 - + + + 0.04974 0.15832 0.01410 0.12716 0.12529 0.13536 0.10166
9 + - - - 0.56811 0.56536 0.18880 0.23188 0.04571 0.10274 0.28377
10 + - - + 0.20829 0.49863 0.33953 0.39940 0.01038 0.09023 0.25774
11 + - + - 0.00000 0.00007 0.00000 0.00000 0.00001 0.00050 0.00010
12 + - + + 0.00617 0.04590 0.03875 0.03190 0.04809 0.01020 0.03017
13 + + - - 0.15656 0.30134 0.03449 0.04952 0.23108 0.25004 0.17051
14 + + - + 0.26834 0.06553 0.04031 0.12802 0.03216 0.00339 0.08962
15 + + + - 0.00000 0.00118 0.00000 0.00013 0.00004 0.01903 0.00340
16 + + + + 0.13104 0.21445 0.15344 0.30326 0.26638 0.01643 0.18083
Table 2: The four-factor full factorial design and the experiment results. The values in the Rep1 to Rep6 columns are the absolute deviations (|π̂1j − π1|, j = 1, ..., 6) calculated for the different replicates. The last column gives the average of the absolute deviations in each run (denoted Dev(π1)).
Simply put, the farther apart the two mean vectors are, the more accurate (smaller average Dev(π1)) the EM estimates become. Similarly, for factor A, a larger sample size yields higher accuracy. And for factor D, covariance shapes (for each single normal component) that increase the overlap of the two normal distributions make the estimation more difficult, i.e., produce a larger Dev(π1) from the EM algorithm (compare the 3rd and 4th plots of Figure 2).
Next, we can study the interactions of the different factors. Figure 4 shows the interaction plots for the 6 pairs of factors. Interestingly, A (sample size) and B (π1, π2) are antagonistic; all other pairs show a synergistic relationship. As calculated by linear regression, the two-factor interaction effects are:
INT(A,B) = −0.042814,
INT(A,C) = −0.028462,
INT(A,D) = −0.018004,
INT(B,C) = 0.043716,
INT(B,D) = 0.025862,
INT(C,D) = 0.041974.
Since none of these interaction effects looks clearly significant, I want to make a thorough study of which effects are significant. The coefficients of the regression y_i = β0 + Σ_{j=1}^{15} β_j x_{ij} + ε_i (where x_{i1} = A_i, x_{i2} = B_i, x_{i3} = C_i, x_{i4} = D_i, x_{i5} = A_iB_i, x_{i6} = A_iC_i, x_{i7} = A_iD_i, x_{i8} = B_iC_i, x_{i9} = B_iD_i, x_{i10} = C_iD_i, x_{i11} = A_iB_iC_i, x_{i12} = A_iB_iD_i, x_{i13} = A_iC_iD_i, x_{i14} = B_iC_iD_i, x_{i15} = A_iB_iC_iD_i) give all 15 effects (with ±1 coding, each effect is twice the corresponding coefficient).
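With ±1 coding, each effect is simply the average Dev(π1) at the + level minus the average at the − level of the corresponding contrast. The sketch below recomputes the effects from the run averages in Table 2 (the sign pattern follows the run order of the table, with D varying fastest and A slowest):

```python
import math

# Average Dev(pi1) for Runs 1..16, read off the last column of Table 2.
dev = [0.07061, 0.10381, 0.00279, 0.03472, 0.12317, 0.20161, 0.00059,
       0.10166, 0.28377, 0.25774, 0.00010, 0.03017, 0.17051, 0.08962,
       0.00340, 0.18083]

def sign(run, factor):
    """+1/-1 level of factor (0=A .. 3=D) in run (0-based); D varies fastest."""
    return 1 if (run >> (3 - factor)) & 1 else -1

def effect(factors):
    """Main or interaction effect: mean of dev at + minus mean at -."""
    return sum(y * math.prod(sign(i, f) for f in factors)
               for i, y in enumerate(dev)) / 8.0

me_c = effect([2])           # main effect of C
int_abc = effect([0, 1, 2])  # three-factor interaction A x B x C
```

This reproduces the values listed next, e.g. ME(C) ≈ −0.118 and INT(A,B,C) ≈ 0.065 (up to the rounding of the tabulated Dev values).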
The main effects are calculated as:
ME(A) = 0.047146,
ME(B) = 0.010960,
ME(C) = −0.118324,
ME(D) = 0.043156.
The three-factor interaction effects are calculated as:
INT(A,B,C)=0.065122,
INT(A,B,D)=-0.002734,
INT(A,C,D)=0.036630,
INT(B,C,D)=0.028266.
And the four-factor interaction effect is:
INT(A,B,C,D)= 0.022292.
The five largest effects, by absolute value, are: ME(C) > INT(A,B,C) > ME(A) > INT(B,C) > ME(D).
We can also use Lenth's method to test which effects are significant. The Lenth plot with significance level α = 0.1 is shown in Figure 5. From this figure we find that, at significance level 0.1, only factor C (the separation of the two mean vectors) significantly influences the value of Dev(π1). In Figure 6, the Lenth plot with α = 0.3 is given. Here the conclusion is that there are two significant effects: factor C and the interaction of A, B, and C (the interaction of the sample size, the π_i values, and the µ_i values). One remark is that, unlike factors A, C, and D, the way we assign weights to the two normal distributions (the values of the mixing coefficients, factor B) has by itself relatively little influence on the value of Dev(π1).
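Lenth's pseudo standard error can be recomputed from the 15 effects listed above as a check on the plots. The sketch below hard-codes approximate t-quantiles for the d = 15/3 = 5 degrees of freedom that Lenth's method uses (these quantile values are assumptions taken from standard t-tables, not from the report):

```python
from statistics import median

# The 15 estimated effects listed above.
effects = {
    "A": 0.047146, "B": 0.010960, "C": -0.118324, "D": 0.043156,
    "AB": -0.042814, "AC": -0.028462, "AD": -0.018004, "BC": 0.043716,
    "BD": 0.025862, "CD": 0.041974, "ABC": 0.065122, "ABD": -0.002734,
    "ACD": 0.036630, "BCD": 0.028266, "ABCD": 0.022292,
}

def lenth_pse(values):
    """Lenth's pseudo standard error: a trimmed median of |effects|."""
    abs_e = [abs(v) for v in values]
    s0 = 1.5 * median(abs_e)
    # re-median over effects that are not clearly active
    return 1.5 * median([e for e in abs_e if e < 2.5 * s0])

pse = lenth_pse(effects.values())
t_95_5, t_85_5 = 2.015, 1.156   # approx t-quantiles for df = 5
sig_01 = [k for k, v in effects.items() if abs(v) > t_95_5 * pse]  # alpha = 0.1
sig_03 = [k for k, v in effects.items() if abs(v) > t_85_5 * pse]  # alpha = 0.3
```

This reproduces the conclusions read off the plots: only C exceeds the margin at α = 0.1, and C together with ABC at α = 0.3.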
[Figure: main effects plot; x-axis: factors A, B, C, D; y-axis: 0.04 to 0.16.]

Figure 3: The main effects plot for the four main factors: sample size, latent-parameter true values, mean true values, and covariance-matrix true values. The y-axis is the average of Dev(π̂1) at a fixed level of the given main factor.
[Figure: Lenth plot; effects A through ABCD on the x-axis, with ME and SME reference lines.]

Figure 5: Lenth plot of all the main and interaction effects with α = 0.1.
[Figure: Lenth plot; effects A through ABCD on the x-axis, with ME and SME reference lines.]

Figure 6: Lenth plot of all the main and interaction effects with α = 0.3.