Eric Xing © Eric Xing @ CMU, 2006-2010 1
Machine Learning
Mixture Model, HMM, and
Expectation Maximization
Eric Xing
Lecture 9, August 14, 2010
Reading:
Eric Xing © Eric Xing @ CMU, 2006-2010 2
Gaussian Discriminative Analysis
 (Graphical model: z_i → x_i, plate over i = 1..N.)
 Data log-likelihood
   \ell(\theta; D) = \log \prod_n p(z_n, x_n)
                   = \log \prod_n \prod_k \pi_k^{z_n^k} \, N(x_n; \mu_k, \sigma)^{z_n^k}
                   = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \, \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C
 MLE
   \hat\pi_{k,MLE} = \arg\max_\pi \ell(\theta; D)
   \hat\mu_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat\mu_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}
   \hat\sigma_{k,MLE} = \arg\max_\sigma \ell(\theta; D)
 What if we do not know z_n?
   p(y_n^k = 1 \mid x_n, \mu, \sigma)
     = \frac{\pi_k \, (2\pi\sigma_k^2)^{-1/2} \exp\{-\frac{1}{2\sigma_k^2}(x_n - \mu_k)^2\}}
            {\sum_{k'} \pi_{k'} \, (2\pi\sigma_{k'}^2)^{-1/2} \exp\{-\frac{1}{2\sigma_{k'}^2}(x_n - \mu_{k'})^2\}}
Eric Xing © Eric Xing @ CMU, 2006-2010 3
Clustering
Eric Xing © Eric Xing @ CMU, 2006-2010 4
Unobserved Variables
 A variable can be unobserved (latent) because:
 it is an imaginary quantity meant to provide some simplified and abstractive view
of the data generation process
 e.g., speech recognition models, mixture models …
 it is a real-world object and/or phenomenon, but difficult or impossible to measure
 e.g., the temperature of a star, causes of a disease, evolutionary ancestors …
 it is a real-world object and/or phenomenon, but sometimes wasn’t measured,
because of faulty sensors, or was measured with a noisy channel, etc.
 e.g., traffic radio, aircraft signal on a radar screen, …
 Discrete latent variables can be used to partition/cluster data
into sub-groups (mixture models, forthcoming).
 Continuous latent variables (factors) can be used for
dimensionality reduction (factor analysis, etc., later lectures).
Eric Xing © Eric Xing @ CMU, 2006-2010 5
Mixture Models
 A density model p(x) may be multi-modal.
 We may be able to model it as a mixture of uni-modal
distributions (e.g., Gaussians).
 Each mode may correspond to a different sub-population
(e.g., male and female).
Eric Xing © Eric Xing @ CMU, 2006-2010 6
Gaussian Mixture Models (GMMs)
 Consider a mixture of K Gaussian components:
 Z is a latent class indicator vector:
   p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}
 X is a conditional Gaussian variable with a class-specific mean/covariance:
   p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\{-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\}
 The likelihood of a sample:
   p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma)
                           = \sum_{z_n} \prod_k \big(\pi_k \, N(x_n : \mu_k, \Sigma_k)\big)^{z_n^k}
                           = \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)
   (\pi_k: mixture proportion;  N(x_n \mid \mu_k, \Sigma_k): mixture component)
 (Graphical model: Z → X.)
Eric Xing © Eric Xing @ CMU, 2006-2010 7
Gaussian Mixture Models (GMMs)
 Consider a mixture of K Gaussian components:
 This model can be used for unsupervised clustering.
 This model (fit by AutoClass) has been used to discover new kinds of stars in
astronomical data, etc.
   p(x_n \mid \mu, \Sigma) = \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)
   (\pi_k: mixture proportion;  N(x_n \mid \mu_k, \Sigma_k): mixture component)
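As a concrete illustration of the mixture density above, here is a minimal numpy sketch (not from the lecture) that evaluates p(x | π, μ, Σ) = Σ_k π_k N(x | μ_k, Σ_k). The function and variable names (gaussian_pdf, gmm_density, pis, mus, Sigmas) and the toy numbers are illustrative choices, not anything defined in the slides.

```python
# Minimal sketch (not from the lecture): evaluating the GMM density
# p(x | pi, mu, Sigma) = sum_k pi_k * N(x | mu_k, Sigma_k) with plain numpy.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_density(x, pis, mus, Sigmas):
    """Mixture density: sum over components of pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi_k * gaussian_pdf(x, mu_k, S_k)
               for pi_k, mu_k, S_k in zip(pis, mus, Sigmas))

# Toy two-component example in 2-D (illustrative numbers only).
pis    = [0.5, 0.5]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_density(np.array([0.0, 0.0]), pis, mus, Sigmas))
```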
Eric Xing © Eric Xing @ CMU, 2006-2010 8
Learning mixture models
 Given data
 Likelihood:
   L(\pi, \mu, \Sigma; D) = \prod_n p(x_n \mid \pi, \mu, \Sigma) = \prod_n \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)

   (\pi^*, \mu^*, \Sigma^*) = \arg\max \, L(\pi, \mu, \Sigma; D)
Eric Xing © Eric Xing @ CMU, 2006-2010 9
Why is Learning Harder?
 In fully observed iid settings, the log likelihood decomposes
into a sum of local terms.
 With latent variables, all the parameters become coupled
together via marginalization
   \ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

   \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)
Eric Xing © Eric Xing @ CMU, 2006-2010 10
Toward the EM algorithm
 (Graphical model: z_i → x_i, plate over i = 1..N.)
 Recall MLE for completely observed data
 Data log-likelihood
   \ell(\theta; D) = \log \prod_n p(z_n, x_n)
                   = \log \prod_n \prod_k \pi_k^{z_n^k} \, N(x_n; \mu_k, \sigma)^{z_n^k}
                   = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \, \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C
 MLE
   \hat\pi_{k,MLE} = \arg\max_\pi \ell(\theta; D)
   \hat\mu_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat\mu_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}
   \hat\sigma_{k,MLE} = \arg\max_\sigma \ell(\theta; D)
 What if we do not know z_n?
Eric Xing © Eric Xing @ CMU, 2006-2010 11
Recall K-means
 Start:
 "Guess" the centroid \mu_k and covariance \Sigma_k of each of the K clusters
 Loop
 For each point n = 1 to N,
compute its cluster label:
   z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T (\Sigma_k^{(t)})^{-1} (x_n - \mu_k^{(t)})
 For each cluster k = 1:K:
   \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}, \qquad \Sigma_k^{(t+1)} = \dots
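A minimal runnable sketch of the loop above, assuming plain Euclidean distance (i.e. identity covariance; the slide's more general assignment rule would plug in Σ_k). The function and variable names are illustrative.

```python
# Minimal K-means sketch (identity covariance assumed, i.e. Euclidean distance).
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=K, replace=False)]     # "guess" the centroids
    for _ in range(n_iters):
        # assignment step: z_n = argmin_k ||x_n - mu_k||^2
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
        z = d2.argmin(axis=1)
        # update step: mu_k = mean of the points assigned to cluster k
        new_mus = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return z, mus

# Illustrative usage on toy 2-D data drawn from two well-separated blobs.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
labels, centers = kmeans(X, K=2)
print(centers)
```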
Eric Xing © Eric Xing @ CMU, 2006-2010 12
Expectation-Maximization
 Start:
 "Guess" the centroid \mu_k and covariance \Sigma_k of each of the K clusters
 Loop
Eric Xing © Eric Xing @ CMU, 2006-2010 13
─ Expectation step: computing the expected value of the
sufficient statistics of the hidden variables (i.e., z) given
current est. of the parameters (i.e., π and µ).
 Here we are essentially doing inference
   \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
                 = \frac{\pi_k^{(t)} \, N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} \, N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}
E-step
 (Graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 14
─ Maximization step: compute the parameters under
current results of the expected value of the hidden variables
 This is isomorphic to MLE except that the variables that are hidden are
replaced by their expectations (in general they will be replaced by their
corresponding "sufficient statistics")
M-step
 (Graphical model: Z_n → X_n, plate over N.)

   \pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle, \;\; \frac{\partial}{\partial \pi_k} \langle \ell_c(\theta) \rangle = 0 \;\text{ s.t. }\; \sum_k \pi_k = 1
     \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}

   \mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle
     \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

   \Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle
     \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}
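The E-step and M-step above translate almost line-for-line into code. Below is a sketch with full covariances and plain numpy; em_gmm and its variable names are illustrative, and the small ridge added to Σ_k is a numerical safeguard not mentioned on the slide.

```python
# Sketch of EM for a Gaussian mixture, following the E-step (responsibilities tau)
# and M-step (pi, mu, Sigma updates) on the slides.
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iters=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: tau[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_i pi_i N(x_n | mu_i, Sigma_i)
        tau = np.column_stack([pis[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                               for k in range(K)])
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the expected complete log-likelihood
        Nk = tau.sum(axis=0)
        pis = Nk / N
        mus = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, tau

# Illustrative usage on toy data:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pis, mus, Sigmas, tau = em_gmm(X, K=2)
print(pis, mus, sep="\n")
```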
Eric Xing © Eric Xing @ CMU, 2006-2010 15
How is EM derived?
 A mixture of K Gaussians:
 Z is a latent class indicator vector:
   p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}
 X is a conditional Gaussian variable with a class-specific mean/covariance:
   p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\{-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\}
 The likelihood of a sample:
   p(x_n \mid \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)
 The “complete” likelihood:
   p(x_n, z_n^k = 1 \mid \mu, \Sigma) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k \, N(x_n \mid \mu_k, \Sigma_k)
   p(x_n, z_n \mid \mu, \Sigma) = \prod_k \big[ \pi_k \, N(x_n \mid \mu_k, \Sigma_k) \big]^{z_n^k}
 But this is itself a random variable! Not good as objective function
 (Graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 16
How is EM derived?
 The complete log likelihood:
   \ell_c(\theta; D) = \log \prod_n p(z_n, x_n)
                     = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \, \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C
 The expected complete log likelihood:
   \langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)}
     = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k
       \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| \big) + C
 We maximize \langle \ell_c(\theta) \rangle iteratively using the E-step/M-step procedure above.
 (Graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 17
Compare: K-means
 The EM algorithm for mixtures of Gaussians is like a "soft
version" of the K-means algorithm.
 In the K-means “E-step” we do hard assignment:
 In the K-means “M-step” we update the means as the
weighted sum of the data, but now the weights are 0 or 1:
   z_n^{(t)} = \arg\max_k \big( -(x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)}) \big)

   \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}
   \qquad\text{versus the soft EM update}\qquad
   \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}, \quad \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}
Eric Xing © Eric Xing @ CMU, 2006-2010 18
Theory underlying EM
 What are we doing?
 Recall that according to MLE, we intend to learn the model
parameter that would maximize the likelihood of the
data.
 But we do not observe z, so computing
   \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)
is difficult!
 What shall we do?
Eric Xing © Eric Xing @ CMU, 2006-2010 19
Complete & Incomplete Log
Likelihoods
 Complete log likelihood
Let X denote the observable variable(s), and Z denote the latent variable(s).
If Z could be observed, then
 Usually, optimizing lc() given both z and x is straightforward (c.f. MLE for fully
observed models).
 Recall that in this case the objective for, e.g., MLE, decomposes into a sum of
factors; the parameter for each factor can be estimated separately.
 But given that Z is not observed, lc() is a random quantity, cannot be
maximized directly.
 Incomplete log likelihood
With z unobserved, our objective becomes the log of a marginal probability:
 This objective won't decouple
   \ell_c(\theta; x, z) \;\overset{\mathrm{def}}{=}\; \log p(x, z \mid \theta)

   \ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)
Eric Xing © Eric Xing @ CMU, 2006-2010 20
Expected Complete Log
Likelihood
 For any distribution q(z), define the expected complete log likelihood:
   \langle \ell_c(\theta; x, z) \rangle_q \;\overset{\mathrm{def}}{=}\; \sum_z q(z \mid x)\, \log p(x, z \mid \theta)
 A deterministic function of θ
 Linear in \ell_c() --- inherits its factorizability
 Does maximizing this surrogate yield a maximizer of the likelihood?
 Jensen’s inequality:
   \ell(\theta; x) = \log p(x \mid \theta)
                   = \log \sum_z p(x, z \mid \theta)
                   = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}
                   \;\ge\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
   \;\Rightarrow\; \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q
Eric Xing © Eric Xing @ CMU, 2006-2010 21
Lower Bounds and Free Energy
 For fixed data x, define a functional called the free energy:
   F(q, \theta) \;\overset{\mathrm{def}}{=}\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)
 The EM algorithm is coordinate-ascent on F:
  E-step:  q^{t+1} = \arg\max_q F(q, \theta^t)
  M-step:  \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
Eric Xing © Eric Xing @ CMU, 2006-2010 22
E-step: maximization of expected
lc w.r.t. q
 Claim:
   q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)
 This is the posterior distribution over the latent variables given the data and the
parameters. Often we need this at test time anyway (e.g. to perform
classification).
 Proof (easy): this setting attains the bound \ell(\theta; x) \ge F(q, \theta):
   F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t)\, \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
                                      = \sum_z p(z \mid x, \theta^t)\, \log p(x \mid \theta^t)
                                      = \log p(x \mid \theta^t) = \ell(\theta^t; x)
 Can also show this result using variational calculus or the fact
that
   \ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big( q \,\|\, p(z \mid x, \theta) \big)
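A tiny numerical check (not in the slides) of these two facts on a made-up discrete model: F(q, θ) ≤ ℓ(θ; x) for any q, the gap equals KL(q ‖ p(z|x, θ)), and the bound is tight when q is the posterior. The joint table p_xz is invented purely for illustration.

```python
# Toy numerical check of the free-energy bound:
# F(q, theta) = sum_z q(z) log[p(x, z | theta) / q(z)] <= l(theta; x),
# with the gap equal to KL(q || p(z | x, theta)).
import numpy as np

# A made-up joint p(x, z | theta) over a binary z, for one fixed observed x.
p_xz = np.array([0.3, 0.1])          # p(x, z=0), p(x, z=1) -- illustrative numbers
p_x = p_xz.sum()                     # incomplete likelihood p(x | theta)
log_lik = np.log(p_x)                # l(theta; x)
posterior = p_xz / p_x               # p(z | x, theta)

def free_energy(q):
    return np.sum(q * np.log(p_xz / q))

def kl(q, p):
    return np.sum(q * np.log(q / p))

for q in [np.array([0.9, 0.1]), np.array([0.5, 0.5]), posterior]:
    F = free_energy(q)
    print(f"F={F:.4f}  l={log_lik:.4f}  gap={log_lik - F:.4f}  KL={kl(q, posterior):.4f}")
# The gap matches KL(q || posterior) and vanishes when q equals the posterior.
```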
Eric Xing © Eric Xing @ CMU, 2006-2010 23
E-step ≡ plug in posterior
expectation of latent variables
 Without loss of generality: assume that p(x, z | θ) is a
generalized exponential family distribution:
   p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z)\, \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}
 Special case: if p(X | Z) are GLIMs, then
   f_i(x, z) = \eta_i^T(z)\, \xi_i(x)
 The expected complete log likelihood under q^{t+1} = p(z \mid x, \theta^t)
is
   \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t)\, \log p(x, z \mid \theta) - A(\theta)
                                                  = \sum_i \theta_i\, \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)
   \overset{\text{GLIM}}{=} \sum_i \eta_i\big( \langle z \rangle_{q(z \mid x, \theta^t)} \big)^T \xi_i(x) - A(\theta)
Eric Xing © Eric Xing @ CMU, 2006-2010 24
M-step: maximization of expected
lc w.r.t. θ
 Note that the free energy breaks into two terms:
   F(q, \theta) = \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
                = \sum_z q(z \mid x)\, \log p(x, z \mid \theta) - \sum_z q(z \mid x)\, \log q(z \mid x)
                = \langle \ell_c(\theta; x, z) \rangle_q + H_q
 The first term is the expected complete log likelihood (energy) and the second
term, which does not depend on θ, is the entropy.
 Thus, in the M-step, maximizing with respect to θ for fixed q
we only need to consider the first term:
   \theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x)\, \log p(x, z \mid \theta)
 Under optimal q^{t+1}, this is equivalent to solving a standard MLE of fully observed
model p(x,z|θ), with the sufficient statistics involving z replaced by their
expectations w.r.t. p(z|x,θ).
Eric Xing © Eric Xing @ CMU, 2006-2010 25
Summary: EM Algorithm
 A way of maximizing likelihood function for latent variable
models. Finds MLE of parameters when the original (hard)
problem can be broken up into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from observed data and current
parameters.
2. Using this “complete” data, find the maximum likelihood parameter estimates.
 Alternate between filling in the latent variables using the best
guess (posterior) and updating the parameters based on this
guess:
 E-step:
 M-step:
 In the M-step we optimize a lower bound on the likelihood. In
the E-step we close the gap, making bound=likelihood.
   q^{t+1} = \arg\max_q F(q, \theta^t), \qquad \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
Eric Xing © Eric Xing @ CMU, 2006-2010 26
From static to dynamic mixture
models
Dynamic mixture: a chain of hidden states Y1 → Y2 → … → YT, each Yt emitting an observation Xt.
Static mixture: a single hidden label Y1 emitting X1, repeated over N i.i.d. samples.
The underlying source (e.g., phonemes, the choice of dice) generates the observed
sequence (e.g., the speech signal, the sequence of rolls).
Eric Xing © Eric Xing @ CMU, 2006-2010 27
Chromosomes of tumor cell:
Predicting Tumor Cell States
Eric Xing © Eric Xing @ CMU, 2006-2010 28
(Figure panels: copy number profile for chromosome 1 from the 600 MPE cell line;
copy number profile for chromosome 8 from the COLO320 cell line, with 60-70 fold
amplification of the CMYC region; copy number profile for chromosome 8 in the
MDA-MB-231 cell line, showing a deletion.)
DNA copy number aberration types in breast cancer
Eric Xing © Eric Xing @ CMU, 2006-2010 29
A real CGH run
Eric Xing © Eric Xing @ CMU, 2006-2010 30
Hidden Markov Model
 Observation space
   Alphabetic set: \mathcal{C} = \{c_1, c_2, \dots, c_K\}
   Euclidean space: \mathbb{R}^d
 Index set of hidden states
   \mathcal{I} = \{1, 2, \dots, M\}
 Transition probabilities between any two states
   p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j},
   or (y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \dots, a_{i,M}), \; \forall i \in \mathcal{I}
 Start probabilities
   (y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \dots, \pi_M)
 Emission probabilities associated with each state
   (x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \dots, b_{i,K}), \; \forall i \in \mathcal{I}
   or in general: (x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \; \forall i \in \mathcal{I}
 (Graphical model: the chain y_1 → y_2 → … → y_T, with emission x_t from each y_t;
  equivalently a state automaton over states 1, 2, …, K.)
Eric Xing © Eric Xing @ CMU, 2006-2010 31
The Dishonest Casino
A casino has two dice:
 Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
 Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back-&-forth
between fair and loaded die once every
20 turns
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die,
maybe with loaded die)
4. Highest number wins $2
Eric Xing © Eric Xing @ CMU, 2006-2010 32
States: FAIR, LOADED
Transitions: P(FAIR→FAIR) = 0.95, P(FAIR→LOADED) = 0.05,
             P(LOADED→LOADED) = 0.95, P(LOADED→FAIR) = 0.05
Emissions:
  P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2
The Dishonest Casino Model
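For the sketches that follow, it is convenient to have this model written out as arrays. This is a minimal sketch; the index conventions (state 0 = FAIR, state 1 = LOADED, observation index = die face − 1) are my own, and the ½/½ start probabilities are the ones assumed in the worked example a few slides below.

```python
# The dishonest-casino HMM as numpy arrays (conventions chosen here:
# state 0 = FAIR, state 1 = LOADED; observation index = die face - 1).
import numpy as np

pi = np.array([0.5, 0.5])                    # start probs, as in the worked example below
A = np.array([[0.95, 0.05],                  # a_{i,j} = p(y_t = j | y_{t-1} = i)
              [0.05, 0.95]])
B = np.array([[1/6] * 6,                                  # fair die emissions
              [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]])       # loaded die emissions
```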
Eric Xing © Eric Xing @ CMU, 2006-2010 33
Puzzles Regarding the Dishonest
Casino
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
 How likely is this sequence, given our model of how the casino
works?
 This is the EVALUATION problem in HMMs
 What portion of the sequence was generated with the fair die, and
what portion with the loaded die?
 This is the DECODING question in HMMs
 How “loaded” is the loaded die? How “fair” is the fair die? How often
does the casino player change from fair to loaded, and back?
 This is the LEARNING question in HMMs
Eric Xing © Eric Xing @ CMU, 2006-2010 34
Joint Probability
1245526462146146136136661664661636616366163616515615115146123562344
Eric Xing © Eric Xing @ CMU, 2006-2010 35
Probability of a Parse
 Given a sequence x = x1……xT
and a parse y = y1, ……, yT,
 To find how likely is the parse:
(given our HMM and the sequence)
p(x, y) = p(x1……xT, y1, ……, yT) (Joint probability)
= p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) … p(yT | yT-1) p(xT | yT)
= p(y1) P(y2 | y1) … p(yT | yT-1) × p(x1 | y1) p(x2 | y2) … p(xT | yT)
 Marginal probability:
 Posterior probability:
   p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})
                 = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)

   p(\mathbf{y} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{y}) \,/\, p(\mathbf{x})
Eric Xing © Eric Xing @ CMU, 2006-2010 36
Example: the Dishonest Casino
 Let the sequence of rolls be:
 x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
 Then, what is the likelihood of
 y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs a_{0,Fair} = ½, a_{0,Loaded} = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) =
½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9
Eric Xing © Eric Xing @ CMU, 2006-2010 37
Example: the Dishonest Casino
 So, the likelihood the die is fair in all this run
is just 5.21 × 10-9
 OK, but what is the likelihood of
 π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded,
Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) =
½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 0.79 × 10^-9
 Therefore, it is after all 6.59 times more likely that the die is fair
all the way, than that it is loaded all the way
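A quick sanity check of the two hand-computed parse likelihoods above, using the factorization p(x, y) = π_{y1} ∏_t a_{y_{t-1},y_t} ∏_t p(x_t | y_t) from the "Probability of a Parse" slide. The function name parse_prob and the index conventions are illustrative; the parameter arrays are re-declared so the block runs on its own.

```python
# Check the all-FAIR and all-LOADED parse likelihoods for the casino example.
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def parse_prob(x, y):
    """x: die faces (1..6), y: state path (0 = FAIR, 1 = LOADED)."""
    p = pi[y[0]] * B[y[0], x[0] - 1]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t] - 1]
    return p

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
print(parse_prob(x, [0] * 10))   # all FAIR   -> about 5.21e-9
print(parse_prob(x, [1] * 10))   # all LOADED -> about 7.9e-10
```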
Eric Xing © Eric Xing @ CMU, 2006-2010 38
Example: the Dishonest Casino
 Let the sequence of rolls be:
 x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
 Now, what is the likelihood π = F, F, …, F?
 ½ × (1/6)10 × (0.95)9 = 0.5 × 10-9, same as before
 What is the likelihood y = L, L, …, L?
½ × (1/10)4 × (1/2)6 (0.95)9 = .00000049238235134735 = 5 × 10-7
 So, it is 100 times more likely the die is loaded
Eric Xing © Eric Xing @ CMU, 2006-2010 39
Three Main Questions on HMMs
1. Evaluation
GIVEN an HMM M, and a sequence x,
FIND Prob (x | M)
ALGO. Forward
2. Decoding
GIVEN an HMM M, and a sequence x ,
FIND the sequence y of states that maximizes, e.g., P(y | x , M),
or the most probable subsequence of states
ALGO. Viterbi, Forward-backward
3. Learning
GIVEN an HMM M, with unspecified transition/emission probs.,
and a sequence x,
FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ)
ALGO. Baum-Welch (EM)
Eric Xing © Eric Xing @ CMU, 2006-2010 40
Applications of HMMs
 Some early applications of HMMs
 finance, but we never saw them
 speech recognition
 modelling ion channels
 In the mid-late 1980s HMMs entered genetics and molecular
biology, and they are now firmly entrenched.
 Some current applications of HMMs to biology
 mapping chromosomes
 aligning biological sequences
 predicting sequence structure
 inferring evolutionary relationships
 finding genes in DNA sequence
Eric Xing © Eric Xing @ CMU, 2006-2010 41
The Forward Algorithm
 We want to calculate P(x), the likelihood of x, given the HMM
 Sum over all possible ways of generating x:
 To avoid summing over an exponential number of paths y, define
(the forward probability)
 The recursion:
   p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})
                 = \sum_{y_1} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)

   \alpha_t^k \;\overset{\mathrm{def}}{=}\; P(x_1, \dots, x_t, y_t^k = 1)

   \alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i \, a_{i,k}

   P(\mathbf{x}) = \sum_k \alpha_T^k
Eric Xing © Eric Xing @ CMU, 2006-2010 42
The Forward Algorithm –
derivation
 Compute the forward probability:
   \alpha_t^k = P(x_1, \dots, x_{t-1}, x_t, y_t^k = 1)
             = \sum_{y_{t-1}} P(x_1, \dots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1}, x_1, \dots, x_{t-1})\, P(x_t \mid y_t^k = 1, x_1, \dots, x_{t-1}, y_{t-1})
             = \sum_{y_{t-1}} P(x_1, \dots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1})\, P(x_t \mid y_t^k = 1)
             = P(x_t \mid y_t^k = 1) \sum_i P(x_1, \dots, x_{t-1}, y_{t-1}^i = 1)\, P(y_t^k = 1 \mid y_{t-1}^i = 1)
             = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i \, a_{i,k}

 Chain rule: P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)
Eric Xing © Eric Xing @ CMU, 2006-2010 43
The Forward Algorithm
 We can compute \alpha_t^k for all k, t, using dynamic programming!
   Initialization:  \alpha_1^k = P(x_1, y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, P(y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, \pi_k
   Iteration:       \alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i \, a_{i,k}
   Termination:     P(\mathbf{x}) = \sum_k \alpha_T^k
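The three steps above map directly onto a short numpy sketch. The casino parameters are re-declared so the block runs on its own; there is no rescaling here, so this is only suitable for short sequences like the casino example.

```python
# Forward algorithm sketch: alpha[t, k] = p(x_t | y_t = k) * sum_i alpha[t-1, i] * a_{i,k}.
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def forward(x):
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0] - 1]                       # initialization
    for t in range(1, T):
        alpha[t] = B[:, x[t] - 1] * (alpha[t - 1] @ A)   # iteration
    return alpha, alpha[-1].sum()                        # termination: p(x) = sum_k alpha[T, k]

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
alpha, px = forward(x)
print(alpha[:2])   # first rows match the "Alpha (actual)" table a few slides below: 0.0833 0.0500, 0.0136 0.0052
print(px)
```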
Eric Xing © Eric Xing @ CMU, 2006-2010 44
The Backward Algorithm
 We want to compute P(y_t^k = 1 \mid \mathbf{x}),
the posterior probability distribution on the t-th position, given x
 We start by computing the joint:
   P(y_t^k = 1, \mathbf{x}) = P(x_1, \dots, x_t, y_t^k = 1, x_{t+1}, \dots, x_T)
                            = P(x_1, \dots, x_t, y_t^k = 1)\, P(x_{t+1}, \dots, x_T \mid x_1, \dots, x_t, y_t^k = 1)
                            = P(x_1, \dots, x_t, y_t^k = 1)\, P(x_{t+1}, \dots, x_T \mid y_t^k = 1)
                            = \alpha_t^k \, \beta_t^k
   Forward:  \alpha_t^k
   Backward: \beta_t^k \;\overset{\mathrm{def}}{=}\; P(x_{t+1}, \dots, x_T \mid y_t^k = 1)
 The recursion:
   \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
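A matching sketch of the backward recursion, again with the casino parameters re-declared and no rescaling; the printed rows can be compared against the "Beta (actual)" table on a later slide.

```python
# Backward algorithm sketch: beta[t, k] = sum_i a_{k,i} * p(x_{t+1} | y_{t+1} = i) * beta[t+1, i],
# initialized with beta[T, k] = 1.
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def backward(x):
    T, K = len(x), len(pi)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1] - 1] * beta[t + 1])
    return beta

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
beta = backward(x)
print(beta[-2:])   # last rows match the "Beta (actual)" table: 0.1633 0.1033, 1.0000 1.0000
```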
Eric Xing © Eric Xing @ CMU, 2006-2010 45
Example:
(Model: the dishonest casino HMM above, with the fair/loaded emission tables and the
0.95 / 0.05 transition probabilities.)

x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

   \alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i \, a_{i,k}
   \beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
Eric Xing © Eric Xing @ CMU, 2006-2010 46
Alpha (actual), rows t = 1..10, columns (FAIR, LOADED):
  0.0833  0.0500
  0.0136  0.0052
  0.0022  0.0006
  0.0004  0.0001
  0.0001  0.0000
  0.0000  0.0000
  0.0000  0.0000
  0.0000  0.0000
  0.0000  0.0000
  0.0000  0.0000
Beta (actual), rows t = 1..10, columns (FAIR, LOADED):
  0.0000  0.0000
  0.0000  0.0000
  0.0000  0.0000
  0.0000  0.0000
  0.0001  0.0001
  0.0007  0.0006
  0.0045  0.0055
  0.0264  0.0112
  0.1633  0.1033
  1.0000  1.0000
(Model: the dishonest casino HMM above; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4;
 \alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k},
 \beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i.)
Eric Xing © Eric Xing @ CMU, 2006-2010 47
Alpha (logs), rows t = 1..10, columns (FAIR, LOADED):
   -2.4849   -2.9957
   -4.2969   -5.2655
   -6.1201   -7.4896
   -7.9499   -9.6553
   -9.7834  -10.1454
  -11.5905  -12.4264
  -13.4110  -14.6657
  -15.2391  -15.2407
  -17.0310  -17.5432
  -18.8430  -19.8129
Beta (logs), rows t = 1..10, columns (FAIR, LOADED):
  -16.2439  -17.2014
  -14.4185  -14.9922
  -12.6028  -12.7337
  -10.8042  -10.4389
   -9.0373   -9.7289
   -7.2181   -7.4833
   -5.4135   -5.1977
   -3.6352   -4.4938
   -1.8120   -2.2698
    0         0
(Model: the dishonest casino HMM above; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4;
 same recursions as on the previous slide.)
Eric Xing © Eric Xing @ CMU, 2006-2010 48
What is the probability of a
hidden state prediction?
Eric Xing © Eric Xing @ CMU, 2006-2010 49
Posterior decoding
 We can now calculate
   p(y_t^k = 1 \mid \mathbf{x}) = \frac{p(y_t^k = 1, \mathbf{x})}{p(\mathbf{x})} = \frac{\alpha_t^k \, \beta_t^k}{p(\mathbf{x})}
 Then, we can ask
 What is the most likely state at position t of sequence x:
   k_t^* = \arg\max_k \, p(y_t^k = 1 \mid \mathbf{x})
 Note that this is an MPA of a single hidden state;
what if we want an MPA of the whole hidden state sequence?
 Posterior Decoding:
   \{ \hat y_t = y_t^{k_t^*} : t = 1, \dots, T \}
 This is different from MPA of a whole sequence of hidden
states
 This can be understood as bit error rate
vs. word error rate
Example:
  MPA of X?   MPA of (X, Y)?
    x   y   P(x, y)
    0   0   0.35
    0   1   0.05
    1   0   0.3
    1   1   0.3
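Putting the forward and backward recursions together gives posterior decoding. Below is a self-contained sketch (forward and backward inlined; variable names illustrative) that computes γ_t^k = α_t^k β_t^k / p(x) and the per-position argmax for the casino sequence.

```python
# Posterior decoding sketch: gamma[t, k] = alpha[t, k] * beta[t, k] / p(x),
# then argmax over k at each position t (per-position MPA, not the MPA of the
# whole sequence).
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
T, K = len(x), len(pi)

alpha = np.zeros((T, K)); alpha[0] = pi * B[:, x[0] - 1]
for t in range(1, T):
    alpha[t] = B[:, x[t] - 1] * (alpha[t - 1] @ A)
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1] - 1] * beta[t + 1])

gamma = alpha * beta / alpha[-1].sum()    # p(y_t = k | x)
print(gamma.round(3))
print(gamma.argmax(axis=1))               # most likely state at each position (0 = FAIR, 1 = LOADED)
```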
Eric Xing © Eric Xing @ CMU, 2006-2010 50
Viterbi decoding
 GIVEN x = x1, …, xT, we want to find y = y1, …, yT, such that
P(y|x) is maximized:
y^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} P(\mathbf{y}, \mathbf{x})
 Let
   V_t^k = \max_{\{y_1, \dots, y_{t-1}\}} P(x_1, \dots, x_{t-1}, y_1, \dots, y_{t-1}, x_t, y_t^k = 1)
= Probability of most likely sequence of states ending at state y_t = k
 The recursion:
   V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k} V_{t-1}^i
   where  p(x_1, \dots, x_t, y_1, \dots, y_t) = \pi_{y_1} a_{y_1, y_2} \cdots a_{y_{t-1}, y_t}\, b_{y_1, x_1} \cdots b_{y_t, x_t}
 Underflows are a significant problem
 These numbers become extremely small – underflow
 Solution: Take the logs of all values:
   \log V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + \log V_{t-1}^i \big)
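A sketch of Viterbi in log space, as suggested above. The backpointer bookkeeping for recovering the path is standard but not spelled out on the slide, and the function name and conventions are illustrative.

```python
# Viterbi sketch in log space to avoid underflow:
# log V[t, k] = log p(x_t | y_t = k) + max_i (log a_{i,k} + log V[t-1, i]).
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def viterbi(x):
    T, K = len(x), len(pi)
    logV = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    logV[0] = np.log(pi) + np.log(B[:, x[0] - 1])
    for t in range(1, T):
        scores = logV[t - 1][:, None] + np.log(A)       # scores[i, k]
        back[t] = scores.argmax(axis=0)
        logV[t] = np.log(B[:, x[t] - 1]) + scores.max(axis=0)
    # backtrack the most probable state path
    path = [int(logV[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], logV[-1].max()

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
print(viterbi(x))   # state path (0 = FAIR, 1 = LOADED) and its log probability
```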
Eric Xing © Eric Xing @ CMU, 2006-2010 51
Computational Complexity and
implementation details
 What is the running time, and space required, for Forward,
and Backward?
Time: O(K^2 N); Space: O(KN).
 Useful implementation technique to avoid underflows
 Viterbi: sum of logs
 Forward/Backward: rescaling at each position by multiplying by a constant
   \alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i \, a_{i,k}
   \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
   V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k} V_{t-1}^i
Eric Xing © Eric Xing @ CMU, 2006-2010 52
(Homework!)
Learning HMM
 Given x = x1…xN for which the true state path y = y1…yN is
known,
 Define:
Aij = # times state transition i→j occurs in y
Bik = # times state i in y emits k in x
 We can show that the maximum likelihood parameters θ are:
 What if x is continuous? We can treat \{(x_{n,t}, y_{n,t}) : t = 1..T,\, n = 1..N\} as N×T
observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …
   a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}

   b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}
   (Homework!)

 The complete log likelihood of the fully observed data:
   \ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y})
     = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)
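The supervised case is just counting and normalizing. A minimal sketch; the short labeled run at the bottom is made up for illustration, and the code assumes every state actually occurs (otherwise the counts would need smoothing).

```python
# Supervised MLE sketch: count transitions and emissions along a known state
# path and normalize, matching a_ij = #(i->j)/#(i->.) and b_ik = #(i emits k)/#(i).
import numpy as np

def supervised_mle(x, y, K, V):
    """x: observation indices (0..V-1), y: state indices (0..K-1)."""
    A_counts = np.zeros((K, K))
    B_counts = np.zeros((K, V))
    for t in range(1, len(y)):
        A_counts[y[t - 1], y[t]] += 1
    for t in range(len(y)):
        B_counts[y[t], x[t]] += 1
    # assumes every state appears at least once; otherwise smooth the counts
    a = A_counts / A_counts.sum(axis=1, keepdims=True)
    b = B_counts / B_counts.sum(axis=1, keepdims=True)
    return a, b

# Illustrative: a short made-up labeled run (0 = FAIR, 1 = LOADED).
x = [0, 1, 5, 5, 2, 5, 5, 3]
y = [0, 0, 0, 1, 1, 1, 1, 0]
print(supervised_mle(x, y, K=2, V=6))
```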
Eric Xing © Eric Xing @ CMU, 2006-2010 53
Unsupervised ML estimation
 Given x = x1…xN for which the true state path y = y1…yN is
unknown,
 EXPECTATION MAXIMIZATION
0. Starting with our best guess of a model M, parameters θ:
1. Estimate A_ij, B_ik in the training data
 How?
   A_{ij} = \sum_n \sum_t \langle y_{n,t-1}^i\, y_{n,t}^j \rangle, \qquad B_{ik} = \sum_n \sum_t \langle y_{n,t}^i \rangle\, x_{n,t}^k \qquad \text{How? (homework)}
2. Update θ according to A_ij, B_ik
 Now a "supervised learning" problem
3. Repeat 1 & 2, until convergence
This is called the Baum-Welch Algorithm
We can get to a provably more (or equally) likely parameter set θ each iteration
Eric Xing © Eric Xing @ CMU, 2006-2010 54
The Baum-Welch algorithm
 The complete log likelihood:
   \ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y})
     = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)
 The expected complete log likelihood:
   \langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle
     = \sum_n \Big( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \Big)
     + \sum_n \sum_{t=2}^{T} \Big( \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \Big)
     + \sum_n \sum_{t=1}^{T} \Big( x_{n,t}^k \, \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \Big)
 EM
 The E step:
   \gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)
   \xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)
 The M step ("symbolically" identical to MLE):
   \pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}
   a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}
   b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}
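A compact Baum-Welch sketch for a single observation sequence: the E-step computes γ and ξ by forward-backward, and the M-step normalizes them as in the update formulas above. The random initialization, the function name, and the lack of rescaling (it will underflow on long sequences; see the implementation note on an earlier slide) are simplifications for illustration.

```python
# Baum-Welch sketch for one observation sequence of symbols in {0, ..., V-1}.
import numpy as np

def baum_welch(x, K, V, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    A = rng.dirichlet(np.ones(K), size=K)      # random row-stochastic init
    B = rng.dirichlet(np.ones(V), size=K)
    T = len(x)
    for _ in range(n_iters):
        # E-step: forward-backward (no rescaling, so keep sequences short)
        alpha = np.zeros((T, K)); beta = np.ones((T, K))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px                                  # p(y_t = i | x)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / px         # p(y_{t-1}=i, y_t=j | x)
        # M-step: normalize expected counts (single sequence, so pi = gamma at t=1;
        # with multiple sequences the slide's formulas average over n)
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for t in range(T):
            B[:, x[t]] += gamma[t]
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Illustrative usage on a short made-up symbol sequence:
print(baum_welch([0, 5, 5, 5, 1, 2, 5, 5], K=2, V=6, n_iters=20))
```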
Eric Xing © Eric Xing @ CMU, 2006-2010 55
Summary
 Modeling hidden transitional trajectories (in discrete state
space, such as cluster label, DNA copy number, dice-choice,
etc.) underlying observed sequence data (discrete, such as
dice outcomes; or continuous, such as CGH signals)
 Useful for parsing, segmenting sequential data
 Important HMM computations:
 The joint likelihood of a parse and data can be written as a product of local terms
(i.e., initial prob, transition prob, emission prob.)
 Computing marginal likelihood of the observed sequence: forward algorithm
 Predicting a single hidden state: forward-backward
 Predicting an entire sequence of hidden states: viterbi
 Learning HMM parameters: an EM algorithm known as Baum-Welch
More Related Content

What's hot

Hierarchical matrix techniques for maximum likelihood covariance estimation
Hierarchical matrix techniques for maximum likelihood covariance estimationHierarchical matrix techniques for maximum likelihood covariance estimation
Hierarchical matrix techniques for maximum likelihood covariance estimationAlexander Litvinenko
 
Lecture note4coordinatedescent
Lecture note4coordinatedescentLecture note4coordinatedescent
Lecture note4coordinatedescentXudong Sun
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Frank Nielsen
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydSri Ambati
 
NTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANsNTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANsMark Chang
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBOYoonho Lee
 
Low-rank matrix approximations in Python by Christian Thurau PyData 2014
Low-rank matrix approximations in Python by Christian Thurau PyData 2014Low-rank matrix approximations in Python by Christian Thurau PyData 2014
Low-rank matrix approximations in Python by Christian Thurau PyData 2014PyData
 
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...Matt Moores
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsTailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsFrank Nielsen
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Data sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionData sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionAlexander Litvinenko
 
Patch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective DivergencesPatch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective DivergencesFrank Nielsen
 
Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Alexander Litvinenko
 
Hokusai - Sketching streams in real time
Hokusai - Sketching streams in real timeHokusai - Sketching streams in real time
Hokusai - Sketching streams in real timeSergiy Matusevych
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 

What's hot (20)

Hierarchical matrix techniques for maximum likelihood covariance estimation
Hierarchical matrix techniques for maximum likelihood covariance estimationHierarchical matrix techniques for maximum likelihood covariance estimation
Hierarchical matrix techniques for maximum likelihood covariance estimation
 
Lecture note4coordinatedescent
Lecture note4coordinatedescentLecture note4coordinatedescent
Lecture note4coordinatedescent
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
 
NTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANsNTHU AI Reading Group: Improved Training of Wasserstein GANs
NTHU AI Reading Group: Improved Training of Wasserstein GANs
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
Low-rank matrix approximations in Python by Christian Thurau PyData 2014
Low-rank matrix approximations in Python by Christian Thurau PyData 2014Low-rank matrix approximations in Python by Christian Thurau PyData 2014
Low-rank matrix approximations in Python by Christian Thurau PyData 2014
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
R package 'bayesImageS': a case study in Bayesian computation using Rcpp and ...
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest NeighborsTailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest Neighbors
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Data sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionData sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansion
 
Patch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective DivergencesPatch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective Divergences
 
Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
talk MCMC & SMC 2004
talk MCMC & SMC 2004talk MCMC & SMC 2004
talk MCMC & SMC 2004
 
Hokusai - Sketching streams in real time
Hokusai - Sketching streams in real timeHokusai - Sketching streams in real time
Hokusai - Sketching streams in real time
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 

Viewers also liked

A Quick Introduction to the Chow Liu Algorithm
A Quick Introduction to the Chow Liu AlgorithmA Quick Introduction to the Chow Liu Algorithm
A Quick Introduction to the Chow Liu AlgorithmJee Vang, Ph.D.
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modeljins0618
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and morehsharmasshare
 

Viewers also liked (6)

A Quick Introduction to the Chow Liu Algorithm
A Quick Introduction to the Chow Liu AlgorithmA Quick Introduction to the Chow Liu Algorithm
A Quick Introduction to the Chow Liu Algorithm
 
2014 9-22
2014 9-222014 9-22
2014 9-22
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture model
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and more
 

Similar to Lecture9 xing

Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationAlexander Litvinenko
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsAlexander Litvinenko
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationAlexander Litvinenko
 
The Universal Measure for General Sources and its Application to MDL/Bayesian...
The Universal Measure for General Sources and its Application to MDL/Bayesian...The Universal Measure for General Sources and its Application to MDL/Bayesian...
The Universal Measure for General Sources and its Application to MDL/Bayesian...Joe Suzuki
 
Deep generative model.pdf
Deep generative model.pdfDeep generative model.pdf
Deep generative model.pdfHyungjoo Cho
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyFrank Nielsen
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
Metrics for generativemodels
Metrics for generativemodelsMetrics for generativemodels
Metrics for generativemodelsDai-Hai Nguyen
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clusteringFrank Nielsen
 
ABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsUmberto Picchini
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMWireilla
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 
Fixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesFixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesAlexander Decker
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Tomoya Murata
 
Pseudo Random Number Generators
Pseudo Random Number GeneratorsPseudo Random Number Generators
Pseudo Random Number GeneratorsDarshini Parikh
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodFrank Nielsen
 

Similar to Lecture9 xing (20)

Lecture12 xing
Lecture12 xingLecture12 xing
Lecture12 xing
 
Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty Quantification
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEs
 
Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantification
 
The Universal Measure for General Sources and its Application to MDL/Bayesian...
The Universal Measure for General Sources and its Application to MDL/Bayesian...The Universal Measure for General Sources and its Application to MDL/Bayesian...
The Universal Measure for General Sources and its Application to MDL/Bayesian...
 
sada_pres
sada_pressada_pres
sada_pres
 
Deep generative model.pdf
Deep generative model.pdfDeep generative model.pdf
Deep generative model.pdf
 
Clustering-beamer.pdf
Clustering-beamer.pdfClustering-beamer.pdf
Clustering-beamer.pdf
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropy
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
 
Metrics for generativemodels
Metrics for generativemodelsMetrics for generativemodels
Metrics for generativemodels
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clustering
 
ABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space models
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Fixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesFixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spaces
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
 
Pseudo Random Number Generators
Pseudo Random Number GeneratorsPseudo Random Number Generators
Pseudo Random Number Generators
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 

More from Tianlu Wang

14 pro resolution
14 pro resolution14 pro resolution
14 pro resolutionTianlu Wang
 
13 propositional calculus
13 propositional calculus13 propositional calculus
13 propositional calculusTianlu Wang
 
12 adversal search
12 adversal search12 adversal search
12 adversal searchTianlu Wang
 
11 alternative search
11 alternative search11 alternative search
11 alternative searchTianlu Wang
 
21 situation calculus
21 situation calculus21 situation calculus
21 situation calculusTianlu Wang
 
20 bayes learning
20 bayes learning20 bayes learning
20 bayes learningTianlu Wang
 
19 uncertain evidence
19 uncertain evidence19 uncertain evidence
19 uncertain evidenceTianlu Wang
 
18 common knowledge
18 common knowledge18 common knowledge
18 common knowledgeTianlu Wang
 
17 2 expert systems
17 2 expert systems17 2 expert systems
17 2 expert systemsTianlu Wang
 
17 1 knowledge-based system
17 1 knowledge-based system17 1 knowledge-based system
17 1 knowledge-based systemTianlu Wang
 
16 2 predicate resolution
16 2 predicate resolution16 2 predicate resolution
16 2 predicate resolutionTianlu Wang
 
16 1 predicate resolution
16 1 predicate resolution16 1 predicate resolution
16 1 predicate resolutionTianlu Wang
 
09 heuristic search
09 heuristic search09 heuristic search
09 heuristic searchTianlu Wang
 
08 uninformed search
08 uninformed search08 uninformed search
08 uninformed searchTianlu Wang
 

More from Tianlu Wang (20)

L7 er2
L7 er2L7 er2
L7 er2
 
L8 design1
L8 design1L8 design1
L8 design1
 
L9 design2
L9 design2L9 design2
L9 design2
 
14 pro resolution
14 pro resolution14 pro resolution
14 pro resolution
 
13 propositional calculus
13 propositional calculus13 propositional calculus
13 propositional calculus
 
12 adversal search
12 adversal search12 adversal search
12 adversal search
 
11 alternative search
11 alternative search11 alternative search
11 alternative search
 
10 2 sum
10 2 sum10 2 sum
10 2 sum
 
22 planning
22 planning22 planning
22 planning
 
21 situation calculus
21 situation calculus21 situation calculus
21 situation calculus
 
20 bayes learning
20 bayes learning20 bayes learning
20 bayes learning
 
19 uncertain evidence
19 uncertain evidence19 uncertain evidence
19 uncertain evidence
 
18 common knowledge
18 common knowledge18 common knowledge
18 common knowledge
 
17 2 expert systems
17 2 expert systems17 2 expert systems
17 2 expert systems
 
17 1 knowledge-based system
17 1 knowledge-based system17 1 knowledge-based system
17 1 knowledge-based system
 
16 2 predicate resolution
16 2 predicate resolution16 2 predicate resolution
16 2 predicate resolution
 
16 1 predicate resolution
16 1 predicate resolution16 1 predicate resolution
16 1 predicate resolution
 
15 predicate
15 predicate15 predicate
15 predicate
 
09 heuristic search
09 heuristic search09 heuristic search
09 heuristic search
 
08 uninformed search
08 uninformed search08 uninformed search
08 uninformed search
 

Recently uploaded

FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiMalviyaNagarCallGirl
 
SHIVNA SAHITYIKI APRIL JUNE 2024 Magazine
SHIVNA SAHITYIKI APRIL JUNE 2024 MagazineSHIVNA SAHITYIKI APRIL JUNE 2024 Magazine
SHIVNA SAHITYIKI APRIL JUNE 2024 MagazineShivna Prakashan
 
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...akbard9823
 
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | DelhiMalviyaNagarCallGirl
 
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiFULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiMalviyaNagarCallGirl
 
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...anilsa9823
 
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Servicesonnydelhi1992
 
Downtown Call Girls O5O91O128O Pakistani Call Girls in Downtown
Downtown Call Girls O5O91O128O Pakistani Call Girls in DowntownDowntown Call Girls O5O91O128O Pakistani Call Girls in Downtown
Downtown Call Girls O5O91O128O Pakistani Call Girls in Downtowndajasot375
 
San Jon Motel, Motel/Residence, San Jon NM
San Jon Motel, Motel/Residence, San Jon NMSan Jon Motel, Motel/Residence, San Jon NM
San Jon Motel, Motel/Residence, San Jon NMroute66connected
 
Bridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.comBridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.comthephillipta
 
FULL ENJOY - 9953040155 Call Girls in Burari | Delhi
FULL ENJOY - 9953040155 Call Girls in Burari | DelhiFULL ENJOY - 9953040155 Call Girls in Burari | Delhi
FULL ENJOY - 9953040155 Call Girls in Burari | DelhiMalviyaNagarCallGirl
 
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad EscortsCall girls in Ahmedabad High profile
 
MinSheng Gaofeng Estate commercial storyboard
MinSheng Gaofeng Estate commercial storyboardMinSheng Gaofeng Estate commercial storyboard
MinSheng Gaofeng Estate commercial storyboardjessica288382
 
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...anilsa9823
 
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...anilsa9823
 
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | DelhiFULL ENJOY - 9953040155 Call Girls in Paschim Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | DelhiMalviyaNagarCallGirl
 
FULL ENJOY - 9953040155 Call Girls in Laxmi Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Laxmi Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Laxmi Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Laxmi Nagar | DelhiMalviyaNagarCallGirl
 
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...gurkirankumar98700
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...akbard9823
 

Recently uploaded (20)

FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
 
SHIVNA SAHITYIKI APRIL JUNE 2024 Magazine
SHIVNA SAHITYIKI APRIL JUNE 2024 MagazineSHIVNA SAHITYIKI APRIL JUNE 2024 Magazine
SHIVNA SAHITYIKI APRIL JUNE 2024 Magazine
 
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...
Hazratganj ] (Call Girls) in Lucknow - 450+ Call Girl Cash Payment 🧄 89231135...
 
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi

Lecture9 xing

  • 11. Eric Xing © Eric Xing @ CMU, 2006-2010 11 Recall K-means
     Start: "guess" the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters
     Loop:
     For each point $n = 1$ to $N$, compute its cluster label: $z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \Sigma_k^{(t)\,-1} (x_n - \mu_k^{(t)})$
     For each cluster $k = 1{:}K$: $\mu_k^{(t+1)} = \dfrac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$, and $\Sigma_k^{(t+1)} = \ldots$
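The loop above is easy to state in code. Below is a minimal NumPy sketch of my own (not from the lecture), restricted to the Euclidean special case $\Sigma_k = I$; the function name kmeans and all variable names are illustrative assumptions, and the Mahalanobis variant on the slide would replace the squared distance with $(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)$ and re-estimate $\Sigma_k$ as well.

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=np.random.default_rng(0)):
    """Plain K-means (Euclidean distance), i.e. the Sigma_k = I special case of the slide."""
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)].astype(float)   # "guess" the centroids
    for _ in range(n_iters):
        # Assignment step: hard-assign each point to its nearest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K matrix of squared distances
        z = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```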
  • 12. Eric Xing © Eric Xing @ CMU, 2006-2010 12 Expectation-Maximization
     Start: "guess" the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters
     Loop
  • 13. Eric Xing © Eric Xing @ CMU, 2006-2010 13 E-step
     ─ Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given current est. of the parameters (i.e., $\pi$ and $\mu$):
     $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}) = \dfrac{\pi_k^{(t)} N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$
     Here we are essentially doing inference.
     (Graphical model: $Z_n \to X_n$, plate over N.)
  • 14. Eric Xing © Eric Xing @ CMU, 2006-2010 14 M-step
     ─ Maximization step: compute the parameters under the current expected values of the hidden variables:
     $\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle,\ \ \dfrac{\partial}{\partial \pi_k}\langle \ell_c(\theta) \rangle = 0\ \forall k,\ \text{s.t.}\ \textstyle\sum_k \pi_k = 1\ \Rightarrow\ \pi_k^{(t+1)} = \dfrac{\sum_n \tau_n^{k(t)}}{N} = \dfrac{\langle n_k \rangle}{N}$
     $\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle\ \Rightarrow\ \mu_k^{(t+1)} = \dfrac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$
     $\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle\ \Rightarrow\ \Sigma_k^{(t+1)} = \dfrac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$
     This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
     (Graphical model: $Z_n \to X_n$, plate over N.)
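As a concrete, hedged illustration of the E- and M-steps above, here is a short NumPy/SciPy sketch of EM for a Gaussian mixture. It is my own code rather than the lecture's, it assumes scipy.stats.multivariate_normal is available, and it adds a small ridge to each covariance purely for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, rng=np.random.default_rng(0)):
    """EM for a K-component Gaussian mixture; returns (pi, mu, Sigma, responsibilities)."""
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)].astype(float)
    Sigma = np.stack([np.atleast_2d(np.cov(X.T)) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = p(z_n^k = 1 | x_n, theta)
        dens = np.stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        tau = pi * dens
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with z_n^k replaced by its expectation tau[n, k]
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, tau
```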
  • 15. Eric Xing © Eric Xing @ CMU, 2006-2010 15 How is EM derived?
     A mixture of K Gaussians (graphical model: $Z_n \to X_n$, plate over N):
     Z is a latent class indicator vector: $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
     X is a conditional Gaussian variable with a class-specific mean/covariance: $p(x_n \mid z_n^k = 1, \mu, \Sigma) = \dfrac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}$
     The likelihood of a sample: $p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x \mid z^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k N(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k)$
     The "complete" likelihood: $p(x_n, z_n^k = 1 \mid \mu, \Sigma) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k N(x_n \mid \mu_k, \Sigma_k)$, or in vector form $p(x_n, z_n \mid \mu, \Sigma) = \prod_k \big( \pi_k N(x_n \mid \mu_k, \Sigma_k) \big)^{z_n^k}$
     But this is itself a random variable! Not good as an objective function.
  • 16. Eric Xing © Eric Xing @ CMU, 2006-2010 16 How is EM derived?
     The complete log likelihood:
     $\ell_c(\theta; D) = \log \prod_n p(x_n, z_n) = \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$
     The expected complete log likelihood:
     $\langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z|x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z|x)} = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \tfrac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \big)$
     We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the above iterative procedure.
     (Graphical model: $Z_n \to X_n$, plate over N.)
  • 17. Eric Xing © Eric Xing @ CMU, 2006-2010 17 Compare: K-means
     The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
     In the K-means "E-step" we do hard assignment: $z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \Sigma_k^{(t)\,-1} (x_n - \mu_k^{(t)})$
     In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1: $\mu_k^{(t+1)} = \dfrac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$
     whereas in EM the weights are the soft responsibilities $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}$: $\mu_k^{(t+1)} = \dfrac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$
  • 18. Eric Xing © Eric Xing @ CMU, 2006-2010 18 Theory underlying EM
     What are we doing?
     Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data.
     But we do not observe z, so computing $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$ is difficult!
     What shall we do?
  • 19. Eric Xing © Eric Xing @ CMU, 2006-2010 19 Complete & Incomplete Log Likelihoods
     Complete log likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then $\ell_c(\theta; x, z) \overset{\text{def}}{=} \log p(x, z \mid \theta)$
     Usually, optimizing $\ell_c()$ given both z and x is straightforward (c.f. MLE for fully observed models).
     Recall that in this case the objective, e.g., for MLE, decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
     But given that Z is not observed, $\ell_c()$ is a random quantity and cannot be maximized directly.
     Incomplete log likelihood: with z unobserved, our objective becomes the log of a marginal probability: $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
     This objective won't decouple.
  • 20. Eric Xing © Eric Xing @ CMU, 2006-2010 20 Expected Complete Log Likelihood
     For any distribution q(z), define the expected complete log likelihood: $\langle \ell_c(\theta; x, z) \rangle_q \overset{\text{def}}{=} \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$
     A deterministic function of $\theta$; linear in $\ell_c()$, so it inherits its factorizability.
     Does maximizing this surrogate yield a maximizer of the likelihood?
     Jensen's inequality:
     $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \dfrac{p(x, z \mid \theta)}{q(z \mid x)} \ \ge\ \sum_z q(z \mid x) \log \dfrac{p(x, z \mid \theta)}{q(z \mid x)}$
     $\Rightarrow\ \ell(\theta; x) \ \ge\ \langle \ell_c(\theta; x, z) \rangle_q + H_q$
  • 21. Eric Xing © Eric Xing @ CMU, 2006-2010 21 Lower Bounds and Free Energy
     For fixed data x, define a functional called the free energy: $F(q, \theta) \overset{\text{def}}{=} \sum_z q(z \mid x) \log \dfrac{p(x, z \mid \theta)}{q(z \mid x)} \ \le\ \ell(\theta; x)$
     The EM algorithm is coordinate-ascent on F:
     E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
     M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)$
  • 22. Eric Xing © Eric Xing @ CMU, 2006-2010 22 E-step: maximization of expected $\ell_c$ w.r.t. q
     Claim: $q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$
     This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
     Proof (easy): this setting attains the bound $\ell(\theta; x) \ge F(q, \theta)$:
     $F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \dfrac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$
     Can also show this result using variational calculus or the fact that $\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$
  • 23. Eric Xing © Eric Xing @ CMU, 2006-2010 23 E-step ≡ plug in posterior expectation of latent variables
     Without loss of generality, assume that $p(x, z \mid \theta)$ is a generalized exponential family distribution: $p(x, z \mid \theta) = \dfrac{1}{Z(\theta)} h(x, z) \exp\big\{ \sum_i \theta_i f_i(x, z) \big\}$
     Special case: if $p(X \mid Z)$ are GLIMs, then $f_i(x, z) = \eta_i^T(z)\, \xi_i(x)$
     The expected complete log likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is
     $\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) - A(\theta) = \sum_i \theta_i \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)$
     which in the GLIM case becomes $\sum_i \eta_i^T\big( \langle z \rangle_{q(z \mid x, \theta^t)} \big)\, \xi_i(x) - A(\theta)$
  • 24. Eric Xing © Eric Xing @ CMU, 2006-2010 24 M-step: maximization of expected $\ell_c$ w.r.t. $\theta$
     Note that the free energy breaks into two terms:
     $F(q, \theta) = \sum_z q(z \mid x) \log \dfrac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$
     The first term is the expected complete log likelihood (energy) and the second term, which does not depend on $\theta$, is the entropy.
     Thus, in the M-step, maximizing with respect to $\theta$ for fixed q we only need to consider the first term: $\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)$
     Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE of the fully observed model $p(x, z \mid \theta)$, with the sufficient statistics involving z replaced by their expectations w.r.t. $p(z \mid x, \theta)$.
  • 25. Eric Xing © Eric Xing @ CMU, 2006-2010 25 Summary: EM Algorithm
     A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
     1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
     2. Using this "complete" data, find the maximum likelihood parameter estimates.
     Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
     E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
     M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)$
     In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
  • 26. Eric Xing © Eric Xing @ CMU, 2006-2010 26 From static to dynamic mixture models
     Static mixture: a single latent class $Y_1$ generates observation $X_1$ (plate over N).
     Dynamic mixture: a chain of latent states $Y_1 \to Y_2 \to Y_3 \to \ldots \to Y_T$, each emitting an observation $X_t$.
     The sequence: speech signal, sequence of rolls, ...
     The underlying source: phonemes, dice, ...
  • 27. Eric Xing © Eric Xing @ CMU, 2006-2010 27 Predicting Tumor Cell States (figure: chromosomes of a tumor cell)
  • 28. Eric Xing © Eric Xing @ CMU, 2006-2010 28 DNA copy number aberration types in breast cancer
     Figure panels: copy number profile for chromosome 1 from the 600 MPE cell line; copy number profile for chromosome 8 from the COLO320 cell line (60-70 fold amplification of the CMYC region); copy number profile for chromosome 8 in the MDA-MB-231 cell line (deletion).
  • 29. Eric Xing © Eric Xing @ CMU, 2006-2010 29 A real CGH run
  • 30. Eric Xing © Eric Xing @ CMU, 2006-2010 30 Hidden Markov Model
     (Graphical model: a chain $y_1 \to y_2 \to \ldots \to y_T$ with emissions $y_t \to x_t$; equivalently a state automaton over states $1, 2, \ldots, K$.)
     Observation space: an alphabetic set $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$ or a Euclidean space $\mathbb{R}^d$
     Index set of hidden states: $\mathcal{I} = \{1, 2, \ldots, M\}$
     Transition probabilities between any two states: $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$
     Start probabilities: $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
     Emission probabilities associated with each state: $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$, or in general $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$
  • 31. Eric Xing © Eric Xing @ CMU, 2006-2010 31 The Dishonest Casino
     A casino has two dice:
     Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
     Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
     The casino player switches back-&-forth between the fair and loaded die once every 20 turns
     Game: 1. You bet $1. 2. You roll (always with a fair die). 3. Casino player rolls (maybe with the fair die, maybe with the loaded die). 4. Highest number wins $2.
  • 32. Eric Xing © Eric Xing @ CMU, 2006-2010 32 The Dishonest Casino Model
     States: FAIR and LOADED; transitions: P(stay) = 0.95, P(switch) = 0.05 in each direction.
     Emissions from FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
     Emissions from LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
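For concreteness, this model is small enough to simulate directly. A possible NumPy sketch follows (my own illustration, not part of the lecture; the slide does not give a start distribution, so the uniform 1/2-1/2 start used in the later worked examples is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dishonest-casino HMM: states 0 = Fair, 1 = Loaded.
pi = np.array([0.5, 0.5])                    # start probabilities (assumption: uniform, as in the worked examples)
A = np.array([[0.95, 0.05],                  # transition matrix A[i, j] = P(y_t = j | y_{t-1} = i)
              [0.05, 0.95]])
B = np.array([[1/6] * 6,                     # emission probs B[i, k] = P(roll = k+1 | state i)
              [1/10] * 5 + [1/2]])

def sample_casino(T):
    """Sample a length-T sequence of (hidden states, rolls) from the casino HMM."""
    ys, xs = [], []
    y = rng.choice(2, p=pi)
    for _ in range(T):
        xs.append(rng.choice(6, p=B[y]) + 1)  # die face 1..6
        ys.append(y)
        y = rng.choice(2, p=A[y])             # move to the next hidden state
    return np.array(ys), np.array(xs)

states, rolls = sample_casino(100)
```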
  • 33. Eric Xing © Eric Xing @ CMU, 2006-2010 33 Puzzles Regarding the Dishonest Casino GIVEN: A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION  How likely is this sequence, given our model of how the casino works?  This is the EVALUATION problem in HMMs  What portion of the sequence was generated with the fair die, and what portion with the loaded die?  This is the DECODING question in HMMs  How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back?  This is the LEARNING question in HMMs
  • 34. Eric Xing © Eric Xing @ CMU, 2006-2010 34 Joint Probability 1245526462146146136136661664661636616366163616515615115146123562344
  • 35. Eric Xing © Eric Xing @ CMU, 2006-2010 35 Probability of a Parse
     Given a sequence $x = x_1 \ldots x_T$ and a parse $y = y_1, \ldots, y_T$, to find how likely the parse is (given our HMM and the sequence):
     $p(x, y) = p(x_1 \ldots x_T, y_1, \ldots, y_T)$ (joint probability)
     $= p(y_1)\, p(x_1 \mid y_1)\, p(y_2 \mid y_1)\, p(x_2 \mid y_2) \cdots p(y_T \mid y_{T-1})\, p(x_T \mid y_T)$
     $= p(y_1)\, p(y_2 \mid y_1) \cdots p(y_T \mid y_{T-1}) \times p(x_1 \mid y_1)\, p(x_2 \mid y_2) \cdots p(x_T \mid y_T)$
     Marginal probability: $p(x) = \sum_y p(x, y) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^T a_{y_{t-1}, y_t} \prod_{t=1}^T p(x_t \mid y_t)$
     Posterior probability: $p(y \mid x) = p(x, y) / p(x)$
     (Graphical model: the HMM chain $y_1 \to \ldots \to y_T$ with emissions $x_t$.)
  • 36. Eric Xing © Eric Xing @ CMU, 2006-2010 36 Example: the Dishonest Casino
     Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
     Then, what is the likelihood of y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs $a_{0,\mathrm{Fair}} = 1/2$, $a_{0,\mathrm{Loaded}} = 1/2$)
     $\tfrac{1}{2} \times P(1 \mid \mathrm{Fair})\, P(\mathrm{Fair} \mid \mathrm{Fair})\, P(2 \mid \mathrm{Fair})\, P(\mathrm{Fair} \mid \mathrm{Fair}) \cdots P(4 \mid \mathrm{Fair}) = \tfrac{1}{2} \times (1/6)^{10} \times (0.95)^9 = 0.00000000521 \approx 5.21 \times 10^{-9}$
  • 37. Eric Xing © Eric Xing @ CMU, 2006-2010 37 Example: the Dishonest Casino
     So, the likelihood that the die is fair in all this run is just $5.21 \times 10^{-9}$.
     OK, but what is the likelihood of y = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
     $\tfrac{1}{2} \times P(1 \mid \mathrm{Loaded})\, P(\mathrm{Loaded} \mid \mathrm{Loaded}) \cdots P(4 \mid \mathrm{Loaded}) = \tfrac{1}{2} \times (1/10)^{8} \times (1/2)^{2} \times (0.95)^{9} = 0.00000000079 \approx 0.79 \times 10^{-9}$
     Therefore, it is after all 6.59 times more likely that the die is fair all the way than that it is loaded all the way.
  • 38. Eric Xing © Eric Xing @ CMU, 2006-2010 38 Example: the Dishonest Casino
     Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
     Now, what is the likelihood of y = F, F, ..., F?
     $\tfrac{1}{2} \times (1/6)^{10} \times (0.95)^{9} \approx 5.2 \times 10^{-9}$, same as before
     What is the likelihood of y = L, L, ..., L?
     $\tfrac{1}{2} \times (1/10)^{4} \times (1/2)^{6} \times (0.95)^{9} = 0.00000049 \approx 5 \times 10^{-7}$
     So, it is about 100 times more likely that the die is loaded.
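The arithmetic in these worked examples is easy to check directly; the snippet below simply re-does the multiplications in plain Python, with no assumptions beyond the probabilities stated on the slides.

```python
# All-Fair parse: same value for either roll sequence (ten emissions of prob 1/6, nine 0.95 self-transitions).
p_fair = 0.5 * (1/6)**10 * 0.95**9
# All-Loaded parse for x = 1,2,1,5,6,2,1,6,2,4 (eight non-sixes, two sixes).
p_loaded_seq1 = 0.5 * (1/10)**8 * (1/2)**2 * 0.95**9
# All-Loaded parse for x = 1,6,6,5,6,2,6,6,3,6 (four non-sixes, six sixes).
p_loaded_seq2 = 0.5 * (1/10)**4 * (1/2)**6 * 0.95**9

print(p_fair)                     # ~5.21e-09
print(p_loaded_seq1)              # ~7.9e-10  -> Fair is ~6.6x more likely for the first sequence
print(p_loaded_seq2)              # ~4.9e-07  -> Loaded is ~95x (roughly 100x) more likely for the second
```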
  • 39. Eric Xing © Eric Xing @ CMU, 2006-2010 39 Three Main Questions on HMMs 1. Evaluation GIVEN an HMM M, and a sequence x, FIND Prob (x | M) ALGO. Forward 2. Decoding GIVEN an HMM M, and a sequence x , FIND the sequence y of states that maximizes, e.g., P(y | x , M), or the most probable subsequence of states ALGO. Viterbi, Forward-backward 3. Learning GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ) ALGO. Baum-Welch (EM)
  • 40. Eric Xing © Eric Xing @ CMU, 2006-2010 40 Applications of HMMs  Some early applications of HMMs  finance, but we never saw them  speech recognition  modelling ion channels  In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.  Some current applications of HMMs to biology  mapping chromosomes  aligning biological sequences  predicting sequence structure  inferring evolutionary relationships  finding genes in DNA sequence
  • 41. Eric Xing © Eric Xing @ CMU, 2006-2010 41 The Forward Algorithm
     We want to calculate P(x), the likelihood of x, given the HMM.
     Sum over all possible ways of generating x: $p(x) = \sum_y p(x, y) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^T a_{y_{t-1}, y_t} \prod_{t=1}^T p(x_t \mid y_t)$
     To avoid summing over an exponential number of paths y, define the forward probability: $\alpha_t^k \overset{\text{def}}{=} \alpha(y_t^k = 1) = P(x_1, \ldots, x_t, y_t^k = 1)$
     The recursion: $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
     Termination: $P(x) = \sum_k \alpha_T^k$
  • 42. Eric Xing © Eric Xing @ CMU, 2006-2010 42 The Forward Algorithm – derivation
     Compute the forward probability:
     $\alpha_t^k = P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1)$
     $= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1}, x_1, \ldots, x_{t-1})\, P(x_t \mid y_t^k = 1, x_1, \ldots, x_{t-1}, y_{t-1})$
     $= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1})\, P(x_t \mid y_t^k = 1)$
     $= P(x_t \mid y_t^k = 1) \sum_i P(x_1, \ldots, x_{t-1}, y_{t-1}^i = 1)\, P(y_t^k = 1 \mid y_{t-1}^i = 1)$
     $= P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
     (Chain rule: $P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)$.)
  • 43. Eric Xing © Eric Xing @ CMU, 2006-2010 43 The Forward Algorithm
     We can compute $\alpha_t^k$ for all k, t using dynamic programming!
     Initialization: $\alpha_1^k = P(x_1, y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, P(y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, \pi_k$
     Iteration: $\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
     Termination: $P(x) = \sum_k \alpha_T^k$
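A direct transcription of this dynamic program into NumPy might look as follows (my own sketch; pi, A, B are the start, transition, and emission arrays from the casino sketch above, and x is a sequence of 0-based observation indices). The unscaled recursion underflows on long sequences, which slide 51 returns to.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_t, y_t = k); returns alpha and the likelihood P(x)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                      # initialization: alpha_1^k = pi_k * P(x_1 | y_1 = k)
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # alpha_t^k = P(x_t | k) * sum_i alpha_{t-1}^i a_{i,k}
    return alpha, alpha[-1].sum()                   # termination: P(x) = sum_k alpha_T^k
```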
  • 44. Eric Xing © Eric Xing @ CMU, 2006-2010 44 The Backward Algorithm
     We want to compute $P(y_t^k = 1 \mid x)$, the posterior probability distribution on the t-th position, given x.
     We start by computing $P(y_t^k = 1, x) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T) = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1) = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
     i.e., the product of the forward probability $\alpha_t^k$ and the backward probability $\beta_t^k \overset{\text{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
     The recursion: $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
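A matching backward pass, again only as an illustrative sketch under the same array conventions as the forward() function above:

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))                              # base case: beta_T^k = 1 for all k
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])    # beta_t^k = sum_i a_{k,i} P(x_{t+1} | i) beta_{t+1}^i
    return beta
```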
  • 45. Eric Xing © Eric Xing @ CMU, 2006-2010 45 Example:
     The dishonest casino model of slide 32 (FAIR ↔ LOADED, stay probability 0.95, switch probability 0.05, emissions as given there), with x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4.
     Forward: $\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
     Backward: $\beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
  • 46. Eric Xing © Eric Xing @ CMU, 2006-2010 46 Alpha and Beta (actual values)
     Same model and roll sequence as the previous slide; columns are positions t = 1..10, rows give the Fair (F) and Loaded (L) entries.
     t:     1       2       3       4       5       6       7       8       9       10
     α(F):  0.0833  0.0136  0.0022  0.0004  0.0001  0.0000  0.0000  0.0000  0.0000  0.0000
     α(L):  0.0500  0.0052  0.0006  0.0001  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     β(F):  0.0000  0.0000  0.0000  0.0000  0.0001  0.0007  0.0045  0.0264  0.1633  1.0000
     β(L):  0.0000  0.0000  0.0000  0.0000  0.0001  0.0006  0.0055  0.0112  0.1033  1.0000
  • 47. Eric Xing © Eric Xing @ CMU, 2006-2010 47 Alpha and Beta (logs)
     Same model and roll sequence; natural logs of the forward and backward probabilities, t = 1..10.
     log α(F):  -2.4849  -4.2969  -6.1201  -7.9499   -9.7834  -11.5905  -13.4110  -15.2391  -17.0310  -18.8430
     log α(L):  -2.9957  -5.2655  -7.4896  -9.6553  -10.1454  -12.4264  -14.6657  -15.2407  -17.5432  -19.8129
     log β(F): -16.2439 -14.4185 -12.6028 -10.8042   -9.0373   -7.2181   -5.4135   -3.6352   -1.8120   0
     log β(L): -17.2014 -14.9922 -12.7337 -10.4389   -9.7289   -7.4833   -5.1977   -4.4938   -2.2698   0
  • 48. Eric Xing © Eric Xing @ CMU, 2006-2010 48 What is the probability of a hidden state prediction?
  • 49. Eric Xing © Eric Xing @ CMU, 2006-2010 49 Posterior decoding
     We can now calculate $P(y_t^k = 1 \mid x) = \dfrac{P(y_t^k = 1, x)}{P(x)} = \dfrac{\alpha_t^k \beta_t^k}{P(x)}$
     Then, we can ask: what is the most likely state at position t of sequence x: $k_t^* = \arg\max_k P(y_t^k = 1 \mid x)$
     Note that this is an MPA of a single hidden state; what if we want an MPA of a whole hidden state sequence?
     Posterior decoding: $\{ y_t^{k_t^*} = 1 : t = 1 \ldots T \}$
     This is different from the MPA of a whole sequence of hidden states. This can be understood as bit error rate vs. word error rate.
     Example: MPA of X? MPA of (X, Y)?
     x  y  P(x,y)
     0  0  0.35
     0  1  0.05
     1  0  0.30
     1  1  0.30
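Combining the two passes gives posterior decoding. A small sketch of my own, reusing the forward() and backward() functions defined in the earlier code blocks:

```python
def posterior_decode(x, pi, A, B):
    """Marginal posteriors gamma[t, k] = P(y_t = k | x) and the per-position argmax (MPA of single states)."""
    alpha, px = forward(x, pi, A, B)     # from the sketch after slide 43
    beta = backward(x, A, B)             # from the sketch after slide 44
    gamma = alpha * beta / px            # gamma_t^k = alpha_t^k * beta_t^k / P(x)
    return gamma, gamma.argmax(axis=1)   # most likely state at each position t
```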
  • 50. Eric Xing © Eric Xing @ CMU, 2006-2010 50 Viterbi decoding
     GIVEN x = x1, ..., xT, we want to find y = y1, ..., yT such that P(y|x) is maximized: $y^* = \arg\max_y P(y \mid x) = \arg\max_y P(y, x)$
     Let $V_t^k$ = probability of the most likely sequence of states ending at state $y_t = k$: $V_t^k = \max_{y_1, \ldots, y_{t-1}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$
     The recursion: $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
     (Recall $p(x_1, \ldots, x_T, y_1, \ldots, y_T) = \pi_{y_1} b_{y_1, x_1} \prod_{t=2}^T a_{y_{t-1}, y_t} b_{y_t, x_t}$; the slide illustrates this with a trellis over states 1..K and positions $x_1 \ldots x_N$.)
     Underflows are a significant problem: these numbers become extremely small. Solution: take the logs of all values: $\log V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + \log V_{t-1}^i \big)$
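A log-space Viterbi sketch (my own illustration, same array conventions as above), which also keeps back-pointers so the best path can be traced back:

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state path arg max_y P(y, x), computed in log space to avoid underflow."""
    T, K = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    V = np.zeros((T, K))                      # V[t, k] = log prob of the best path ending in state k at time t
    back = np.zeros((T, K), dtype=int)        # back-pointers for path recovery
    V[0] = np.log(pi) + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA     # scores[i, k] = V[t-1, i] + log a_{i,k}
        back[t] = scores.argmax(axis=0)
        V[t] = logB[:, x[t]] + scores.max(axis=0)
    # Trace back the best path from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path, V[-1].max()
```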
  • 51. Eric Xing © Eric Xing @ CMU, 2006-2010 51 Computational Complexity and implementation details
     What is the running time, and space required, for Forward and Backward? Time: $O(K^2 N)$; Space: $O(KN)$.
     Useful implementation techniques to avoid underflows:
     Viterbi: sum of logs
     Forward/Backward: rescaling at each position by multiplying by a constant
     Recursions: $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$, $\ \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$, $\ V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
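The rescaling trick can be written as a small variant of the forward pass: each alpha vector is normalized to sum to one, and the logs of the normalizers accumulate to log P(x). A sketch of my own, under the same conventions as the earlier forward() function:

```python
import numpy as np

def forward_scaled(x, pi, A, B):
    """Forward pass with per-position rescaling; returns the scaled alphas and log P(x)."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    log_px = 0.0
    for t in range(T):
        a = pi * B[:, x[0]] if t == 0 else B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c = a.sum()                    # scaling constant c_t (sum of the unscaled column)
        alpha_hat[t] = a / c
        log_px += np.log(c)            # log P(x) = sum_t log c_t
    return alpha_hat, log_px
```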
  • 52. Eric Xing © Eric Xing @ CMU, 2006-2010 52 (Homework!) Learning HMM
     Given $x = x_1 \ldots x_N$ for which the true state path $y = y_1 \ldots y_N$ is known, i.e., $\{ (x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N \}$,
     Define: $A_{ij}$ = # times state transition $i \to j$ occurs in y; $B_{ik}$ = # times state i in y emits k in x
     We can show that the maximum likelihood parameters $\theta$ are:
     $a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \bullet)} = \dfrac{\sum_n \sum_{t=2}^T y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^T y_{n,t-1}^i} = \dfrac{A_{ij}}{\sum_{j'} A_{ij'}}$
     $b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \bullet)} = \dfrac{\sum_n \sum_{t=1}^T y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^T y_{n,t}^i} = \dfrac{B_{ik}}{\sum_{k'} B_{ik'}}$
     where $\ell(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^T p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^T p(x_{n,t} \mid y_{n,t}) \Big)$ (Homework!)
     What if y is continuous? We can treat the $\{ (x_{n,t}, y_{n,t}) \}$ as $N \times T$ observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian ...
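These estimators amount to tallying transitions and emissions in the labelled sequences. A minimal sketch of my own (it assumes every state actually occurs in the training data, so no denominator is zero):

```python
import numpy as np

def supervised_mle(xs, ys, n_symbols, n_states):
    """Count-based MLE of a_ij and b_ik from fully labelled sequences.
    xs, ys: lists of integer arrays (observations in 0..n_symbols-1, states in 0..n_states-1)."""
    A_cnt = np.zeros((n_states, n_states))
    B_cnt = np.zeros((n_states, n_symbols))
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A_cnt[y[t - 1], y[t]] += 1          # A_ij = # times transition i -> j occurs
        for t in range(len(y)):
            B_cnt[y[t], x[t]] += 1              # B_ik = # times state i emits symbol k
    a = A_cnt / A_cnt.sum(axis=1, keepdims=True)
    b = B_cnt / B_cnt.sum(axis=1, keepdims=True)
    return a, b
```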
  • 53. Eric Xing © Eric Xing @ CMU, 2006-2010 53 Unsupervised ML estimation
     Given $x = x_1 \ldots x_N$ for which the true state path $y = y_1 \ldots y_N$ is unknown,
     EXPECTATION MAXIMIZATION
     0. Starting with our best guess of a model M, parameters $\theta$:
     1. Estimate $A_{ij}$, $B_{ik}$ in the training data: $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle$, $B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$. How? (homework)
     2. Update $\theta$ according to $A_{ij}$, $B_{ik}$. Now a "supervised learning" problem.
     3. Repeat 1 & 2, until convergence.
     This is called the Baum-Welch Algorithm. We can get to a provably more (or equally) likely parameter set $\theta$ each iteration.
  • 54. Eric Xing © Eric Xing @ CMU, 2006-2010 54 The Baum-Welch algorithm
     The complete log likelihood: $\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^T p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^T p(x_{n,t} \mid y_{n,t}) \Big)$
     The expected complete log likelihood: $\langle \ell_c(\theta; x, y) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^T \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^T \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$
     EM
     The E step: $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)$, $\ \xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)$
     The M step ("symbolically" identical to MLE): $\pi_i^{ML} = \dfrac{\sum_n \gamma_{n,1}^i}{N}$, $\ a_{ij}^{ML} = \dfrac{\sum_n \sum_{t=2}^T \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$, $\ b_{ik}^{ML} = \dfrac{\sum_n \sum_{t=1}^T \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^T \gamma_{n,t}^i}$
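Putting the E-step (forward-backward) and M-step together gives one possible Baum-Welch loop. The sketch below is my own unscaled illustration (so it is only suitable for short sequences) and reuses the forward() and backward() functions defined in the earlier code blocks:

```python
import numpy as np

def baum_welch(xs, pi, A, B, n_iters=20):
    """EM for HMM parameters over a list of observation-index arrays xs (unscaled forward-backward)."""
    M, K = B.shape
    for _ in range(n_iters):
        pi_num = np.zeros(M)
        A_num, A_den = np.zeros((M, M)), np.zeros(M)
        B_num, B_den = np.zeros((M, K)), np.zeros(M)
        for x in xs:
            alpha, px = forward(x, pi, A, B)
            beta = backward(x, A, B)
            gamma = alpha * beta / px                    # gamma[t, i] = P(y_t = i | x)
            # xi[t, i, j] = P(y_t = i, y_{t+1} = j | x)
            xi = (alpha[:-1, :, None] * A[None] *
                  (B[:, x[1:]].T * beta[1:])[:, None, :]) / px
            pi_num += gamma[0]
            A_num += xi.sum(axis=0)
            A_den += gamma[:-1].sum(axis=0)
            for t, o in enumerate(x):
                B_num[:, o] += gamma[t]
            B_den += gamma.sum(axis=0)
        # M-step: expected counts plugged into the supervised MLE formulas
        pi = pi_num / len(xs)
        A = A_num / A_den[:, None]
        B = B_num / B_den[:, None]
    return pi, A, B
```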
  • 55. Eric Xing © Eric Xing @ CMU, 2006-2010 55 Summary
     Modeling hidden transitional trajectories (in a discrete state space, such as cluster labels, DNA copy number, dice choice, etc.) underlying observed sequence data (discrete, such as dice outcomes; or continuous, such as CGH signals)
     Useful for parsing and segmenting sequential data
     Important HMM computations:
     The joint likelihood of a parse and data can be written as a product of local terms (i.e., initial prob, transition prob, emission prob)
     Computing the marginal likelihood of the observed sequence: forward algorithm
     Predicting a single hidden state: forward-backward
     Predicting an entire sequence of hidden states: Viterbi
     Learning HMM parameters: an EM algorithm known as Baum-Welch