
# Hidden Markov Models


### A Markov System

(Copyright © Andrew W. Moore)

- Has N states, called s1, s2 .. sN.
- There are discrete timesteps, t=0, t=1, …
- On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}.
- Between each timestep, the next state is chosen randomly.

Example with N=3: at t=0 the current state is qt = q0 = s3; at t=1 it is qt = q1 = s2.
The current state determines the probability distribution for the next state, often notated with arcs between states. For the N=3 example:

- P(qt+1=s1 | qt=s1) = 0,   P(qt+1=s2 | qt=s1) = 0,   P(qt+1=s3 | qt=s1) = 1
- P(qt+1=s1 | qt=s2) = 1/2, P(qt+1=s2 | qt=s2) = 1/2, P(qt+1=s3 | qt=s2) = 0
- P(qt+1=s1 | qt=s3) = 1/3, P(qt+1=s2 | qt=s3) = 2/3, P(qt+1=s3 | qt=s3) = 0
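The transition rule "between each timestep, the next state is chosen randomly" can be made concrete with a short simulation. This is a minimal sketch, not from the slides: the 0-based state indices and the function names `step` and `run` are my own choices.

```python
import random

# Transition table for the 3-state example: row i holds P(next = s_j | current = s_i).
# States s1, s2, s3 are indexed 0, 1, 2.
A = [
    [0.0, 0.0, 1.0],    # from s1: always move to s3
    [1/2, 1/2, 0.0],    # from s2: s1 or s2, 50-50
    [1/3, 2/3, 0.0],    # from s3: s1 with prob 1/3, s2 with prob 2/3
]

def step(state: int) -> int:
    """Sample the next state from the current state's transition row."""
    return random.choices(range(3), weights=A[state])[0]

def run(start: int, timesteps: int) -> list[int]:
    """Return the sampled state sequence q_0 .. q_T."""
    path = [start]
    for _ in range(timesteps):
        path.append(step(path[-1]))
    return path
```

Note that only the current state is consulted when sampling the next one, which is exactly the Markov property introduced on the next slide.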
### Markov Property

qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt. In other words:

P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)

Question: what would be the best Bayes Net structure to represent the joint distribution of (q0, q1, q2, q3, q4)?

Answer: a simple chain q0 → q1 → q2 → q3 → q4.
For each state si, the row (ai1, ai2, … aiN) gives the probability distribution of the next state, and each of these probability tables is identical at every timestep. Notation:

aij = P(qt+1 = sj | qt = si)

### A Blind Robot

A human and a robot wander around randomly on a grid.

STATE q = ⟨Location of Robot, Location of Human⟩

Note: with 18 cells, N (num. states) = 18 × 18 = 324.
### Dynamics of the System

Each timestep the human moves randomly to an adjacent cell, and the robot also moves randomly to an adjacent cell.

Typical questions:
- "What's the expected time until the human is crushed like a bug?"
- "What's the probability that the robot will hit the left wall before it hits the human?"
- "What's the probability the robot crushes the human on the next timestep?"

### Example Question

"It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?"

- If the robot is blind: we can compute this in advance. (We'll do this first.)
- If the robot is omnipotent (i.e. if the robot knows the state at time t): we can compute it directly. (Too easy; we won't do this.)
- If the robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (The main body of the lecture.)
### What is P(qt = s)? Slow, stupid answer

Step 1: work out how to compute P(Q) for any path Q = q1 q2 q3 .. qt. Given that we know the start state q1 (i.e. P(q1) = 1):

P(q1 q2 .. qt) = P(q1 q2 .. qt-1) P(qt | q1 q2 .. qt-1)
             = P(q1 q2 .. qt-1) P(qt | qt-1)   (WHY? The Markov property.)
             = P(q2 | q1) P(q3 | q2) … P(qt | qt-1)

Step 2: use this knowledge to get P(qt = s):

P(qt = s) = Σ P(Q), summed over all paths Q of length t that end in s.

This computation is exponential in t.

### What is P(qt = s)? Clever answer

For each state si, define pt(i) = P(qt = si), the probability that the state is si at time t. This is easy to define inductively:

∀i p0(i) = ?
∀j pt+1(j) = P(qt+1 = sj) = ?
The base case:

p0(i) = 1 if si is the start state, 0 otherwise.

For the inductive step, sum over the possible predecessor states:

pt+1(j) = P(qt+1 = sj) = Σi=1..N P(qt+1 = sj ∧ qt = si)
Expanding with the chain rule, and remembering that aij = P(qt+1 = sj | qt = si):

pt+1(j) = Σi=1..N P(qt+1 = sj ∧ qt = si)
        = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si)
        = Σi=1..N aij pt(i)

The computation is simple: just fill in a table whose rows are t = 0, 1, …, tfinal and whose columns are pt(1), pt(2), … pt(N), in that order. Row 0 is all zeros except for a 1 in the start state's column.
- The cost of computing pt(i) for all states si is now O(t N²). The stupid way was O(N^t).
- This was a simple example, meant to warm you up to this trick, called Dynamic Programming, because HMMs do many tricks like this.
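The table-filling procedure above can be sketched in a few lines. This is a minimal illustration under my own conventions (0-based indices, the function name `state_distribution`); it uses the 3-state example's transition matrix with s3 as the start state.

```python
# Transition matrix of the 3-state example; row i is the distribution of the
# next state given the current state s_{i+1}.
A = [[0.0, 0.0, 1.0],
     [1/2, 1/2, 0.0],
     [1/3, 2/3, 0.0]]

def state_distribution(A, start, T):
    """Return p_T, where p_t(i) = P(q_t = s_i).

    Fills the table row by row: O(T N^2) work instead of O(N^T)."""
    N = len(A)
    p = [1.0 if i == start else 0.0 for i in range(N)]   # p_0: start state known
    for _ in range(T):
        # p_{t+1}(j) = sum_i a_ij * p_t(i)
        p = [sum(A[i][j] * p[i] for i in range(N)) for j in range(N)]
    return p
```

For example, starting in s3, `state_distribution(A, 2, 1)` recovers the distribution (1/3, 2/3, 0) over the next state, matching the arcs out of s3.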
### Hidden State

- The previous example tried to estimate P(qt = si) unconditionally, using no observed evidence.
- Suppose we can observe something that's affected by the true state.
- Example: proximity sensors, which tell us the contents of the 8 adjacent squares ("W" denotes a wall). The true state is qt; what the robot sees is the observation Ot.

### Noisy Hidden State

- Example: noisy proximity sensors, which unreliably tell us the contents of the 8 adjacent squares. The true state qt yields an uncorrupted observation, but what the robot actually sees, the observation Ot, may be corrupted.
Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, … q1, q0, Ot-1, Ot-2, … O1, O0} given qt. In other words:

P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)

Question: what'd be the best Bayes Net structure to represent the joint distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
Answer: the chain q0 → q1 → q2 → q3 → q4, with each observation Ot a child of its state qt.

Notation:

bi(k) = P(Ot = k | qt = si)

These form an N × M observation table: one row per state si, one column per observation symbol k = 1 … M, so row i is (bi(1), bi(2), … bi(M)).
### Hidden Markov Models

Our robot with noisy sensors is a good example of an HMM.

- Question 1: State Estimation. What is P(qT = Si | O1 O2 … OT)? It will turn out that a new cute D.P. trick will get this for us.
- Question 2: Most Probable Path. Given O1 O2 … OT, what is the most probable path that I took? And what is that probability? Yet another famous D.P. trick, the VITERBI algorithm, gets this.
- Question 3: Learning HMMs. Given O1 O2 … OT, what is the maximum likelihood HMM that could have produced this string of observations? Very, very useful. Uses the E.M. algorithm.

### Are HMMs Useful?

You bet!

- Robot planning + sensing when there's uncertainty (e.g. Reid Simmons / Sebastian Thrun / Sven Koenig)
- Speech recognition/understanding: phones → words, signal → phones
- Human Genome Project (complicated stuff your lecturer knows nothing about)
- Consumer decision modeling
- Economics & finance

Plus at least 5 other things I haven't thought of.
### Some Famous HMM Tasks

- Question 1: State Estimation. What is P(qT = Si | O1 O2 … OT)?
- Question 2: Most Probable Path. Given O1 O2 … OT, what is the most probable path that I took? (The slides illustrate this with observations like "Woke up at 8.35, Got on Bus at 9.46, Sat in lecture 10.05-11.22 …" explained by a trellis over hidden states Bus, Eat, and Walk, with transition probabilities such as aAB, aBB and emission probabilities such as bA(Ot-1), bB(Ot), bC(Ot+1).)
- Question 3: Learning HMMs. Given O1 O2 … OT, what is the maximum likelihood HMM that could have produced this string of observations?

### Basic Operations in HMMs

For an observation sequence O = O1 … OT, the three basic HMM operations are:

| Problem | Algorithm | Complexity |
|---|---|---|
| Evaluation: calculating P(qt = Si \| O1 O2 … Ot) | Forward-Backward | O(TN²) |
| Inference: computing Q* = argmaxQ P(Q\|O) | Viterbi decoding | O(TN²) |
| Learning: computing λ* = argmaxλ P(O\|λ) | Baum-Welch (EM) | O(TN²) |

(T = # timesteps, N = # states)
### HMM Notation (from Rabiner's survey*)

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989. Available from http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626

- The states are labeled S1 S2 .. SN.
- For a particular trial, let T be the number of observations. T is also the number of states passed through.
- O = O1 O2 .. OT is the sequence of observations.
- Q = q1 q2 .. qT is the notation for a path of states.
- λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is the specification of an HMM.

### HMM Formal Definition

An HMM, λ, is a 5-tuple consisting of:

- N, the number of states.
- M, the number of possible observations.
- {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi. (This is new: in our previous example, the start state was deterministic.)
- The N × N state transition probabilities: aij = P(qt+1 = Sj | qt = Si).
- The N × M observation probabilities: bi(k) = P(Ot = k | qt = Si).
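The 5-tuple λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is easy to carry around as a small record. The following sketch is my own packaging, not notation from the slides; N and M are derived from the table shapes rather than stored separately.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """lambda = <N, M, pi, A, B> from the formal definition."""
    pi: list[float]           # pi[i]   = P(q_0 = S_i); length N
    A: list[list[float]]      # A[i][j] = P(q_{t+1} = S_j | q_t = S_i); N x N
    B: list[list[float]]      # B[i][k] = P(O_t = k | q_t = S_i); N x M

    @property
    def N(self) -> int:
        return len(self.pi)

    @property
    def M(self) -> int:
        return len(self.B[0])
```

Every row of `pi`, `A`, and `B` should be a probability distribution (non-negative, summing to 1), which is a useful sanity check when building a model by hand.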
### Here's an HMM

N = 3 states; M = 3 output symbols (X, Y, Z). S1 can emit X or Y, S2 can emit Z or Y, S3 can emit Z or X. Start randomly in state 1 or 2, and choose one of the output symbols in each state at random.

π1 = 1/2   π2 = 1/2   π3 = 0

a11 = 0     a12 = 1/3   a13 = 2/3
a21 = 1/3   a22 = 0     a23 = 2/3
a31 = 1/3   a32 = 1/3   a33 = 1/3

b1(X) = 1/2   b1(Y) = 1/2   b1(Z) = 0
b2(X) = 0     b2(Y) = 1/2   b2(Z) = 1/2
b3(X) = 1/2   b3(Y) = 0     b3(Z) = 1/2

Let's generate a sequence of observations:

1. 50-50 choice between S1 and S2: q0 = S1.
2. In S1, 50-50 choice between X and Y: O0 = X.
3. Go to S3 with probability 2/3 or S2 with probability 1/3: q1 = S3.
4. In S3, 50-50 choice between Z and X: O1 = X.
5. Each of the three next states is equally likely: q2 = S3.
6. 50-50 choice between Z and X: O2 = Z.

Result: q0 q1 q2 = S1 S3 S3 and O0 O1 O2 = X X Z.
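The generation procedure just walked through can be sketched as a sampler for this specific HMM. The 0-based indices and the function name `generate` are my own conventions; the tables are the ones from the slide.

```python
import random

# The example HMM: states S1..S3 (indices 0..2), symbols X, Y, Z (indices 0..2).
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3],
     [1/3, 0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0],     # S1 emits X or Y
     [0, 1/2, 1/2],     # S2 emits Y or Z
     [1/2, 0, 1/2]]     # S3 emits X or Z
SYMBOLS = "XYZ"

def generate(T):
    """Sample (q_0..q_{T-1}, O_0..O_{T-1}): draw a start state from pi,
    then alternate emitting from B and stepping from A."""
    q = random.choices(range(3), weights=pi)[0]
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(SYMBOLS[random.choices(range(3), weights=B[q])[0]])
        q = random.choices(range(3), weights=A[q])[0]
    return states, obs
```

Because π3 = 0, a sampled trajectory never starts in S3, and each state only ever emits its two legal symbols, just as in the worked trace above.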
### State Estimation

This is what the observer has to work with: O0 O1 O2 = X X Z, with the states q0, q1, q2 unknown.

### Prob. of a series of observations

What is P(O) = P(O1 O2 O3) = P(O1 = X ∧ O2 = X ∧ O3 = Z)?

Slow, stupid way:

P(O) = Σ P(O ∧ Q) = Σ P(O | Q) P(Q), summed over all paths Q of length 3.

How do we compute P(Q) for an arbitrary path Q? How do we compute P(O|Q) for an arbitrary path Q?
Computing P(Q):

P(Q) = P(q1, q2, q3)
     = P(q1) P(q2, q3 | q1)   (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2)   (why? The Markov property.)

Example in the case Q = S1 S3 S3: P(Q) = 1/2 × 2/3 × 1/3 = 1/9.

Computing P(O|Q):

P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
       = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why? Observations are conditionally independent given the states.)

Example in the case Q = S1 S3 S3: P(O|Q) = P(X|S1) P(X|S3) P(Z|S3) = 1/2 × 1/2 × 1/2 = 1/8.
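The "slow, stupid way" enumerates all 3³ = 27 paths. A minimal sketch of that enumeration (my own function names; 0-based indices, so the symbol sequence X X Z is `(0, 0, 2)` and the path S1 S3 S3 is `(0, 2, 2)`) reproduces the worked numbers 1/9 and 1/8:

```python
from itertools import product

pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]

def path_prob(Q):
    """P(Q) = pi_{q1} * product of a_{q_t, q_{t+1}}."""
    p = pi[Q[0]]
    for i, j in zip(Q, Q[1:]):
        p *= A[i][j]
    return p

def obs_given_path(O, Q):
    """P(O|Q) = product of b_{q_t}(O_t)."""
    p = 1.0
    for o, q in zip(O, Q):
        p *= B[q][o]
    return p

def brute_force_P(O):
    """P(O) summed over all N^T paths -- exponential, for checking only."""
    return sum(obs_given_path(O, Q) * path_prob(Q)
               for Q in product(range(3), repeat=len(O)))
```

Summing over all 27 paths gives P(X X Z) = 1/36, which the forward algorithm later in the lecture recovers in O(TN²) time.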
This would need 27 P(Q) computations and 27 P(O|Q) computations. A sequence of 20 observations would need 3^20 ≈ 3.5 billion P(O|Q) computations and 3.5 billion P(Q) computations. So let's be smarter…

### The Prob. of a given series of observations, non-exponential-cost-style

Given observations O1 O2 … OT, define

αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ), where 1 ≤ t ≤ T.

αt(i) is the probability that, in a random trial:

- we'd have seen the first t observations, and
- we'd have ended up in Si as the t'th state visited.

In our example, what is α2(3)?
### αt(i): easy to define recursively

αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)

(αt(i) can be defined stupidly by considering all paths of length t. How?)

Base case:

α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)

Recursive step:

αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(O1 O2 … Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(Ot+1, qt+1 = Sj | O1 O2 … Ot ∧ qt = Si) P(O1 O2 … Ot ∧ qt = Si)
        = Σi=1..N P(Ot+1, qt+1 = Sj | qt = Si) αt(i)
        = Σi=1..N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
        = Σi=1..N aij bj(Ot+1) αt(i)
### In our example

αt(i) = P(O1 O2 .. Ot ∧ qt = Si | λ)
α1(i) = bi(O1) πi
αt+1(j) = Σi aij bj(Ot+1) αt(i)

We saw O1 O2 O3 = X X Z:

α1(1) = 1/4   α1(2) = 0      α1(3) = 0
α2(1) = 0     α2(2) = 0      α2(3) = 1/12
α3(1) = 0     α3(2) = 1/72   α3(3) = 1/72

### Easy Question

We can cheaply compute αt(i) = P(O1 O2 … Ot ∧ qt = Si).

(How) can we cheaply compute P(O1 O2 … Ot)?

(How) can we cheaply compute P(qt = Si | O1 O2 … Ot)?
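The α recursion above translates directly into a table-filling loop. This is a minimal sketch under my own conventions (0-based indices, the function name `forward`); run on the example's tables with O = X X Z it reproduces the α values just listed.

```python
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]

def forward(O):
    """Return the table alpha[t][i] = P(O_1..O_{t+1} and q_{t+1} = S_{i+1}).

    O(T N^2): each row is computed from the previous one."""
    N = len(pi)
    alpha = [[B[i][O[0]] * pi[i] for i in range(N)]]      # alpha_1(i) = b_i(O_1) pi_i
    for o in O[1:]:
        prev = alpha[-1]
        # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i a_ij alpha_t(i)
        alpha.append([B[j][o] * sum(A[i][j] * prev[i] for i in range(N))
                      for j in range(N)])
    return alpha
```

Summing the final row answers the "easy question": P(O1 O2 O3) = 0 + 1/72 + 1/72 = 1/36, and normalizing that row gives P(qt = Si | O1 … Ot).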
Answers:

P(O1 O2 … Ot) = Σi=1..N αt(i)

P(qt = Si | O1 O2 … Ot) = αt(i) / Σj=1..N αt(j)

### Most probable path given observations

What's the most probable path given O1 O2 … OT? That is, what is argmaxQ P(Q | O1 O2 … OT)?

Slow, stupid answer:

argmaxQ P(Q | O1 O2 … OT) = argmaxQ P(O1 O2 … OT | Q) P(Q) / P(O1 O2 … OT)
                          = argmaxQ P(O1 O2 … OT | Q) P(Q)
### Efficient MPP computation

We're going to compute the following variables:

δt(i) = max over q1 q2 .. qt-1 of P(q1 q2 .. qt-1 ∧ qt = Si ∧ O1 .. Ot)

i.e. the probability of the path of length t-1 with the maximum chance of doing all of these things:

- OCCURRING, and
- ENDING UP IN STATE Si, and
- PRODUCING OUTPUT O1 … Ot

DEFINE: mppt(i) = that path. So δt(i) = Prob(mppt(i)).

### The Viterbi Algorithm

δt(i) = max over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)
mppt(i) = argmax over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)

Base case (there's only one choice):

δ1(i) = P(q1 = Si ∧ O1) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)

Now, suppose we have all the δt(i)'s and mppt(i)'s for all i. How do we get δt+1(j) and mppt+1(j)?
The most probable path whose last two states are Si Sj is the most probable path to Si, followed by the transition Si → Sj.

What is the probability of that path?

δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)

SO: the most probable path to Sj has Si* as its penultimate state, where

i* = argmaxi δt(i) aij bj(Ot+1)
Summary, with i* defined as above:

δt+1(j) = δt(i*) ai*j bj(Ot+1)
mppt+1(j) = mppt(i*) Si*

### What's Viterbi used for?

Classic example: speech recognition. Signal → words. In the HMM, the observable is the signal and the hidden state is part of word formation. What is the most probable word given this signal?

UTTERLY GROSS SIMPLIFICATION: in practice there are many levels of inference, not one big jump.
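The δ recursion with back-pointers can be sketched compactly. This is a minimal illustration under my own conventions (0-based indices, the function name `viterbi`, ties broken by lowest state index); it uses the example HMM's tables, so for O = X X Z (encoded `(0, 0, 2)`) it finds a most probable path of probability 1/72, matching the hand computation P(Q) P(O|Q) for paths through S1 S3.

```python
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]

def viterbi(O):
    """Return (most probable path, its joint probability P(Q and O))."""
    N = len(pi)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]        # delta_1(i) = pi_i b_i(O_1)
    back = []                                             # back-pointers i* per step
    for o in O[1:]:
        # For each destination j, pick the best penultimate state i*.
        ptr = [max(range(N), key=lambda i: delta[i] * A[i][j]) for j in range(N)]
        delta = [delta[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(N)]
        back.append(ptr)
    # Walk the back-pointers from the best final state.
    best = max(range(N), key=lambda i: delta[i])
    prob = delta[best]
    path = [best]
    for ptr in reversed(back):
        best = ptr[best]
        path.append(best)
    return path[::-1], prob
```

Note that X X Z actually has two most probable explanations here (S1 S3 S2 and S1 S3 S3, each with probability 1/72); the sketch returns whichever its tie-break prefers.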
### HMMs are used and useful

But how do you design an HMM? Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in speech or genetics, it is better to infer it from large amounts of data: O1 O2 .. OT with a big "T".

### Inferring an HMM

Remember, we've been doing things like P(O1 O2 .. OT | λ). That "λ" is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them. As usual, we could use:

(i) MAX LIKELIHOOD: λ* = argmaxλ P(O1 .. OT | λ)

(ii) BAYES: work out P(λ | O1 .. OT), and then take E[λ] or maxλ P(λ | O1 .. OT)