- 1. Markov Models & Hidden Markov Models
- 2. Time-based Models
  - Simple parametric distributions are typically based on the "independence assumption": each data point is independent of the others, and there is no time-sequencing or ordering.
  - What if the data has correlations based on its order, as in a time series?
- 3. States
  - An atomic event is an assignment to every random variable in the domain.
  - States are atomic events; the system can transition from one state to another.
  - Suppose a model has n states.
  - A state-transition diagram describes how the model behaves.
- 4. State transition: the model makes the following assumptions
  - Transition probabilities are stationary.
  - The event space does not change over time.
  - The probability distribution over next states depends only on the current state.
- 5. State transition: the model makes the following assumptions
  - Transition probabilities are stationary.
  - The event space does not change over time.
  - The probability distribution over next states depends only on the current state (the Markov assumption).
- 6. Markov random processes
  - A random sequence has the Markov property if the distribution of its next state is determined solely by its current state.
  - Any random process having this property is called a Markov random process.
  - A system with states that obey the Markov assumption is called a Markov model.
  - A sequence of states resulting from such a model is called a Markov chain.
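- 7. Chain Rule & Markov Property
  - Chain rule (repeated application of the product rule):
    - $P(q_t, q_{t-1}, \ldots, q_1) = P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1}, \ldots, q_1)$
    - $= P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1} \mid q_{t-2}, \ldots, q_1)\, P(q_{t-2}, \ldots, q_1)$
    - $= P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}, \ldots, q_1)$
  - Markov property: $P(q_i \mid q_{i-1}, \ldots, q_1) = P(q_i \mid q_{i-1})$ for $i > 1$
  - Combining the two: $P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_t \mid q_{t-1})$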
- 8. Markov Assumption
  - The Markov assumption states that the probability of the occurrence of word $w_i$ at time $t$ depends only on the occurrence of word $w_{i-1}$ at time $t-1$.
  - Chain rule: $P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$
  - Markov assumption: $P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$
- 9. Andrei Andreyevich Markov
  - Born: 14 June 1856 in Ryazan, Russia
  - Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
  - Markov is particularly remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
- 10. A Markov System
  - Has N states, called s1, s2, …, sN.
  - There are discrete timesteps, t=0, t=1, …
  - [Figure: three states s1, s2, s3; N=3, t=0]
- 11. A Markov System
  - Has N states, called s1, s2, …, sN.
  - There are discrete timesteps, t=0, t=1, …
  - On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, …, sN}.
  - [Figure: three states s1, s2, s3 with s3 marked as the current state; N=3, t=0, qt=q0=s3]
- 12. A Markov System
  - Has N states, called s1, s2, …, sN.
  - There are discrete timesteps, t=0, t=1, …
  - On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, …, sN}.
  - Between each timestep, the next state is chosen randomly.
  - [Figure: three states s1, s2, s3 with s2 marked as the current state; N=3, t=1, qt=q1=s2]
- 13. A Markov System (continued)
  - Has N states, called s1, s2, …, sN.
  - There are discrete timesteps, t=0, t=1, …
  - On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, …, sN}.
  - Between each timestep, the next state is chosen randomly.
  - The current state determines the probability distribution for the next state.
  - From s1: P(qt+1=s1 | qt=s1) = 0, P(qt+1=s2 | qt=s1) = 0, P(qt+1=s3 | qt=s1) = 1
  - From s2: P(qt+1=s1 | qt=s2) = 1/2, P(qt+1=s2 | qt=s2) = 1/2, P(qt+1=s3 | qt=s2) = 0
  - From s3: P(qt+1=s1 | qt=s3) = 1/3, P(qt+1=s2 | qt=s3) = 2/3, P(qt+1=s3 | qt=s3) = 0
  - [Figure: three states s1, s2, s3; N=3, t=1, qt=q1=s2]
- 14. A Markov System (continued)
  - Has N states, called s1, s2, …, sN.
  - There are discrete timesteps, t=0, t=1, …
  - On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, …, sN}.
  - Between each timestep, the next state is chosen randomly.
  - The current state determines the probability distribution for the next state.
  - From s1: P(qt+1=s1 | qt=s1) = 0, P(qt+1=s2 | qt=s1) = 0, P(qt+1=s3 | qt=s1) = 1
  - From s2: P(qt+1=s1 | qt=s2) = 1/2, P(qt+1=s2 | qt=s2) = 1/2, P(qt+1=s3 | qt=s2) = 0
  - From s3: P(qt+1=s1 | qt=s3) = 1/3, P(qt+1=s2 | qt=s3) = 2/3, P(qt+1=s3 | qt=s3) = 0
  - Often notated with arcs between states.
  - [Figure: state diagram with arcs s1→s3 (1), s2→s1 (1/2), s2→s2 (1/2), s3→s1 (1/3), s3→s2 (2/3); N=3, t=1, qt=q1=s2]
- 15. Markov Property
  - qt+1 is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt.
  - In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)
  - Notation: aij = P(qt+1 = sj | qt = si) and πi = P(q1 = si)
  - [Figure: the same three-state diagram and transition probabilities as on the previous slide; N=3, t=1, qt=q1=s2]
- 16. Markov Property
  - qt+1 is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt.
  - In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)
  - Notation:
    - Transition probability: aij = P(qt+1 = sj | qt = si)
    - Initial probability: πi = P(q1 = si)
  - [Figure: the same three-state diagram and transition probabilities as on the previous slide; N=3, t=1, qt=q1=s2]
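The three-state system on these slides can be written down directly as a transition table. The sketch below is illustrative only: the table layout, the name `path_probability`, and the example path are assumptions, while the probabilities are the ones listed on the slides, read as `A[i][j] = P(q_{t+1} = s_j | q_t = s_i)`.

```python
from fractions import Fraction as F

# Transition probabilities from the slides: A[i][j] = P(q_{t+1} = s_j | q_t = s_i).
A = {
    "s1": {"s1": F(0), "s2": F(0), "s3": F(1)},
    "s2": {"s1": F(1, 2), "s2": F(1, 2), "s3": F(0)},
    "s3": {"s1": F(1, 3), "s2": F(2, 3), "s3": F(0)},
}


def path_probability(path, transitions):
    """Probability of following `path`, given that the chain starts in path[0]."""
    prob = F(1)
    for current, nxt in zip(path, path[1:]):
        prob *= transitions[current][nxt]
    return prob


# Every row must be a probability distribution over next states.
assert all(sum(row.values()) == 1 for row in A.values())

# Hypothetical path: start in s3 (as at t=0 on the slides), then s2, s2, s1, s3.
example = path_probability(["s3", "s2", "s2", "s1", "s3"], A)
print(example)  # 2/3 * 1/2 * 1/2 * 1 = 1/6
```

By the Markov property, each factor depends only on the current state, so the path probability is a plain product of table lookups.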
- 17. Example: A Simple Markov Model For Weather Prediction
  - On any given day, the weather can be described as being in one of three states:
    - State 1: precipitation (rain, snow, hail, etc.)
    - State 2: cloudy
    - State 3: sunny
  - Transitions between states are described by the transition matrix. [Figure: transition matrix]
  - This model can then be described by a directed graph. [Figure: directed graph of the three states]
- 18. Basic Calculations
  - Example: What is the probability that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?
  - Solution: write the observation sequence as state indices.
    - O = sun, sun, sun, rain, rain, sun, cloudy, sun
    - States: 3, 3, 3, 1, 1, 3, 2, 3
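The calculation multiplies one transition probability per consecutive pair of days. The transcript does not include the transition matrix (it was a figure), so the sketch below ASSUMES the matrix from the weather example in Rabiner's 1989 tutorial, which uses the same three states, and also assumes that the first day is known to be sunny (initial probability 1). Both assumptions, and the function name, are illustrative rather than taken from the slides.

```python
# ASSUMPTION: matrix from Rabiner's tutorial weather example, not from this transcript.
# A[i][j] = P(tomorrow = state j | today = state i)
A = [
    [0.4, 0.3, 0.3],  # from precipitation
    [0.2, 0.6, 0.2],  # from cloudy
    [0.1, 0.1, 0.8],  # from sunny
]
RAIN, CLOUDY, SUNNY = 0, 1, 2


def sequence_probability(states, transitions, initial_prob=1.0):
    """P(state sequence) = P(first state) * product of one-step transitions."""
    prob = initial_prob
    for today, tomorrow in zip(states, states[1:]):
        prob *= transitions[today][tomorrow]
    return prob


# sun sun sun rain rain sun cloudy sun  ->  3 3 3 1 1 3 2 3 on the slide
O = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
p = sequence_probability(O, A)  # ASSUMPTION: day 1 is sunny with probability 1
print(f"{p:.4e}")  # 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-04
```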
- 19. From Markov To Hidden Markov
  - The previous model assumes that each state can be uniquely associated with an observable event.
    - Once an observation is made, the state of the system is trivially retrieved.
    - This model, however, is too restrictive to be of practical use for most realistic problems.
  - To make the model more flexible, we assume that the outcomes or observations of the model are a probabilistic function of each state.
    - Each state can produce a number of outputs according to its own probability distribution, and each distinct output can potentially be generated at any state.
    - These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be estimated from the sequence of observations produced by the system.
- 20. The coin-toss problem
  - To illustrate the concept of an HMM, consider the following scenario:
    - Assume that you are placed in a room with a curtain.
    - Behind the curtain there is a person performing a coin-toss experiment.
    - This person selects one of several coins and tosses it: heads (H) or tails (T).
    - The person tells you the outcome (H, T), but not which coin was used each time.
  - Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}.
    - The coins represent the states; these are hidden because you do not know which coin was tossed each time.
    - The outcome of each toss represents an observation.
    - A "likely" sequence of coins may be inferred from the observations, but this state sequence will not be unique.
- 21. The Coin Toss Example – 1 coin
  - As a result, the Markov model is observable since there is only one state.
  - In fact, we may describe the system with a deterministic model where the states are the actual observations (see figure).
  - The model parameter P(H) may be estimated from the observed proportion of heads.
  - O = H H H T T H …
  - S = 1 1 1 2 2 1 …
- 23. The Coin Toss Example – 2 coins
- 24. From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
- 25. Hidden model
  - As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail).
  - We assume the outputs are generated according to each coin's output probabilities (its tendency towards heads or tails).
- 26. Coin Toss Example
  - Hidden state variables: the L coins C1, C2, …, Ci, …, CL-1, CL
  - Output probabilities, one per coin: P1, P2, …, Pi, …, PL-1, PL
  - Observed data ("output"): heads/tails
- 27. Hidden Markov Models
  - Used when states cannot be observed directly; good for noisy data.
  - Requirements:
    - A finite number of states, each with an output probability distribution
    - State transition probabilities
    - An observed phenomenon, which can be randomly generated given the state-associated probabilities
- 28. HMM Notation (from Rabiner's survey*)
  - The states are labeled S1, S2, …, SN.
  - For a particular trial:
    - Let T be the number of observations; T is also the number of states passed through.
    - O = O1 O2 … OT is the sequence of observations.
    - Q = q1 q2 … qT is the notation for a path of states.
  - λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is the specification of an HMM.
  - *L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
- 29. HMM Formal Definition: an HMM, λ, is a 5-tuple consisting of
  - N, the number of states
  - M, the number of possible observations
  - {π1, π2, …, πN}, the starting state probabilities: P(q0 = Si) = πi
  - The state transition probabilities P(qt+1 = Sj | qt = Si) = aij, an N × N matrix with rows (a11 a12 … a1N), (a21 a22 … a2N), …, (aN1 aN2 … aNN)
  - The observation probabilities P(Ot = k | qt = Si) = bi(k), an N × M matrix with rows (b1(1) b1(2) … b1(M)), (b2(1) b2(2) … b2(M)), …, (bN(1) bN(2) … bN(M))
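One way to hold this 5-tuple in code is a small container that derives N and M from the shapes of the parameters and checks that every row is a probability distribution. This is only a sketch: the class name `HMM`, the `validate` method, and the toy numbers are all hypothetical and not part of the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """Container for lambda = <N, M, {pi_i}, {a_ij}, {b_i(k)}>."""
    pi: List[float]        # pi[i]   = P(q_0 = S_i)
    A: List[List[float]]   # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B: List[List[float]]   # B[i][k] = P(O_t = k | q_t = S_i)

    @property
    def N(self) -> int:
        return len(self.pi)

    @property
    def M(self) -> int:
        return len(self.B[0])

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError if the shapes or the row sums are inconsistent."""
        n = self.N
        if len(self.A) != n or len(self.B) != n:
            raise ValueError("A and B need one row per state")
        if any(len(row) != n for row in self.A):
            raise ValueError("A must be N x N")
        if any(len(row) != self.M for row in self.B):
            raise ValueError("B must be N x M")
        for row in [self.pi, *self.A, *self.B]:
            if any(p < 0 for p in row) or abs(sum(row) - 1.0) > tol:
                raise ValueError("each row must be a probability distribution")


# Hypothetical two-state, two-symbol model (numbers chosen for illustration).
toy = HMM(pi=[0.5, 0.5], A=[[0.9, 0.1], [0.1, 0.9]], B=[[0.5, 0.5], [0.9, 0.1]])
toy.validate()
print(toy.N, toy.M)
```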
- 30. Assumptions
  - Markov assumption: each state depends only on the previous state.
  - Stationary assumption: transition probabilities are independent of time (time-invariant).
  - Output independence: each observation depends only on the current state, not on previous observations.
- 31. The three main questions on HMMs
  - Evaluation: What is the probability that the observations were generated by a given model?
  - Decoding: Given a model and a sequence of observations, what is the most likely state sequence?
  - Learning: Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?
- 32. The three main questions on HMMs
  - 1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[ x | M ].
  - 2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ].
  - 3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ].
- 33. Let's not be confused by notation
  - P[ x | M ]: the probability that sequence x was generated by the model.
  - The model is: architecture (number of states, etc.) + parameters θ = aij, ei(.)
  - So P[ x | θ ] and P[ x ] are the same when the architecture, and the entire model, respectively, are implied.
  - Similarly, P[ x, π | M ] and P[ x, π ] are the same.
  - In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ].
- 34. HMMs
- 36. Description: Specification of an HMM
  - N: the number of states
    - Q = {q1, q2, …, qN} is the set of states
  - M: the number of symbols (observables)
    - O = {o1, o2, …, oM} is the set of symbols
- 37. Description: Specification of an HMM
  - A: the state transition probability matrix
    - aij = P(qt+1 = j | qt = i)
  - B: the observation probability distribution
    - bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M
  - π: the initial state distribution
- 41. Central problems in HMM modelling
  - Problem 1, Evaluation:
    - Probability of occurrence of a particular observation sequence, O = {o1, …, ok}, given the model: P(O | λ)
    - Complicated by the hidden states
    - Useful in sequence classification
- 42. Central problems in HMM modelling
  - Problem 2, Decoding:
    - Optimal state sequence to produce the given observations, O = {o1, …, ok}, given the model
    - Requires an optimality criterion
    - Useful in recognition problems
- 43. Central problems in HMM modelling
  - Problem 3, Learning:
    - Determine the optimum model, given a training set of observations
    - Find λ such that P(O | λ) is maximal
- 44. Task: Part-Of-Speech Tagging
  - Goal: Assign the correct part of speech to each word (and punctuation mark) in a text.
  - Example: Two/CRD old/ADJ men/NN bet/VBD on/Prep the/Det game/NN ./SYM
  - Learn a local model of POS dependencies, usually from pretagged data.
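- 45. Hidden Markov Models
  - Assume: the POS tags are generated as a random process, and each POS tag randomly generates a word.
  - [Figure: state diagram over the tags Det, ADJ, NN and NNS, with transition probabilities on the arcs and emission probabilities for the words "a", "the", "cat", "cats", "men" and "bet"]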
- 46. HMMs For Tagging
  - First-order (bigram) Markov assumptions:
    - Limited horizon: a tag depends only on the previous tag: $P(t_{i+1} = t^k \mid t_1 = t^{j_1}, \ldots, t_i = t^{j_i}) = P(t_{i+1} = t^k \mid t_i = t^j)$
    - Time invariance: no change over time: $P(t_{i+1} = t^k \mid t_i = t^j) = P(t_2 = t^k \mid t_1 = t^j) = P(t^j t^k)$
  - Output probabilities:
    - Probability of getting word $w^k$ for tag $t^j$: $P(w^k \mid t^j)$
    - Assumption: not dependent on other tags or words
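- 47. Combining Probabilities
  - Probability of a tag sequence: $P(t_1 t_2 \ldots t_n) = P(t_1)\, P(t_1 t_2)\, P(t_2 t_3) \cdots P(t_{n-1} t_n)$
  - Assuming a starting tag $t_0$: $P(t_1 t_2 \ldots t_n) = P(t_0 t_1)\, P(t_1 t_2)\, P(t_2 t_3) \cdots P(t_{n-1} t_n)$
  - Probability of a word sequence and tag sequence: $P(W, T) = \prod_i P(t_{i-1} t_i)\, P(w_i \mid t_i)$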
- 48. Training from labeled data
  - Labeled training data: each word has a POS tag.
  - Thus, with C(·) denoting counts in the training data:
    - $\pi(t^j) = P_{MLE}(t^j) = C(t^j) / N$
    - $a(t^j t^k) = P_{MLE}(t^j t^k) = C(t^j, t^k) / C(t^j)$
    - $b(w^k \mid t^j) = P_{MLE}(w^k \mid t^j) = C(t^j : w^k) / C(t^j)$
  - Smoothing applies as usual.
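The count-based estimates can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `train_hmm`, the start symbol `"<s>"` (standing in for the starting tag t0), and the two-sentence corpus are hypothetical. Emissions are estimated as P(word | tag) = C(tag, word) / C(tag), matching the output probabilities P(w | t) defined for the tagging model, and no smoothing is applied.

```python
from collections import Counter


def train_hmm(tagged_sentences, start="<s>"):
    """Maximum-likelihood estimates from sentences of (word, tag) pairs."""
    tag_count = Counter()    # C(t)
    trans_count = Counter()  # C(t_prev, t)
    emit_count = Counter()   # C(t, w)
    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    # a(t_prev t) = C(t_prev, t) / C(t_prev);  b(w | t) = C(t, w) / C(t)
    a = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    b = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return a, b


# Hypothetical toy corpus, for illustration only.
corpus = [
    [("the", "Det"), ("cat", "NN"), ("bet", "VBD")],
    [("the", "Det"), ("men", "NNS"), ("bet", "VBD")],
]
a, b = train_hmm(corpus)
print(a[("Det", "NN")], b[("Det", "the")])
```

Unseen tag pairs and word/tag pairs simply have no entry here, which is where smoothing would come in.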
- 49. Three Basic Problems
  - Compute the probability of a text: $P_\lambda(W_{1,N})$
  - Compute the maximum-probability tag sequence: $\arg\max_{T_{1,N}} P_\lambda(T_{1,N} \mid W_{1,N})$
  - Compute the maximum-likelihood model: $\arg\max_\lambda P_\lambda(W_{1,N})$
- 50. Central problems. Problem 1: Naïve solution
  - State sequence $Q = (q_1, \ldots, q_T)$
  - Assume independent observations: $P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$
  - NB: Observations are mutually independent, given the hidden states. (The joint distribution of independent variables factorises into the marginal distributions of the independent variables.)
- 51. Central problems. Problem 1: Naïve solution
  - Observe that: $P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$
  - And that: $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$
- 52. Central problems. Problem 1: Naïve solution
  - Finally get: $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$
  - NB:
    - The above sum is over all state paths.
    - There are $N^T$ state paths, each "costing" $O(T)$ calculations, leading to $O(T N^T)$ time complexity.
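The naïve sum can be written as a literal enumeration of all N^T state paths. The sketch below is only meant to make the cost visible; the function name and the two-state, two-symbol parameters are hypothetical and not from the slides.

```python
from itertools import product

# Hypothetical toy parameters (states 0 and 1, symbols 0 and 1).
pi = [0.5, 0.5]                 # pi[i]   = P(q_1 = i)
A = [[0.9, 0.1], [0.1, 0.9]]    # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.5, 0.5], [0.9, 0.1]]    # B[i][k] = P(o_t = k | q_t = i)


def brute_force_likelihood(obs, pi, A, B):
    """P(O | lambda) as a sum over every state path q of P(O | q) * P(q)."""
    n_states = len(pi)
    total = 0.0
    for q in product(range(n_states), repeat=len(obs)):  # N**T paths
        p_path = pi[q[0]]
        for prev, cur in zip(q, q[1:]):
            p_path *= A[prev][cur]
        p_obs = 1.0
        for state, symbol in zip(q, obs):
            p_obs *= B[state][symbol]
        total += p_obs * p_path
    return total


obs = [0, 0, 1]
likelihood = brute_force_likelihood(obs, pi, A, B)
print(likelihood)
```

Even for this tiny model the loop visits 2^3 = 8 paths; the count grows exponentially with the sequence length T.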
- 53. Central problems. Problem 1: Efficient solution
  - Forward algorithm: define the auxiliary forward variable α: $\alpha_t(i) = P(o_1, \ldots, o_t, q_t = i \mid \lambda)$
  - $\alpha_t(i)$ is the probability of observing the partial sequence of observables $o_1, \ldots, o_t$ and being in state $q_t = i$ at time $t$.
- 54. Central problems. Problem 1: Efficient solution
  - Recursive algorithm:
    - Initialise: $\alpha_1(i) = \pi_i\, b_i(o_1)$
    - Calculate: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$, that is, (partial observation sequence to t AND state i at t) × (transition to j at t+1) × (sensor). The sum is needed because j can be reached from any preceding state; α incorporates the partial observation sequence up to t.
    - Obtain: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$, the sum over the different ways of getting the observation sequence.
  - Complexity is $O(N^2 T)$.
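The three steps of the recursion (initialise, calculate, obtain) map onto a short loop. This is a minimal sketch with hypothetical toy parameters; the function name `forward` and the numbers are assumptions made for illustration, and no scaling against numerical underflow is attempted.

```python
# Hypothetical toy parameters (states 0 and 1, symbols 0 and 1).
pi = [0.5, 0.5]                 # pi[i]   = P(q_1 = i)
A = [[0.9, 0.1], [0.1, 0.9]]    # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.5, 0.5], [0.9, 0.1]]    # B[i][k] = P(o_t = k | q_t = i)


def forward(obs, pi, A, B):
    """Return (alpha, P(O | lambda)); alpha[t][i] covers obs[0..t] and state i."""
    n = len(pi)
    # Initialise: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Calculate: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ])
    # Obtain: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])


alpha, likelihood = forward([0, 0], pi, A, B)
print(likelihood)  # 0.135 + 0.387 = 0.522 for these toy numbers
```

Each time step does N × N multiplications, which is where the O(N²T) cost comes from.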
- 55. Forward Algorithm
  - Define $\alpha_k(i) = P(w_{1,k}, t_k = t^i)$
  - 1. For i = 1 to N: $\alpha_1(i) = a(t^0 t^i)\, b(w_1 \mid t^i)$
  - 2. For k = 2 to T; for j = 1 to N: $\alpha_k(j) = \left[\sum_i \alpha_{k-1}(i)\, a(t^i t^j)\right] b(w_k \mid t^j)$
  - 3. Then: $P_\lambda(W_{1,T}) = \sum_i \alpha_T(i)$
  - Complexity = $O(N^2 T)$
- 56. Forward Algorithm
  - [Figure: forward trellis for $P_\lambda(W_{1,3})$, with columns w1, w2, w3 and rows t1 to t5; each node holds $\alpha_k(i)$; initial arcs are labelled a(t0 ti), and the arcs into t1 at each step are labelled a(t1 t1), a(t2 t1), a(t3 t1), a(t4 t1), a(t5 t1)]
- 57. Central problems. Problem 1: Alternative solution
  - Backward algorithm: define the auxiliary backward variable β: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$
  - $\beta_t(i)$ is the probability of observing the sequence of observables $o_{t+1}, \ldots, o_T$ given state $q_t = i$ at time $t$, and λ.
- 58. Central problems. Problem 1: Alternative solution
  - Recursive algorithm:
    - Initialise: $\beta_T(j) = 1$
    - Calculate: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$ for $t = T-1, \ldots, 1$
    - Terminate: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
  - Complexity is $O(N^2 T)$.
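The backward recursion can be sketched the same way as the forward one. The function name and the toy parameters below are hypothetical. Note that the termination step in this sketch weights β_1(i) by the initial probability and the first emission, π_i · b_i(o_1), in the same way the tagging version of the algorithm uses a(t0 ti) b(w1 | ti); without those factors the first observation would not be accounted for.

```python
# Hypothetical toy parameters (states 0 and 1, symbols 0 and 1).
pi = [0.5, 0.5]                 # pi[i]   = P(q_1 = i)
A = [[0.9, 0.1], [0.1, 0.9]]    # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.5, 0.5], [0.9, 0.1]]    # B[i][k] = P(o_t = k | q_t = i)


def backward(obs, pi, A, B):
    """Return (beta, P(O | lambda)); beta[t][i] covers the observations after t."""
    n = len(pi)
    # Initialise: beta_T(i) = 1
    beta = [[1.0] * n]
    # Calculate, for t = T-1, ..., 1:
    #   beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][symbol] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    # Terminate: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood


beta, likelihood = backward([0, 0], pi, A, B)
print(likelihood)  # 0.5*0.5*0.54 + 0.5*0.9*0.86 = 0.522 for these toy numbers
```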
- 59. Backward Algorithm
  - Define $\beta_k(i) = P(w_{k+1,T} \mid t_k = t^i)$ (note the difference!)
  - 1. For i = 1 to N: $\beta_T(i) = 1$
  - 2. For k = T-1 down to 1; for j = 1 to N: $\beta_k(j) = \sum_i a(t^j t^i)\, b(w_{k+1} \mid t^i)\, \beta_{k+1}(i)$
  - 3. Then: $P_\lambda(W_{1,T}) = \sum_i a(t^0 t^i)\, b(w_1 \mid t^i)\, \beta_1(i)$
  - Complexity = $O(N^2 T)$
- 60. Backward Algorithm
  - [Figure: backward trellis for $P_\lambda(W_{1,3})$, with columns w1, w2, w3 and rows t1 to t5; each node holds $\beta_k(i)$; initial arcs are labelled a(t0 ti), and the arcs into t1 at each step are labelled a(t1 t1), a(t2 t1), a(t3 t1), a(t4 t1), a(t5 t1)]
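- 61. Viterbi Algorithm (Decoding)
  - Most probable tag sequence given the text:
    - $T^* = \arg\max_T P_\lambda(T \mid W)$
    - $= \arg\max_T P_\lambda(W \mid T)\, P_\lambda(T) / P_\lambda(W)$ (Bayes' theorem)
    - $= \arg\max_T P_\lambda(W \mid T)\, P_\lambda(T)$ ($P_\lambda(W)$ is constant for all T)
    - $= \arg\max_T \prod_i \left[a(t_{i-1} t_i)\, b(w_i \mid t_i)\right]$
    - $= \arg\max_T \sum_i \log\left[a(t_{i-1} t_i)\, b(w_i \mid t_i)\right]$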
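- 62. Viterbi example: trellis over words w1, w2, w3 with start tag t0 and tags t1, t2, t3
  - Transition probabilities A (row = previous tag, columns = next tag t1, t2, t3):
    - t0: 0.005, 0.02, 0.1
    - t1: 0.02, 0.1, 0.005
    - t2: 0.5, 0.0005, 0.0005
    - t3: 0.05, 0.05, 0.005
  - Output probabilities B (row = tag, columns = words w1, w2, w3):
    - t1: 0.2, 0.005, 0.005
    - t2: 0.02, 0.2, 0.0005
    - t3: 0.02, 0.02, 0.05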
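- 63. Viterbi example in log space (base-10 logarithms)
  - Best log-probability at each trellis node (columns = w1, w2, w3), starting from 0 at t0:
    - t1: -3, -6, -7.3
    - t2: -3.4, -4.7, -10.3
    - t3: -2.7, -6.7, -9.3
  - -log A (row = previous tag, columns = next tag t1, t2, t3):
    - t0: 2.3, 1.7, 1
    - t1: 1.7, 1, 2.3
    - t2: 0.3, 3.3, 3.3
    - t3: 1.3, 1.3, 2.3
  - -log B (row = tag, columns = words w1, w2, w3):
    - t1: 0.7, 2.3, 2.3
    - t2: 1.7, 0.7, 3.3
    - t3: 1.7, 1.7, 1.3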
- 64. Viterbi Algorithm
  - 1. D(0, START) = 0
  - 2. For each tag t ≠ START: D(0, t) = -∞
  - 3. For i ← 1 to N, for each tag tj:
    - Probability form: D(i, tj) ← max_k D(i-1, tk) · b(wi | tj) · a(tk tj)
    - Log form (the one used with the initialisation above): D(i, tj) ← max_k D(i-1, tk) + log b(wi | tj) + log a(tk tj)
  - 4. log P(W, T) = max_j D(N, tj)
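The log-space recurrence can be run on the numbers from the worked trellis example (the A and B tables for tags t1 to t3 and words w1 to w3). The sketch below is illustrative: the function name, the separate `start` vector for the t0 row, and the use of back-pointers to recover the path are assumptions made here; base-10 logarithms are used so the scores line up with the trellis values in the example.

```python
import math

# Probabilities from the worked example; tags t1..t3 are indices 0..2.
start = [0.005, 0.02, 0.1]        # a(t0 -> t_j)
A = [[0.02, 0.1, 0.005],          # A[k][j] = a(t_k -> t_j)
     [0.5, 0.0005, 0.0005],
     [0.05, 0.05, 0.005]]
B = [[0.2, 0.005, 0.005],         # B[j][w] = b(w | t_j)
     [0.02, 0.2, 0.0005],
     [0.02, 0.02, 0.05]]


def viterbi(obs, start, trans, emit):
    """Return (best log10 score, best tag-index path) for the word indices `obs`."""
    n = len(start)
    D = [[math.log10(start[j]) + math.log10(emit[j][obs[0]]) for j in range(n)]]
    back = []  # back[i][j] = best predecessor of tag j at position i + 1
    for w in obs[1:]:
        row, ptr = [], []
        for j in range(n):
            scores = [D[-1][k] + math.log10(trans[k][j]) for k in range(n)]
            k_best = max(range(n), key=scores.__getitem__)
            row.append(scores[k_best] + math.log10(emit[j][w]))
            ptr.append(k_best)
        D.append(row)
        back.append(ptr)
    last = max(range(n), key=D[-1].__getitem__)
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the last word
        path.append(ptr[path[-1]])
    path.reverse()
    return D[-1][last], path


score, path = viterbi([0, 1, 2], start, A, B)  # words w1 w2 w3
print(round(score, 1), path)
```

On these numbers the best score is about -7.3, ending in t1 after passing through t2 at the second word. The first tag is an exact tie between t1 and t3, so which one the back-pointer reports depends on floating-point rounding.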
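- 65. Question: Suppose the sequence of our game is HHHTHHHTTHHTH. What is the probability of the sequence given the model?
  - [Figure: two-state model with hidden states "fair" and "loaded", each emitting Heads or Tails; the start state leads to each state with probability 0.5; the remaining arcs are labelled 0.9 and 0.1]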
- 66. Decoding
  - Suppose we have a text written by Shakespeare and a monkey. Can we tell who wrote what?
  - Text: Shakespeare or Monkey?
  - Case 1: Fehwufhweuromeojulietpoisonjigjreijge
  - Case 2: mmmmbananammmmmmmbananammm
