Hidden Markov Models with applications to speech recognition


Published on

1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Hidden Markov Models with applications to speech recognition

  1. 1. Hidden Markov Models By Marc Sobel
  2. 2. Introduction <ul><li>Modeling dependencies in input; no longer iid </li></ul><ul><li>Sequences: </li></ul><ul><ul><li>Temporal: In speech; phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language). </li></ul></ul><ul><ul><li>In handwriting, pen movements </li></ul></ul><ul><ul><li>Spatial: In a DNA sequence; base pairs </li></ul></ul>
  3. 3. Discrete Markov Process <ul><li>N states: S 1 , S 2 , ..., S N State at “time” t , q t = S i </li></ul><ul><li>First-order Markov </li></ul><ul><li> P ( q t +1 = S j | q t = S i , q t -1 = S k ,...) = P ( q t +1 = S j | q t = S i ) </li></ul><ul><li>Transition probabilities </li></ul><ul><li> a ij ≡ P ( q t +1 = S j | q t = S i ) a ij ≥ 0 and Σ j =1 N a ij =1 </li></ul><ul><li>Initial probabilities </li></ul><ul><li> π i ≡ P ( q 1 = S i ) Σ j =1 N π i =1 </li></ul>
  4. 4. Time-based Models <ul><li>The models typically examined by statistics: </li></ul><ul><ul><li>Simple parametric distributions </li></ul></ul><ul><ul><li>Discrete distribution estimates </li></ul></ul><ul><li>These are typically based on what is called the “independence assumption”- each data point is independent of the others, and there is no time-sequencing or ordering. </li></ul><ul><li>What if the data has correlations based on its order, like a time-series? </li></ul>
  5. 5. Applications of time based models <ul><li>Sequential pattern recognition is a relevant problem in a number of disciplines </li></ul><ul><ul><li>Human-computer interaction: Speech recognition </li></ul></ul><ul><ul><li>Bioengineering: ECG and EEG analysis </li></ul></ul><ul><ul><li>Robotics: mobile robot navigation </li></ul></ul><ul><ul><li>Bioinformatics: DNA base sequence alignment </li></ul></ul>
  6. 6. Andrei Andreyevich Markov Born: 14 June 1856 in Ryazan, Russia Died: 20 July 1922 in Petrograd (now St Petersburg), Russia Markov is particularly remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes .
  7. 7. Markov random processes <ul><li>A random sequence has the Markov property if its distribution is determined solely by its current state. Any random process having this property is called a Markov random process . </li></ul><ul><li>For observable state sequences (state is known from data), this leads to a Markov chain model. </li></ul><ul><li>For non-observable states, this leads to a Hidden Markov Model (HMM). </li></ul>
  8. 8. Chain Rule & Markov Property Bayes rule Markov property
  9. 9. s 1 s 3 s 2 Has N states, called s 1 , s 2 .. s N There are discrete timesteps, t=0, t=1, … N = 3 t=0 A Markov System
  10. 10. Example: Balls and Urns (markov process with a non-hidden observation process – stochastic automoton <ul><li>Three urns each full of balls of one color </li></ul><ul><li>S 1 : red, S 2 : blue, S 3 : green </li></ul>
  11. 11. A Plot of 100 observed numbers for the stochastic automoton
  12. 12. Histogram for the stochastic automaton: the proportions reflect the stationary distribution of the chain
  13. 13. Hidden Markov Models <ul><li>States are not observable </li></ul><ul><li>Discrete observations { v 1 , v 2 ,..., v M } are recorded; a probabilistic function of the state </li></ul><ul><li>Emission probabilities </li></ul><ul><li>b j ( m ) ≡ P ( O t = v m | q t = S j ) </li></ul><ul><li>Example: In each urn, there are balls of different colors, but with different probabilities. </li></ul><ul><li>For each observation sequence, there are multiple state sequences </li></ul>
  14. 14. From Markov To Hidden Markov <ul><li>The previous model assumes that each state can be uniquely associated with an observable event </li></ul><ul><ul><li>Once an observation is made, the state of the system is then trivially retrieved </li></ul></ul><ul><ul><li>This model, however, is too restrictive to be of practical use for most realistic problems </li></ul></ul><ul><li>To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state </li></ul><ul><ul><li>Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state </li></ul></ul><ul><ul><li>These are known a Hidden Markov Models (HMM) , because the state sequence is not directly observable, it can only be approximated from the sequence of observations produced by the system </li></ul></ul>
  15. 15. The coin-toss problem <ul><li>To illustrate the concept of an HMM consider the following scenario </li></ul><ul><ul><li>Assume that you are placed in a room with a curtain </li></ul></ul><ul><ul><li>Behind the curtain there is a person performing a coin-toss experiment </li></ul></ul><ul><ul><li>This person selects one of several coins, and tosses it: heads (H) or tails (T) </li></ul></ul><ul><ul><li>The person tells you the outcome (H,T), but not which coin was used each time </li></ul></ul><ul><li>Your goal is to build a probabilistic model that best explains a sequence of observations O={o1,o2,o3,o4,…}={H,T,T,H,,…} </li></ul><ul><ul><li>The coins represent the states; these are hidden because you do not know which coin was tossed each time </li></ul></ul><ul><ul><li>The outcome of each toss represents an observation </li></ul></ul><ul><ul><li>A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique </li></ul></ul>
  16. 16. Speech Recognition <ul><li>We record the sound signals associated with words. </li></ul><ul><li>We’d like to identify the ‘speech recognition features associated with pronouncing these words. </li></ul><ul><li>The features are the states and the sound signals are the observations. </li></ul>
  17. 17. The Coin Toss Example – 1 coin <ul><li>As a result, the Markov model is observable since there is only one state </li></ul><ul><li>In fact, we may describe the system with a deterministic model where the states are the actual observations (see figure) </li></ul><ul><li>the model parameter P(H) may be found from the ratio of heads and tails </li></ul><ul><li>O= H H H T T H… </li></ul><ul><li>S = 1 1 1 2 2 1… </li></ul>
  18. 18. The Coin Toss Example – 2 coins
  19. 19. From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
  20. 20. 1, 2 or 3 coins? <ul><li>Which of these models is best? </li></ul><ul><ul><li>Since the states are not observable, the best we can do is select the model that best explains the data (e.g., Maximum Likelihood criterion) </li></ul></ul><ul><ul><li>Whether the observation sequence is long and rich enough to warrant a more complex model is a different story, though </li></ul></ul>
  21. 21. The urn-ball problem <ul><li>To further illustrate the concept of an HMM, consider this scenario </li></ul><ul><ul><li>You are placed in the same room with a curtain </li></ul></ul><ul><ul><li>Behind the curtain there are N urns, each containing a large number of balls with M different colors </li></ul></ul><ul><ul><li>The person behind the curtain selects an urn according to an internal random process, then randomly grabs a ball from the selected urn </li></ul></ul><ul><ul><li>He shows you the ball, and places it back in the urn </li></ul></ul><ul><ul><li>This process is repeated over and over </li></ul></ul><ul><li>Questions? </li></ul><ul><ul><li>How would you represent this experiment with an HMM? </li></ul></ul><ul><ul><li>What are the states? </li></ul></ul><ul><ul><li>Why are the states hidden? </li></ul></ul><ul><ul><li>What are the observations? </li></ul></ul>
  22. 22. Doubly Stochastic System The Urn-and-Ball Model O = {green, blue, green, yellow, red, ..., blue} How can we determine the appropriate model for the observation sequence given the system above?
  23. 23. Four Basic Problems of HMMs <ul><li>Evaluation: Given λ , and O , calculate P ( O | λ ) </li></ul><ul><li>State sequence: Given λ , and O , find Q * such that </li></ul><ul><li>P ( Q * | O , λ ) = max Q P ( Q | O , λ ) </li></ul><ul><li>Learning: Given X ={ O k } k , find λ * such that </li></ul><ul><li>P ( X | λ * )=max λ P ( X | λ ) </li></ul><ul><li>4. Statistical Inference : Given X ={ O k } k , and given observation distributions P( X | θ λ ) for different lambda’s, estimate the theta parameters. </li></ul>(Rabiner, 1989)
  24. 24. Example: Balls and Urns (HMM): Learning I <ul><li>Three urns each full of balls of different colors: </li></ul><ul><li>S 1 : state 1 , S 2 : state 2 , S 3 : state 3: start at urn 1. </li></ul>
  25. 25. Baum-Welch EM for Hidden Markov Models <ul><li>We use the notation q t for the probability of the result at time t; </li></ul><ul><li>a i[t-1],i[t] for the probability of going from the observed state at time t-1 to the observed state at time t; n i for the observed number of results i, and n i,j for the number of transitions from I to j; </li></ul>
  26. 26. Baum-Welch EM for hmm’s <ul><li>The constraints are that: </li></ul><ul><li>So, differentiating under constraints we get: </li></ul>
  27. 27. Observed colored balls in the hmm model
  28. 28. EM results <ul><li>We have, </li></ul>
  29. 29. More General Elements of an HMM <ul><li>N : Number of states </li></ul><ul><li>M : Number of observation symbols </li></ul><ul><li>A = [ a ij ]: N by N state transition probability matrix </li></ul><ul><li>B = b j (m) : N by M observation probability matrix </li></ul><ul><li>Π = [ π i ]: N by 1 initial state probability vector </li></ul><ul><li>λ = ( A , B , Π ), parameter set of HMM </li></ul>
  30. 30. Particle Evaluation <ul><li>At stage t, simulate the new state from the former state </li></ul><ul><li>using the distribution, and </li></ul><ul><li>Weight the result by, . The resulting weight for the j’th particle is: </li></ul><ul><li>We should use standard residual resampling. The result gets 50 percent accuracy [Note: I haven’t perfected good residual sampling]. </li></ul>
  31. 31. Particle Results: based on 50 observations
  32. 32. Viterbi’s Algorithm <ul><li>δ t ( i ) ≡ max q 1 q 2∙∙∙ qt -1 p ( q 1 q 2 ∙∙∙ q t -1 ,q t = S i , O 1 ∙∙∙ O t | λ) </li></ul><ul><li>Initialization: </li></ul><ul><li>δ 1 ( i ) = π i b i ( O 1 ), ψ 1 ( i ) = 0 </li></ul><ul><li>Recursion: </li></ul><ul><li> δ t ( j ) = max i δ t -1 ( i ) a ij b j ( O t ), ψ t ( j ) = argmax i δ t -1 ( i ) a ij </li></ul><ul><li>Termination: </li></ul><ul><li>p * = max i δ T ( i ), q T * = argmax i δ T ( i ) </li></ul><ul><li>Path backtracking: </li></ul><ul><ul><li>q t * = ψ t +1 ( q t +1 * ), t = T -1, T -2, ..., 1 </li></ul></ul>
  33. 33. Viterbi learning versus the actual state (estimate =3; 62% accuracy)
  34. 34. General EM <ul><li>At each step assume k states: </li></ul><ul><li>With p known and the theta’s unknown. We use the terminology Z 1 ,…,Z t for the (unobserved states). </li></ul><ul><li>Then the EM equation: (with the pi’s the stationary probabilities of the states) </li></ul>
  35. 35. EM Equations <ul><li>We have, </li></ul><ul><li>So, in the Poisson hidden case we have: </li></ul>
  36. 36. Binomial hidden model <ul><li>We have: </li></ul>
  37. 37. Coin-Tossing Model <ul><li>Coin 1: 0.2000 0.8000 </li></ul><ul><li>Coin 2: 0.7000 0.3000 </li></ul><ul><li>Coin 3: 0.5000 0.5000 </li></ul><ul><li>State Matrix: C1 C2 C3 </li></ul><ul><li>Coin 1 0.4000 0.3000 0.3000 </li></ul><ul><li>Coin 2 0.2000 0.6000 0.2000 </li></ul><ul><li>Coin 3 0.1000 0.1000 0.8000 </li></ul>
  38. 38. Coin tossing model: results
  39. 39. Maximum Likelihood Model <ul><li>Stationary distribution for states is: </li></ul><ul><li>0.1818 </li></ul><ul><li>0.2727 </li></ul><ul><li>0.5455 </li></ul><ul><li>Therefore using a binomial hidden HMM we get: </li></ul>
  40. 40. MCMC approach <ul><li>Update the posterior distributions for the parameters and the (unobserved) state variables. </li></ul>
  41. 41. Continuous Observations <ul><li>Discrete: </li></ul><ul><li>Gaussian mixture (Discretize using k -means): </li></ul><ul><li>Continuous: </li></ul>Use EM to learn parameters, e.g.,
  42. 42. HMM with Input <ul><li>Input-dependent observations: </li></ul><ul><li>Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996): </li></ul><ul><li>Time-delay input: </li></ul>
  43. 43. Model Selection in HMM <ul><li>Left-to-right HMMs: </li></ul><ul><li>In classification, for each C i , estimate P ( O | λ i ) by a separate HMM and use Bayes’ rule </li></ul>