Published on


Published in: Technology, Business
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. A Revealing Introduction to Hidden Markov Models Mark Stamp HMM 1
  2. 2. Hidden Markov Models What is a hidden Markov model (HMM)? A machine learning technique and…  A discrete hill climb technique  Two for the price of one! Where are HMMs used?  Speech recognition, malware detection, IDS, etc., etc., etc. Why is it useful?  Easy to apply and efficient algorithms HMM 2
  3. 3. Markov Chain Markov chain:  “memoryless random process”  Transitions depend only on current state (Markov chain of order 1) and transition probability matrix Example?  See next slide… HMM 3
  4. 4. Markov Chain 0.7 Supposewe’re interested in average annual temperature H  Only consider Hot and Cold From recorded history, we 0.3 0.4 obtain probabilities in diagram to the right C 0.6 HMM 4
  5. 5. Markov Chain 0.7 Transition probability matrix H 0.3 0.4 Matrix is denoted as A C Note, A is “row stochastic” 0.6 HMM 5
  6. 6. Markov Chain Can also include 0.7 begin, end states 0.6 Begin state H matrix is π  In this example, begin 0.3 0.4 end C Note that π is also 0.4 row stochastic 0.6 HMM 6
  7. 7. Hidden Markov Model HMM includes a Markov chain  But the Markov process is “hidden” Cannot observe the Markov process  Instead, we observe something related (by probabilities) to hidden states  It’s as if there is a “curtain” between Markov chain and observations Example on next few slides… HMM 7
  8. 8. HMM Example ConsiderH/C temperature example Suppose we want to know H or C annual temperature in distant past  Before thermometers (or humans) invented  We just want to decide between H and C We assume transition between Hot and Cold years is same as today  So, the A matrix is known HMM 8
  9. 9. HMM Example Temp in past determined by Markov process But, we cannot observe temperature in past We findthat tree ring size is related to temperature  Look at historical data to see the connection We consider 3 tree ring sizes  Small, Medium, Large (S, M, L, respectively) Measure tree ring sizes and recorded temperatures to determine relationship HMM 9
  10. 10. HMM Example Wefind that tree ring sizes and temperature related by This is known as the B matrix: Note that B is also row stochastic HMM 10
  11. 11. HMM Example Can we now find H/C temps in past? We cannot measure (observe) temps But we can measure tree ring sizes… …and tree ring sizes related to temps  By the B matrix Weought to be able to say something about average annual temperature HMM 11
  12. 12. HMM NotationA lot of notation is required  Notation may be the most difficult part HMM 12
  13. 13. HMM Notation To simplify notation, observations are taken from the set {0,1,…,M-1} That is, The matrix A = {aij} is N x N, where The matrix B = {bj(k)} is N x M, where HMM 13
  14. 14. HMM Example Consider our temperature example… What are the observations?  V = {0,1,2}, which corresponds to S,M,L What are states of Markov process?  Q = {H,C} What are A,B,π, and T?  A,B,πon previous slides  T is number of tree rings measured What are N and M?  N = 2 and M = 3 HMM 14
  15. 15. Generic HMM Generic view of HMM HMM defined by A,B,andπ We denote HMM “model” as λ = (A,B,π) HMM 15
  16. 16. HMM Example Suppose that we observe tree ring sizes  For 4 year period of interest: S,M,S,L  Then = (0, 1, 0, 2) Most likely (hidden) state sequence?  We want most likely X = (x0, x1, x2, x3) Let πx0be prob. of starting in state x0 Note prob. of initial observation  And ax0,x1 is prob. of transition x0 to x1 And so on… HMM 16
  17. 17. HMM Example Bottom line? We can compute P(X) for any X For X = (x0, x1, x2, x3) we have Suppose we observe (0,1,0,2), then what is probability of, say, HHCC? Plug into formula above to find HMM 17
  18. 18. HMM Example Do same for all 4-state sequences We find… The winner is?  CCCH Not so fast my friend… HMM 18
  19. 19. HMM Example The pathCCCH scores the highest In dynamic programming (DP), we find highest scoring path But, HMM maximizes expected number of correct states  Sometimes called “EM algorithm”  For “Expectation Maximization” How does HMM work in this example? HMM 19
  20. 20. HMM Example For first position…  Sum probabilities for all paths that have H in 1st position, compare to sum of probs for paths with C in 1st position --- biggest wins Repeat for each position and we find HMM 20
  21. 21. HMM Example So, HMM solution gives us CHCH While DP solution is CCCH Which solution is better? Neither!!!  They use different definitions of “best” HMM 21
  22. 22. HMM Paradox? HMM maximizes expected number of correct states  Whereas DP chooses “best” overall path Possiblefor HMM to choose a “path” that is impossible  Could be a transition probability of 0 Cannot get impossible path with DP Is this a flaw with HMM?  No, it’s a feature… HMM 22
  23. 23. HMM Model An HMM is defined by the three matrices, A, B, and π Note that M and N are implied, since they are the dimensions of the matrices So, we denote HMM “model” asλ = (A,B,π) HMM 23
  24. 24. The Three Problems HMMs used to solve 3 problems Problem 1: Given a model λ = (A,B,π) and observation sequence O, find P(O|λ)  That is, we can score an observation sequence to see how well it fits a given model Problem 2: Given λ = (A,B,π) and O, find an optimal state sequence  Uncover hidden part (like previous example) Problem 3: Given O, N, and M, find the model λ that maximizes probability of O  That is, train a model to fit observations HMM 24
  25. 25. HMMs in Practice Typically,HMMs used as follows: Given an observation sequence…  Assume a (hidden) Markov process exists Train a model based on observations  Problem 3 (find N by trial and error) Thengiven a sequence of observations, score it versus the model  Problem 1: high score implies it’s similar to training data, low score implies it’s not HMM 25
  26. 26. HMMs in Practice Previousslide gives sense in which HMM is a “machine learning” technique  To train model, we do not need to specify anything except the parameter N  And “best” N found by trial and error That is, we don’t have to think too much  Just train HMM and then use it  Best of all, efficient algorithms for HMMs HMM 26
  27. 27. The Three Solutions We give detailed solutions to the three problems  Note: We must provide efficient solutions Recall the three problems:  Problem 1: Score an observation sequence versus a given model  Problem 2: Given a model, “uncover” hidden part  Problem 3: Given an observation sequence, train a model HMM 27
  28. 28. Solution 1 Score observations versus a given model  Given model λ = (A,B,π) and observation sequence O=(O0,O1,…,OT-1), find P(O|λ) Denote hidden states as X = (x0, x1, . . . , xT-1) Then from definition of B,P(O|X,λ)=bx0(O0) bx1(O1) … bxT-1(OT-1) And from definition of A and π,P(X|λ)=πx0 ax0,x1 ax1,x2 … axT-2,xT-1 HMM 28
  29. 29. Solution 1 Elementary conditional probability fact:P(O,X|λ) = P(O|X,λ) P(X|λ) Sum over all possible state sequences X, P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ) = Σπx0bx0(O0)ax0,x1bx1(O1)…axT-2,xT-1bxT-1(OT-1) This “works” but way too costly Requires about 2TNT multiplications  Why? There better be a better way… HMM 29
  30. 30. Forward Algorithm Instead of brute force: forward algorithm  Or “alpha pass” For t = 0,1,…,T-1 and i=0,1,…,N-1, letαt(i) = P(O0,O1,…,Ot,xt=qi|λ) Probability of “partial sum” to t, and Markov process is in state qi at step t  What the? Can be computed recursively, efficiently HMM 30
  31. 31. Forward Algorithm Let α0(i) = πibi(O0) for i = 0,1,…,N-1 For t = 1,2,…,T-1 and i=0,1,…,N-1, letαt(i) = (Σαt-1(j)aji)bi(Ot)  Where the sum is from j = 0 to N-1 From definition of αt(i) we seeP(O|λ) = ΣαT-1(i)  Where the sum is from i = 0 to N-1 Note this requires only N2T multiplications HMM 31
  32. 32. Solution 2 Given a model, find “most likely” hidden states: Given λ = (A,B,π) and O, find an optimal state sequence  Recall that optimal means “maximize expected number of correct states”  In contrast, DP finds best scoring path For temp/tree ring example, solved this  But hopelessly inefficient approach A better way: backward algorithm  Or “beta pass” HMM 32
  33. 33. Backward Algorithm For t = 0,1,…,T-1 and i=0,1,…,N-1, letβt(i) = P(Ot+1,Ot+2,…,OT-1|xt=qi,λ) Probability of partial sum from t to end and Markov process in state qi at step t Analogous to the forward algorithm As with forward algorithm, this can be computed recursively and efficiently HMM 33
  34. 34. Backward Algorithm Let βT-1(i) = 1for i = 0,1,…,N-1 For t = T-2,T-3, …,1 and i=0,1,…,N-1, letβt(i) = Σaijbj(Ot+1)βt+1(j)  Where the sum is from j = 0 to N-1 HMM 34
  35. 35. Solution 2 For t = 1,2,…,T-1 and i=0,1,…,N-1 defineγt(i) = P(xt=qi|O,λ)  Most likely state at t is qi that maximizes γt(i) Note that γt(i) = αt(i)βt(i)/P(O|λ)  And recall P(O|λ) = ΣαT-1(i) The bottom line?  Forward algorithm solves Problem 1  Forward/backward algorithms solve Problem 2 HMM 35
  36. 36. Solution 3 Train a model: Given O, N, and M, find λ that maximizes probability of O Here, we iteratively adjust λ = (A,B,π) to better fit the given observations O  The size of matrices are fixed (N and M)  But elements of matrices can change It is amazing that this works!  And even more amazing that it’s efficient HMM 36
  37. 37. Solution 3 For t=0,1,…,T-2 and i,jin {0,1,…,N- 1}, define “di-gammas” asγt(i,j) = P(xt=qi, xt+1=qj|O,λ) Note γt(i,j) is prob of being in state qi at time t and transiting to state qj at t+1 Then γt(i,j) = αt(i)aijbj(Ot+1)βt+1(j)/P(O|λ) And γt(i) = Σγt(i,j)  Where sum is from j = 0 to N – 1 HMM 37
  38. 38. Model Re-estimation Given di-gammas and gammas… For i = 0,1,…,N-1 let πi = γ0(i) For i = 0,1,…,N-1 and j = 0,1,…,N-1aij = Σγt(i,j)/Σγt(i)  Where both sums are from t = 0 to T-2 For j = 0,1,…,N-1 and k = 0,1,…,M-1bj(k) = Σγt(j)/Σγt(j)  Both sums from from t = 0 to T-2 but only t for which Ot = kare counted in numerator Why does this work? HMM 38
  39. 39. Solution 3 To summarize…1. Initialize λ = (A,B,π)2. Compute αt(i), βt(i), γt(i,j), γt(i)3. Re-estimate the model λ = (A,B,π)4. If P(O|λ) increases, goto 2 HMM 39
  40. 40. Solution 3 Some fine points… Model initialization  If we have a good guess for λ = (A,B,π) then we can use it for initialization  If not, let πi ≈ 1/N, ai,j≈ 1/N, bj(k) ≈ 1/M  Subject to row stochastic conditions  But, do not initialize to uniform values Stopping conditions  Stop after some number of iterations and/or…  Stop if increase in P(O|λ) is “small” HMM 40
  41. 41. HMM as Discrete Hill Climb Algorithm on previous slides shows that HMM is a “discrete hill climb” HMM consists of discrete parameters  Specifically, the elements of the matrices And re-estimation process improves model by modifying parameters  So, process “climbs” toward improved model  This happens in a high-dimensional space HMM 41
  42. 42. Dynamic Programming Brief detour… For λ = (A,B,π) as above, it’s easy to define a dynamic program (DP) Executive summary:  DP is forward algorithm, with “sum” replaced by “max” Precise details on next few slides HMM 42
  43. 43. Dynamic Programming Let δ0(i) = πi bi(O0)for i=0,1,…,N-1 For t=1,2,…,T-1 and i=0,1,…,N-1 computeδt(i) = max (δt-1(j)aji)bi(Ot)  Where the max is over j in {0,1,…,N-1} Note that at each t, the DP computes best path for each state, up to that point So, probability of best path is max δT-1(j) This max gives the best probability  Not the best path, for that, see next slide HMM 43
  44. 44. Dynamic Programming To determine optimal path  While computing deltas, keep track of pointers to previous state  When finished, construct optimal path by tracing back points For example, consider temp example: recall that we observe (0,1,0,2) Probabilities for path of length 1: These are the only “paths” of length 1 HMM 44
  45. 45. Dynamic Programming Probabilities for each path of length 2 Best path of length 2 ending with H is CH Best path of length 2 ending with C is CC HMM 45
  46. 46. Dynamic Program Continuing,we compute best path ending at H and C at each step And save pointers --- why? HMM 46
  47. 47. Dynamic Program Best final score is .002822  And, thanks to pointers, best path is CCCH But what about underflow? A serious problem in bigger cases HMM 47
  48. 48. Underflow Resistant DP Common trick to prevent underflow  Insteadof multiplying probabilities…  …we add logarithms of probabilities Why does this work?  Because log(xy) = log x + log y  Adding logs does not tend to 0 Note that we must avoid 0 probabilities HMM 48
  49. 49. Underflow Resistant DP Underflow resistant DP algorithm: Let δ0(i) = log(πi bi(O0))for i=0,1,…,N-1 For t=1,2,…,T-1 and i=0,1,…,N-1 computeδt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))  Where the max is over j in {0,1,…,N-1} And score of best path is max δT-1(j) As before, must also keep track of paths HMM 49
  50. 50. HMM Scaling Trickierto prevent underflow in HMM We consider solution 3  Since it includes solutions 1 and 2 Recall for t = 1,2,…,T-1, i=0,1,…,N-1,αt(i) = (Σαt-1(j)aj,i)bi(Ot) The idea is to normalize alphas so that they sum to 1  Algorithm on next slide HMM 50
  51. 51. HMM Scaling Given αt(i) = (Σαt-1(j)aj,i)bi(Ot) Let a0(i) = α0(i) for i=0,1,…,N-1 Let c0 = 1/Σa0(j) For i = 0,1,…,N-1, let a0(i) = c0a0(i) This takes care of t = 0 case Algorithm continued on next slide… HMM 51
  52. 52. HMM Scaling For t = 1,2,…,T-1 do the following: For i = 0,1,…,N-1,at(i) = (Σat-1(j)aj,i)bi(Ot) Let ct = 1/Σat(j) For i = 0,1,…,N-1 let at(i) = ctat(i) HMM 52
  53. 53. HMM Scaling Easy to show at(i) = c0c1…ct αt(i) (♯)  Simple proof by induction So, c0c1…ct is scaling factor at step t Also, easy to show thatat(i) = αt(i)/Σαt(j) Which implies ΣaT-1(i) = 1 (♯♯) HMM 53
  54. 54. HMM Scaling By combining (♯) and (♯♯), we have1 = ΣaT-1(i) = c0c1…cT-1 ΣαT-1(i) = c0c1…cT-1 P(O|λ) Therefore, P(O|λ) = 1 / c0c1…cT-1 To avoid underflow, we computelog P(O|λ) = -Σlog(cj)  Where sum is from j = 0 to T-1 HMM 54
  55. 55. HMM Scaling Similarly, scale betas as ctβt(i) For re-estimation,  Computeγt(i,j)andγt(i)using original formulas, but with scaled alphas and betas This gives us new values for λ = (A,B,π) “Easy exercise” to show re-estimate is exact when scaled alphas and betas used Also, P(O|λ) cancels from formula  Use log P(O|λ) = -Σlog(cj) to decide if iterate improves HMM 55
  56. 56. All Together Now Complete pseudo code for Solution 3 Given: (O0,O1,…,OT-1) and N and M Initialize: λ = (A,B,π)  A is NxN, B is NxM and π is 1xN  πi≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row stochastic, but not uniform Initialize:  maxIters = max number of re-estimation steps  iters = 0  oldLogProb = -∞ HMM 56
  57. 57. Forward Algorithm Forward algorithm  With scaling HMM 57
  58. 58. Backward Algorithm Backward algorithm or “beta pass”  With scaling Note:same scaling factor as alphas HMM 58
  59. 59. Gammas Using scaled alphas and betas So formulas unchanged HMM 59
  60. 60. Re-Estimation Again, using scaled gammas So formulas unchanged HMM 60
  61. 61. Stopping Criteria Checkthat probability increases  In practice, want logProb>oldLogPro b+ε And don’t exceed max iterations HMM 61
  62. 62. English Text Example Suppose Martian arrives on earth  Sees written English text  Wants to learn something about it  Martians know about HMMs So,strip our all non-letters, make all letters lower-case  27 symbols (letters, plus word-space)  Train HMM on long sequence of symbols HMM 62
  63. 63. English Text For first training case, initialize: N = 2 and M = 27  Elements of A and πare about ½ each  Elements of B are each about 1/27 We use 50,000 symbols for training After 1stiter: log P(O|λ) ≈ -165097 After 100thiter: log P(O|λ) ≈ -137305 HMM 63
  64. 64. English Text Matrices A and πconverge: What does this tells us?  Started in hidden state 1 (not state 0)  And we know transition probabilities between hidden states Nothing too interesting here  We don’t care about hidden states HMM 64
  65. 65. English Text What about B matrix? This much more interesting…  Why??? HMM 65
  66. 66. A Security Application Suppose we want to detect metamorphic computer viruses  Such viruses vary their internal structure  But function of malware stays same  If sufficiently variable, standard signature detection will fail Can we use HMM for detection?  What to use as observation sequence?  Is there really a “hidden” Markov process?  What about N, M, and T?  How many Os needed for training, scoring? HMM 66
  67. 67. HMM for Metamorphic Detection Set of “family” viruses into 2 subsets Extract opcodes from each virus Append opcodes from subset 1 to make one long sequence  Train HMM on opcode sequence (problem 3)  Obtain a model λ = (A,B,π) Set threshold: score opcodes from files in subset 2 and “normal” files (problem 1)  Can you sets a threshold that separates sets?  If so, may have a viable detection method HMM 67
  68. 68. HMM for Metamorphic Detection Virus detection results from recent paper  Note the separation This is good! HMM 68
  69. 69. HMM Generalizations Here, assumed Markov process of order 1  Current state depends only on previous state and transition matrix A Can use higher order Markov process  Current state depends on n previous states  Higher order vs size of N ? “Depth” vs “width” Can have A and B matrices depend on t HMM often combined with other techniques (e.g., neural nets) HMM 69
  70. 70. Generalizations Insome cases, limitation of HMM is that position information is not used  In many applications this is OK/desirable  In some apps, this is a serious problem Bioinformatics applications  DNA sequencing, protein alignment, etc.  Sequence alignment is crucial  They use “profile HMMs” instead of HMMs HMM 70
  71. 71. ReferencesArevealing introduction to hidden Markov models, by M. Stamp  /HMM.pdfA tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner  ner.pdf HMM 71
  72. 72. References Hunting for metamorphic engines, W. Wong and M. Stamp  Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229 Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp  Journalin Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214 HMM 72