
- 1. A Revealing Introduction to Hidden Markov Models Mark Stamp HMM 1
- 2. Hidden Markov Models What is a hidden Markov model (HMM)? A machine learning technique and… a discrete hill climb technique Two for the price of one! Where are HMMs used? Speech recognition, malware detection, IDS, etc. Why is it useful? Easy to apply, with efficient algorithms
- 3. Markov Chain Markov chain: "memoryless random process" Transitions depend only on the current state (Markov chain of order 1) and the transition probability matrix Example? See next slide…
- 4. Markov Chain Suppose we're interested in average annual temperature Only consider Hot (H) and Cold (C) From recorded history, we obtain the transition probabilities in the diagram: H to H is 0.7, H to C is 0.3, C to H is 0.4, C to C is 0.6
- 5. Markov Chain Transition probability matrix, denoted A With rows and columns ordered H, C: A = [[0.7, 0.3], [0.4, 0.6]] Note, A is "row stochastic" (each row sums to 1)
- 6. Markov Chain Can also include begin, end states Begin state matrix is π In this example, π = (0.6, 0.4) Note that π is also row stochastic
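The order-1 Markov chain described above can be sketched in a few lines of code. This is a minimal simulation of the H/C chain, assuming the A and π values from the diagrams (state 0 = H, state 1 = C; the seed is purely illustrative):

```python
import random

# Transition matrix A (row stochastic) and initial distribution pi,
# from the H/C average annual temperature example: 0 = Hot, 1 = Cold.
A = [[0.7, 0.3],
     [0.4, 0.6]]
pi = [0.6, 0.4]

def simulate(T, seed=0):
    """Sample a length-T state sequence from the order-1 Markov chain."""
    rng = random.Random(seed)
    x = rng.choices([0, 1], weights=pi)[0]    # initial state from pi
    states = [x]
    for _ in range(T - 1):
        x = rng.choices([0, 1], weights=A[x])[0]  # depends only on current state
        states.append(x)
    return states

# "Row stochastic": every row of A, and pi itself, sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)
print(simulate(10))
```

The "memoryless" property is visible in the loop body: the next state is drawn from `A[x]` alone, with no reference to earlier history.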
- 7. Hidden Markov Model HMM includes a Markov chain But the Markov process is "hidden" Cannot observe the Markov process Instead, we observe something related (by probabilities) to the hidden states It's as if there is a "curtain" between the Markov chain and the observations Example on next few slides…
- 8. HMM Example Consider the H/C temperature example Suppose we want to know H or C annual temperature in the distant past Before thermometers (or humans) were invented We just want to decide between H and C We assume transitions between Hot and Cold years were the same as today So, the A matrix is known
- 9. HMM Example Temp in past determined by Markov process But, we cannot observe temperature in the past We find that tree ring size is related to temperature Look at historical data to see the connection We consider 3 tree ring sizes Small, Medium, Large (S, M, L, respectively) Measure tree ring sizes against recorded temperatures to determine the relationship
- 10. HMM Example We find that tree ring sizes and temperatures are related by the B matrix With rows H, C and columns S, M, L: B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]] Note that B is also row stochastic
- 11. HMM Example Can we now find H/C temps in the past? We cannot measure (observe) temps But we can measure tree ring sizes… …and tree ring sizes are related to temps By the B matrix We ought to be able to say something about average annual temperature
- 12. HMM Notation A lot of notation is required Notation may be the most difficult part: T = length of the observation sequence, N = number of states in the model, M = number of observation symbols, Q = {q0, q1, …, qN-1} = states of the Markov process, V = {0, 1, …, M-1} = set of possible observations, A = state transition probabilities, B = observation probability matrix, π = initial state distribution, O = (O0, O1, …, OT-1) = observation sequence
- 13. HMM Notation To simplify notation, observations are taken from the set {0,1,…,M-1} That is, each Ot is in {0,1,…,M-1} The matrix A = {aij} is N x N, where aij = P(state qj at t+1 | state qi at t) The matrix B = {bj(k)} is N x M, where bj(k) = P(observation k at t | state qj at t)
- 14. HMM Example Consider our temperature example… What are the observations? V = {0,1,2}, which corresponds to S,M,L What are the states of the Markov process? Q = {H,C} What are A, B, π, and T? A, B, π are on previous slides T is the number of tree rings measured What are N and M? N = 2 and M = 3
- 15. Generic HMM Generic view of HMM HMM is defined by A, B, and π We denote the HMM "model" as λ = (A,B,π)
- 16. HMM Example Suppose that we observe tree ring sizes For the 4-year period of interest: S,M,S,L Then O = (0, 1, 0, 2) Most likely (hidden) state sequence? We want the most likely X = (x0, x1, x2, x3) Let πx0 be the prob. of starting in state x0 Note bx0(O0) is the prob. of the initial observation And ax0,x1 is the prob. of the transition x0 to x1 And so on…
- 17. HMM Example Bottom line? We can compute P(X) for any X For X = (x0, x1, x2, x3) we have P(X) = πx0 bx0(O0) ax0,x1 bx1(O1) ax1,x2 bx2(O2) ax2,x3 bx3(O3) Suppose we observe (0,1,0,2), then what is the probability of, say, HHCC? Plug into the formula above to find P(HHCC) = .000212
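The formula on this slide is easy to check directly. A sketch, assuming the example's A and π from the earlier slides and B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]] (rows H, C; columns S, M, L), the values used in Stamp's tutorial:

```python
# lambda = (A, B, pi): states 0 = H, 1 = C; observations 0 = S, 1 = M, 2 = L
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]

O = [0, 1, 0, 2]   # observed tree rings: S, M, S, L

def joint_prob(X, O):
    """P(X, O) = pi_{x0} b_{x0}(O0) * product of a_{x(t-1),xt} b_{xt}(Ot)."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]
    return p

H, C = 0, 1
print(joint_prob([H, H, C, C], O))   # P(HHCC, O) = .000212 (approx.)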
- 18. HMM Example Do the same for all 4-state sequences We find… The winner is? CCCH Not so fast, my friend…
- 19. HMM Example The path CCCH scores the highest In dynamic programming (DP), we find the highest scoring path But, HMM maximizes the expected number of correct states Sometimes called the "EM algorithm" For "Expectation Maximization" How does HMM work in this example?
- 20. HMM Example For the first position… Sum probabilities for all paths that have H in the 1st position, compare to the sum of probs for paths with C in the 1st position --- biggest wins Repeat for each position and we find CHCH
- 21. HMM Example So, HMM solution gives us CHCH While DP solution is CCCH Which solution is better? Neither!!! They use different definitions of "best"
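For a 4-year sequence there are only 2^4 = 16 state sequences, so both definitions of "best" can be verified by brute force (same example λ assumed as above, with B taken from Stamp's tutorial):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]

def joint_prob(X):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]
    return p

probs = {X: joint_prob(X) for X in product([0, 1], repeat=len(O))}  # 0=H, 1=C

# DP-style answer: the single highest-scoring path.
best_path = max(probs, key=probs.get)

# HMM-style answer: at each position, keep the state with the larger
# total probability summed over all paths through that state.
marginal = [max([0, 1], key=lambda s: sum(p for X, p in probs.items() if X[t] == s))
            for t in range(len(O))]

def name(X):
    return "".join("HC"[s] for s in X)

print(name(best_path), name(marginal))   # CCCH CHCH
```

The two criteria really do disagree here: the single best path is CCCH, while the position-by-position winners spell CHCH.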
- 22. HMM Paradox? HMM maximizes the expected number of correct states Whereas DP chooses the "best" overall path Possible for HMM to choose a "path" that is impossible Could include a transition of probability 0 Cannot get an impossible path with DP Is this a flaw with HMM? No, it's a feature…
- 23. HMM Model An HMM is defined by the three matrices, A, B, and π Note that M and N are implied, since they are the dimensions of the matrices So, we denote the HMM "model" as λ = (A,B,π)
- 24. The Three Problems HMMs are used to solve 3 problems Problem 1: Given a model λ = (A,B,π) and observation sequence O, find P(O|λ) That is, we can score an observation sequence to see how well it fits a given model Problem 2: Given λ = (A,B,π) and O, find an optimal state sequence Uncover the hidden part (like the previous example) Problem 3: Given O, N, and M, find the model λ that maximizes the probability of O That is, train a model to fit the observations
- 25. HMMs in Practice Typically, HMMs are used as follows: Given an observation sequence… Assume a (hidden) Markov process exists Train a model based on the observations Problem 3 (find N by trial and error) Then, given a sequence of observations, score it versus the model Problem 1: high score implies it's similar to the training data, low score implies it's not
- 26. HMMs in Practice Previous slide gives the sense in which HMM is a "machine learning" technique To train a model, we do not need to specify anything except the parameter N And the "best" N is found by trial and error That is, we don't have to think too much Just train the HMM and then use it Best of all, there are efficient algorithms for HMMs
- 27. The Three Solutions We give detailed solutions to the three problems Note: We must provide efficient solutions Recall the three problems: Problem 1: Score an observation sequence versus a given model Problem 2: Given a model, "uncover" the hidden part Problem 3: Given an observation sequence, train a model
- 28. Solution 1 Score observations versus a given model Given the model λ = (A,B,π) and observation sequence O = (O0, O1, …, OT-1), find P(O|λ) Denote the hidden states as X = (x0, x1, …, xT-1) Then from the definition of B, P(O|X,λ) = bx0(O0) bx1(O1) … bxT-1(OT-1) And from the definition of A and π, P(X|λ) = πx0 ax0,x1 ax1,x2 … axT-2,xT-1
- 29. Solution 1 Elementary conditional probability fact: P(O,X|λ) = P(O|X,λ) P(X|λ) Sum over all possible state sequences X: P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ) = Σ πx0 bx0(O0) ax0,x1 bx1(O1) … axT-2,xT-1 bxT-1(OT-1) This "works" but is way too costly Requires about 2T·N^T multiplications There had better be a better way…
- 30. Forward Algorithm Instead of brute force: the forward algorithm Or "alpha pass" For t = 0,1,…,T-1 and i = 0,1,…,N-1, let αt(i) = P(O0,O1,…,Ot, xt = qi | λ) Probability of the partial observation sequence up to t, with the Markov process in state qi at step t What the? Can be computed recursively, efficiently
- 31. Forward Algorithm Let α0(i) = πi bi(O0) for i = 0,1,…,N-1 For t = 1,2,…,T-1 and i = 0,1,…,N-1, let αt(i) = (Σ αt-1(j) aji) bi(Ot) Where the sum is from j = 0 to N-1 From the definition of αt(i) we see P(O|λ) = Σ αT-1(i) Where the sum is from i = 0 to N-1 Note this requires only about N²T multiplications
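The alpha pass on the last two slides can be written out directly. A sketch on the running example (A and π from the earlier slides; B as in Stamp's tutorial):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

# alpha[t][i] = P(O_0 .. O_t, x_t = q_i | lambda)
alpha = [[0.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]          # base case
for t in range(1, T):
    for i in range(N):                         # O(N^2 T) total work
        alpha[t][i] = sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * B[i][O[t]]

prob = sum(alpha[T - 1])                       # P(O | lambda)
print(prob)
```

The result agrees with summing P(O,X|λ) over all 16 state sequences, but needs only N²T multiplications rather than about 2T·N^T.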
- 32. Solution 2 Given a model, find the "most likely" hidden states: Given λ = (A,B,π) and O, find an optimal state sequence Recall that optimal means "maximize the expected number of correct states" In contrast, DP finds the best scoring path For the temp/tree ring example, we solved this But with a hopelessly inefficient approach A better way: the backward algorithm Or "beta pass"
- 33. Backward Algorithm For t = 0,1,…,T-1 and i = 0,1,…,N-1, let βt(i) = P(Ot+1,Ot+2,…,OT-1 | xt = qi, λ) Probability of the partial observation sequence from t+1 to the end, given that the Markov process is in state qi at step t Analogous to the forward algorithm As with the forward algorithm, this can be computed recursively and efficiently
- 34. Backward Algorithm Let βT-1(i) = 1 for i = 0,1,…,N-1 For t = T-2, T-3, …, 0 and i = 0,1,…,N-1, let βt(i) = Σ aij bj(Ot+1) βt+1(j) Where the sum is from j = 0 to N-1
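The beta pass mirrors the alpha pass but runs right to left. A sketch on the same example λ as before, with a sanity check: P(O|λ) can also be recovered from the betas as Σ πi bi(O0) β0(i).

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

# beta[t][i] = P(O_{t+1} .. O_{T-1} | x_t = q_i, lambda)
beta = [[1.0] * N for _ in range(T)]           # beta_{T-1}(i) = 1
for t in range(T - 2, -1, -1):                 # t = T-2 down to 0
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                         for j in range(N))

prob = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(N))
print(prob)                                    # same P(O|lambda) as the alpha pass
```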
- 35. Solution 2 For t = 0,1,…,T-1 and i = 0,1,…,N-1 define γt(i) = P(xt = qi | O, λ) The most likely state at t is the qi that maximizes γt(i) Note that γt(i) = αt(i) βt(i) / P(O|λ) And recall P(O|λ) = Σ αT-1(i) The bottom line? The forward algorithm solves Problem 1 The forward and backward algorithms together solve Problem 2
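Putting the two passes together gives the gammas and hence the HMM answer to Problem 2. A sketch on the running example; for the observation sequence (0,1,0,2) it recovers the CHCH answer from the earlier slides:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

# Forward pass (alphas) and backward pass (betas).
alpha = [[0.0] * N for _ in range(T)]
beta  = [[1.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):
    for i in range(N):
        alpha[t][i] = sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))

prob = sum(alpha[T - 1])
# gamma[t][i] = P(x_t = q_i | O, lambda) = alpha_t(i) beta_t(i) / P(O|lambda)
gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)] for t in range(T)]

best = "".join("HC"[max(range(N), key=lambda i: gamma[t][i])] for t in range(T))
print(best)   # CHCH
```

Each row of gamma is a probability distribution over states, so picking the argmax per position maximizes the expected number of correct states.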
- 36. Solution 3 Train a model: Given O, N, and M, find the λ that maximizes the probability of O Here, we iteratively adjust λ = (A,B,π) to better fit the given observations O The sizes of the matrices are fixed (N and M) But the elements of the matrices can change It is amazing that this works! And even more amazing that it's efficient
- 37. Solution 3 For t = 0,1,…,T-2 and i,j in {0,1,…,N-1}, define the "di-gammas" as γt(i,j) = P(xt = qi, xt+1 = qj | O, λ) Note γt(i,j) is the prob. of being in state qi at time t and transitioning to state qj at t+1 Then γt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ) And γt(i) = Σ γt(i,j) Where the sum is from j = 0 to N-1
- 38. Model Re-estimation Given the di-gammas and gammas… For i = 0,1,…,N-1 let πi = γ0(i) For i = 0,1,…,N-1 and j = 0,1,…,N-1, aij = Σ γt(i,j) / Σ γt(i) Where both sums are from t = 0 to T-2 For j = 0,1,…,N-1 and k = 0,1,…,M-1, bj(k) = Σ γt(j) / Σ γt(j) Both sums are from t = 0 to T-2, but only t for which Ot = k are counted in the numerator Why does this work?
- 39. Solution 3 To summarize… 1. Initialize λ = (A,B,π) 2. Compute αt(i), βt(i), γt(i,j), γt(i) 3. Re-estimate the model λ = (A,B,π) 4. If P(O|λ) increases, goto 2
- 40. Solution 3 Some fine points… Model initialization If we have a good guess for λ = (A,B,π) then we can use it for initialization If not, let πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M Subject to row stochastic conditions But, do not initialize to exactly uniform values Stopping conditions Stop after some number of iterations and/or… Stop if the increase in P(O|λ) is "small"
- 41. HMM as Discrete Hill Climb The algorithm on the previous slides shows that HMM is a "discrete hill climb" HMM consists of discrete parameters Specifically, the elements of the matrices And the re-estimation process improves the model by modifying the parameters So, the process "climbs" toward an improved model This happens in a high-dimensional space
- 42. Dynamic Programming Brief detour… For λ = (A,B,π) as above, it's easy to define a dynamic program (DP) Executive summary: DP is the forward algorithm, with "sum" replaced by "max" Precise details on next few slides
- 43. Dynamic Programming Let δ0(i) = πi bi(O0) for i = 0,1,…,N-1 For t = 1,2,…,T-1 and i = 0,1,…,N-1 compute δt(i) = max (δt-1(j) aji) bi(Ot) Where the max is over j in {0,1,…,N-1} Note that at each t, the DP computes the best path for each state, up to that point So, the probability of the best path is max δT-1(j) This max gives the best probability Not the best path; for that, see the next slide
- 44. Dynamic Programming To determine the optimal path While computing the deltas, keep track of pointers to the previous state When finished, construct the optimal path by tracing back the pointers For example, consider the temp example: recall that we observe (0,1,0,2) Probabilities for paths of length 1: P(H) = 0.6(0.1) = 0.06 and P(C) = 0.4(0.7) = 0.28 These are the only "paths" of length 1
- 45. Dynamic Programming Probabilities for each path of length 2 Best path of length 2 ending with H is CH, with probability 0.28(0.4)(0.4) = 0.0448 Best path of length 2 ending with C is CC, with probability 0.28(0.6)(0.2) = 0.0336
- 46. Dynamic Program Continuing, we compute the best path ending at H and C at each step And save pointers --- why?
- 47. Dynamic Program Best final score is .002822 And, thanks to the pointers, the best path is CCCH But what about underflow? A serious problem in bigger cases
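The delta recursion with back-pointers, as described on the last few slides, might look like this on the running example (same assumed λ as in the earlier sketches). It reproduces both the .002822 score and the CCCH path:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

delta = [[0.0] * N for _ in range(T)]
back  = [[0] * N for _ in range(T)]     # pointer to best previous state
for i in range(N):
    delta[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):
    for i in range(N):
        j = max(range(N), key=lambda j: delta[t - 1][j] * A[j][i])
        delta[t][i] = delta[t - 1][j] * A[j][i] * B[i][O[t]]
        back[t][i] = j

# Trace the pointers back from the best final state.
state = max(range(N), key=lambda i: delta[T - 1][i])
path = [state]
for t in range(T - 1, 0, -1):
    state = back[t][state]
    path.append(state)
path.reverse()

print("".join("HC"[s] for s in path), max(delta[T - 1]))   # CCCH, score ~ .002822
```

This is the forward algorithm with the sum replaced by a max, plus bookkeeping so the winning path itself can be recovered.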
- 48. Underflow Resistant DP Common trick to prevent underflow Instead of multiplying probabilities… …we add logarithms of probabilities Why does this work? Because log(xy) = log x + log y Adding logs does not tend to 0 Note that we must avoid 0 probabilities
- 49. Underflow Resistant DP Underflow resistant DP algorithm: Let δ0(i) = log(πi bi(O0)) for i = 0,1,…,N-1 For t = 1,2,…,T-1 and i = 0,1,…,N-1 compute δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot))) Where the max is over j in {0,1,…,N-1} And the score of the best path is max δT-1(j) As before, we must also keep track of the paths
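The log-space version of the recursion can be sketched as follows (same assumed λ; path bookkeeping omitted for brevity). Exponentiating the final score recovers the same .002822, confirming the two forms are equivalent:

```python
from math import log, inf, exp

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

# Same DP as before, but sums of logs instead of products of probabilities.
delta = [[-inf] * N for _ in range(T)]
for i in range(N):
    delta[0][i] = log(pi[i] * B[i][O[0]])
for t in range(1, T):
    for i in range(N):
        delta[t][i] = max(delta[t - 1][j] + log(A[j][i]) + log(B[i][O[t]])
                          for j in range(N))

score = max(delta[T - 1])
print(score, exp(score))   # exp(score) matches the unscaled .002822 result
```

Since the per-step quantities are now sums of moderate negative numbers rather than tiny products, the score stays representable even for very long observation sequences (as long as no probability is exactly 0).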
- 50. HMM Scaling Trickier to prevent underflow in HMM We consider solution 3 Since it includes solutions 1 and 2 Recall for t = 1,2,…,T-1 and i = 0,1,…,N-1, αt(i) = (Σ αt-1(j) aj,i) bi(Ot) The idea is to normalize the alphas so that they sum to 1 Algorithm on next slide
- 51. HMM Scaling Given αt(i) = (Σ αt-1(j) aj,i) bi(Ot) Let a0(i) = α0(i) for i = 0,1,…,N-1 Let c0 = 1/Σ a0(j) For i = 0,1,…,N-1, let a0(i) = c0 a0(i) This takes care of the t = 0 case Algorithm continued on next slide…
- 52. HMM Scaling For t = 1,2,…,T-1 do the following: For i = 0,1,…,N-1, at(i) = (Σ at-1(j) aj,i) bi(Ot) Let ct = 1/Σ at(j) For i = 0,1,…,N-1 let at(i) = ct at(i)
- 53. HMM Scaling Easy to show at(i) = c0c1…ct αt(i) (♯) Simple proof by induction So, c0c1…ct is the scaling factor at step t Also, easy to show that at(i) = αt(i)/Σ αt(j) Which implies Σ aT-1(i) = 1 (♯♯)
- 54. HMM Scaling By combining (♯) and (♯♯), we have 1 = Σ aT-1(i) = c0c1…cT-1 Σ αT-1(i) = c0c1…cT-1 P(O|λ) Therefore, P(O|λ) = 1 / (c0c1…cT-1) To avoid underflow, we compute log P(O|λ) = -Σ log(cj) Where the sum is from j = 0 to T-1
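The scaled alpha pass from the last few slides can be sketched directly (same assumed λ). It never stores a raw αt(i), yet -Σ log(cj) recovers exactly log P(O|λ):

```python
from math import log

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

# t = 0 case: compute alpha_0, then scale so it sums to 1.
a = [pi[i] * B[i][O[0]] for i in range(N)]
c = [1.0 / sum(a)]
a = [c[0] * x for x in a]

# t = 1 .. T-1: recurse on the scaled values, rescaling at each step.
for t in range(1, T):
    a_new = [sum(a[j] * A[j][i] for j in range(N)) * B[i][O[t]] for i in range(N)]
    ct = 1.0 / sum(a_new)
    c.append(ct)
    a = [ct * x for x in a_new]

log_prob = -sum(log(ct) for ct in c)   # log P(O|lambda) = -sum log(c_j)
print(log_prob)
```

On this tiny example log_prob equals log(0.0096296), matching the unscaled forward pass; for large T the scaled version is the only one of the two that does not underflow.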
- 55. HMM Scaling Similarly, scale the betas as ct βt(i) For re-estimation, compute γt(i,j) and γt(i) using the original formulas, but with scaled alphas and betas This gives us new values for λ = (A,B,π) An "easy exercise" to show the re-estimate is exact when scaled alphas and betas are used Also, P(O|λ) cancels from the formula Use log P(O|λ) = -Σ log(cj) to decide if an iteration improves
- 56. All Together Now Complete pseudo code for Solution 3 Given: (O0,O1,…,OT-1) and N and M Initialize: λ = (A,B,π), where A is NxN, B is NxM and π is 1xN πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row stochastic, but not uniform Initialize: maxIters = max number of re-estimation steps, iters = 0, oldLogProb = -∞
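The training loop above, minus scaling (fine for this toy length, but scaling is needed for realistic T), might look as follows. The observation sequence and seed here are made up for illustration; the initialization is roughly, but not exactly, uniform, as the slides advise:

```python
import random
from math import log

random.seed(1)
N, M = 2, 3
O = [0, 1, 0, 2, 2, 1, 0, 1]          # made-up toy observation sequence
T = len(O)

def rand_row(n):
    """Roughly uniform row, perturbed so it is not exactly uniform."""
    row = [1.0 + random.uniform(-0.1, 0.1) for _ in range(n)]
    s = sum(row)
    return [x / s for x in row]

pi = rand_row(N)
A  = [rand_row(N) for _ in range(N)]
B  = [rand_row(M) for _ in range(N)]

def passes():
    alpha = [[0.0] * N for _ in range(T)]
    beta  = [[1.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    return alpha, beta, sum(alpha[T - 1])

old = -float("inf")                    # oldLogProb = -infinity
for _ in range(50):                    # maxIters
    alpha, beta, prob = passes()
    if log(prob) <= old + 1e-9:        # stop when the improvement is "small"
        break
    old = log(prob)
    dg = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    g  = [[sum(dg[t][i]) for i in range(N)] for t in range(T - 1)]
    pi = g[0][:]
    A  = [[sum(dg[t][i][j] for t in range(T - 1)) /
           sum(g[t][i] for t in range(T - 1)) for j in range(N)] for i in range(N)]
    B  = [[sum(g[t][j] for t in range(T - 1) if O[t] == k) /
           sum(g[t][j] for t in range(T - 1)) for k in range(M)] for j in range(N)]

print(old)   # final log P(O|lambda)
```

Each pass re-estimates all of π, A, and B from expected counts, so the loop is exactly the initialize/compute/re-estimate/repeat cycle summarized on slide 39.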
- 57. Forward Algorithm Forward algorithm With scaling
- 58. Backward Algorithm Backward algorithm or "beta pass" With scaling Note: same scaling factors as for the alphas
- 59. Gammas Using scaled alphas and betas So the formulas are unchanged
- 60. Re-Estimation Again, using scaled gammas So the formulas are unchanged
- 61. Stopping Criteria Check that the probability increases In practice, want logProb > oldLogProb + ε And don't exceed the max number of iterations
- 62. English Text Example Suppose a Martian arrives on earth Sees written English text Wants to learn something about it Martians know about HMMs So, strip out all non-letters, make all letters lower-case 27 symbols (26 letters, plus word-space) Train HMM on a long sequence of symbols
- 63. English Text For the first training case, initialize: N = 2 and M = 27 Elements of A and π are each about ½ Elements of B are each about 1/27 We use 50,000 symbols for training After the 1st iteration: log P(O|λ) ≈ -165097 After the 100th iteration: log P(O|λ) ≈ -137305
- 64. English Text Matrices A and π converge: What does this tell us? Started in hidden state 1 (not state 0) And we know the transition probabilities between hidden states Nothing too interesting here We don't care about the hidden states
- 65. English Text What about the B matrix? This is much more interesting… Why???
- 66. A Security Application Suppose we want to detect metamorphic computer viruses Such viruses vary their internal structure But the function of the malware stays the same If sufficiently variable, standard signature detection will fail Can we use HMM for detection? What to use as the observation sequence? Is there really a "hidden" Markov process? What about N, M, and T? How many Os needed for training, scoring?
- 67. HMM for Metamorphic Detection Split a set of "family" viruses into 2 subsets Extract opcodes from each virus Append opcodes from subset 1 to make one long sequence Train HMM on the opcode sequence (problem 3) Obtain a model λ = (A,B,π) Set threshold: score opcodes from files in subset 2 and "normal" files (problem 1) Can you set a threshold that separates the sets? If so, we may have a viable detection method
- 68. HMM for Metamorphic Detection Virus detection results from a recent paper Note the separation This is good!
- 69. HMM Generalizations Here, we assumed a Markov process of order 1 Current state depends only on the previous state and the transition matrix A Can use a higher order Markov process Current state depends on n previous states Higher order vs larger N? "Depth" vs "width" Can have A and B matrices depend on t HMM often combined with other techniques (e.g., neural nets)
- 70. Generalizations In some cases, a limitation of HMM is that position information is not used In many applications this is OK/desirable In some apps, this is a serious problem Bioinformatics applications DNA sequencing, protein alignment, etc. Sequence alignment is crucial They use "profile HMMs" instead of standard HMMs
- 71. References A revealing introduction to hidden Markov models, by M. Stamp, http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner, http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
- 72. References Hunting for metamorphic engines, W. Wong and M. Stamp, Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229 Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp, Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214
