# Markov Models

A presentation on Markov Chain, HMM, Markov Random Fields with the needed algorithms and detailed explanations.

Published in: Education

### Markov Models

1. 1. PATTERN RECOGNITION Markov models Vu PHAM phvu@fit.hcmus.edu.vn Department of Computer Science March 28th, 2011 28/03/2011 Markov models 1
2. 2. Contents• Introduction – Introduction – Motivation• Markov Chain• Hidden Markov Models• Markov Random Field 28/03/2011 Markov models 2
3. 3. Introduction• Markov processes were first proposed by the Russian mathematician Andrei Markov – He used these processes to investigate Pushkin’s poetry.• Nowadays, the Markov property and HMMs are widely used in many domains: – Natural Language Processing – Speech Recognition – Bioinformatics – Image/video processing – ... 28/03/2011 Markov models 3
4. 4. Motivation [0]• As shown in his 1906 paper, Markov’s original motivation was purely mathematical: – Applying the Weak Law of Large Numbers to dependent random variables.• However, we shall not follow this motivation... 28/03/2011 Markov models 4
5. 5. Motivation [1]• From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i 28/03/2011 Markov models 5
6. 6. Motivation [1]• From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i • Classes are independent. • Feature vectors are independent. 28/03/2011 Markov models 6
7. 7. Motivation [1]• From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i – However, there are some applications where the classes are closely related: • POS Tagging, Tracking, Gene boundary recovery... s1 s2 s3 ... sm ... 28/03/2011 Markov models 7
8. 8. Motivation [1]• Context-dependent classification: s1 s2 s3 ... sm ... – s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k. 28/03/2011 Markov models 8
9. 9. Motivation [1]• Context-dependent classification: s1 s2 s3 ... sm ... – s1, s2, ..., sm: sequence of m feature vectors – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k.• To apply the Bayes classifier: – X = s1s2...sm: extended feature vector – Ωi = ωi1, ωi2,..., ωiN : a classification; Nm possible classifications p ( Ωi | X ) > p ( Ω j | X ) ∀j ≠ i p ( X | Ωi ) p ( Ωi ) > p ( X | Ω j ) p ( Ω j ) ∀j ≠ i 28/03/2011 Markov models 9
11. 11. Motivation [2]• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables 28/03/2011 Markov models 11
12. 12. Motivation [2]• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm 28/03/2011 Markov models 12
13. 13. Motivation [2]• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm• What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)? 28/03/2011 Markov models 13
14. 14. Motivation [2]• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm• What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)? p(s1s2... sm-1 sm) p(sm|s1s2...sm-1) = p(s1s2... sm-1) 28/03/2011 Markov models 14
15. 15. Contents• Introduction• Markov Chain• Hidden Markov Models• Markov Random Field 28/03/2011 Markov models 15
16. 16. Markov Chain• Has N states, called s1, s2, ..., sN• There are discrete timesteps, t=0, s2 t=1,... s1• On the t’th timestep the system is in exactly one of the available states. s3 Call it qt ∈ {s1 , s2 ,..., sN } Current state N=3 t=0 q t = q 0 = s3 28/03/2011 Markov models 16
17. 17. Markov Chain• Has N states, called s1, s2, ..., sN• There are discrete timesteps, t=0, s2 t=1,... s1• On the t’th timestep the system is in Current state exactly one of the available states. s3 Call it qt ∈ {s1 , s2 ,..., sN }• Between each timestep, the next state is chosen randomly. N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 17
18. 18. p ( s1 ˚ s2 ) = 1 2Markov Chain p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0• Has N states, called s1, s2, ..., sN• There are discrete timesteps, t=0, s2 t=1,... s1• On the t’th timestep the system is in exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 s3 Call it qt ∈ {s1 , s2 ,..., sN } p ( s2 ˚ s1 ) = 0• Between each timestep, the next p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 state is chosen randomly. p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0• The current state determines the probability for the next state. N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 18
19. 19. p ( s1 ˚ s2 ) = 1 2Markov Chain p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0• Has N states, called s1, s2, ..., sN 1/2• There are discrete timesteps, t=0, s2 1/2 t=1,... s1 2/3• On the t’th timestep the system is in 1/3 1 exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 s3 Call it qt ∈ {s1 , s2 ,..., sN } p ( s2 ˚ s1 ) = 0• Between each timestep, the next p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 state is chosen randomly. p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0• The current state determines the probability for the next state. N=3 – Often notated with arcs between states t=1 q t = q 1 = s2 28/03/2011 Markov models 19
20. 20. p ( s1 | s2 ) = 1/2 Markov Property p ( s2 | s2 ) = 1/2 p ( s3 | s2 ) = 0• qt+1 is conditionally independent of 1/2 {qt-1, qt-2,..., q0} given qt. s2 1/2• In other words: s1 2/3 p ( qt +1 | qt , qt −1 ,..., q0 ) 1/3 1 = p ( qt +1 | qt ) p ( qt +1 = s1 | qt = s1 ) = 0 s3 p ( s2 | s1 ) = 0 p ( s3 | s1 ) = 1 p ( s1 | s3 ) = 1/3 p ( s2 | s3 ) = 2/3 p ( s3 | s3 ) = 0 N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 20
21. 21. p ( s1 ˚ s2 ) = 1 2Markov Property p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0• qt+1 is conditionally independent of 1/2 {qt-1, qt-2,..., q0} given qt. s2 1/2• In other words: s1 2/3 p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/3 1 = p ( qt +1 ˚ qt ) p ( qt +1 = s1 ˚ qt = s1 ) = 0 s3 The state at timestep t+1 depends p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 only on the state at timestep t p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 21
22. 22. p ( s1 ˚ s2 ) = 1 2Markov Property p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0• qt+1 is conditionally independent of 1/2 {qt-1, qt-2,..., q0} given qt. s2 1/2• In other words: s1 2/3 p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/3 1 = p ( qt +1 ˚ qt ) p ( qt +1 = s1 ˚ qt = s1 ) = 0 s3 The state at timestep t+1 depends p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 only on the state at timestep t p ( s2 ˚ s3 ) = 2 3 A Markov chain of order m (m finite): the state at p ( s3 ˚ s3 ) = 0 timestep t+1 depends on the past m states: N=3 t=1 p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt , qt −1 ,..., qt − m +1 ) q t = q 1 = s2 28/03/2011 Markov models 22
23. 23. p ( s1 ˚ s2 ) = 1 2Markov Property p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0• qt+1 is conditionally independent of 1/2 {qt-1, qt-2,..., q0} given qt. s2 1/2• In other words: s1 2/3 p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/3 1 = p ( qt +1 ˚ qt ) p ( qt +1 = s1 ˚ qt = s1 ) = 0 s3 The state at timestep t+1 depends p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 only on the state at timestep t p ( s2 ˚ s3 ) = 2 3• How to represent the joint p ( s3 ˚ s3 ) = 0 distribution of (q0, q1, q2...) using N=3 graphical models? t=1 q t = q 1 = s2 28/03/2011 Markov models 23
24. 24. p ( s1 ˚ s2 ) = 1 2Markov Property p ( s2 ˚ s2 ) = 1 2 q0p ( s 3 ˚ s2 ) = 0• qt+1 is conditionally independent of 1/2 {qt-1, qt-2,..., q0} given qt. s2 1/2• In other words: q1 s1 1/3 p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/3 1 = p ( qt +1 ˚ qt ) p ( qt +1 = s1 ˚ qt = s1 ) = 0 q2 s3 The state at timestep t+1 depends p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 p ( s1 ˚ s3 ) = 1 3 only on the state at timestep t• How to represent the joint q3 p ( s 2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 distribution of (q0, q1, q2...) using N=3 graphical models? t=1 q t = q 1 = s2 28/03/2011 Markov models 24
25. 25. Markov chain• So, the chain of {qt} is called Markov chain q0 q1 q2 q3 28/03/2011 Markov models 25
26. 26. Markov chain• So, the chain {qt} is called a Markov chain q0 q1 q2 q3• Each qt takes values from the countable state-space {s1, s2, s3...}• Each qt is observed at a discrete timestep t• {qt} satisfies the Markov property: p ( qt +1 | qt , qt −1 ,..., q0 ) = p ( qt +1 | qt ) 28/03/2011 Markov models 26
27. 27. Markov chain• So, the chain {qt} is called a Markov chain q0 q1 q2 q3• Each qt takes values from the countable state-space {s1, s2, s3...}• Each qt is observed at a discrete timestep t• {qt} satisfies the Markov property: p ( qt +1 | qt , qt −1 ,..., q0 ) = p ( qt +1 | qt )• The transition from qt to qt+1 is governed by the transition probability matrix 1/2 s1 s2 s3 s2 s1 0 0 1 1/2 s1 s2 ½ ½ 0 2/3 1 1/3 s3 1/3 2/3 0 28/03/2011 s3 Markov models Transition probabilities 27
29. 29. Markov Chain – Important property• In a Markov chain, the joint distribution is m p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 ) j =1 28/03/2011 Markov models 29
30. 30. Markov Chain – Important property• In a Markov chain, the joint distribution is m p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 ) j =1• Why? m p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 , previous states ) j =1 m = p ( q0 ) ∏ p ( q j | q j −1 ) j =1 Due to the Markov property 28/03/2011 Markov models 30
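The factorization above is easy to verify numerically. Below is a minimal sketch (not from the slides) that scores a state path under the three-state chain shown in the earlier diagrams; the uniform initial distribution p(q0) is our assumption for illustration, since the slides do not give one.

```python
# Joint probability of a state path under a first-order Markov chain:
# p(q0, ..., qm) = p(q0) * prod_j p(q_j | q_{j-1}).
P = [
    [0.0, 0.0, 1.0],   # from s1: always to s3
    [0.5, 0.5, 0.0],   # from s2: s1 or s2, 1/2 each
    [1/3, 2/3, 0.0],   # from s3: s1 w.p. 1/3, s2 w.p. 2/3
]
init = [1/3, 1/3, 1/3]  # assumed uniform p(q0), for illustration only

def path_prob(path):
    """path: list of 0-based state indices [q0, q1, ..., qm]."""
    p = init[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= P[prev][cur]
    return p

print(path_prob([0, 2, 1, 1]))  # s1 -> s3 -> s2 -> s2: 1/3 * 1 * 2/3 * 1/2
```

Thanks to the Markov property, the cost of scoring a path is linear in its length, instead of requiring the full joint table.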
31. 31. Markov Chain: e.g.• The state-space of weather: rain wind cloud 28/03/2011 Markov models 31
32. 32. Markov Chain: e.g.• The state-space of weather: 1/2 Rain Cloud Wind rain wind Rain ½ 0 ½ 2/3 Cloud 1/3 0 2/3 1/2 1/3 1 cloud Wind 0 1 0 28/03/2011 Markov models 32
33. 33. Markov Chain: e.g.• The state-space of weather: 1/2 Rain Cloud Wind rain wind Rain ½ 0 ½ 2/3 Cloud 1/3 0 2/3 1/2 1/3 1 cloud Wind 0 1 0• Markov assumption: the weather on the t+1’th day depends only on the t’th day. 28/03/2011 Markov models 33
34. 34. Markov Chain: e.g.• The state-space of weather: 1/2 Rain Cloud Wind rain wind Rain ½ 0 ½ 2/3 Cloud 1/3 0 2/3 1/2 1/3 1 cloud Wind 0 1 0• Markov assumption: the weather on the t+1’th day depends only on the t’th day.• We have observed the weather in a week: rain wind cloud rain wind Day: 0 1 2 3 4 28/03/2011 Markov models 34
35. 35. Markov Chain: e.g.• The state-space of weather: 1/2 Rain Cloud Wind rain wind Rain ½ 0 ½ 2/3 Cloud 1/3 0 2/3 1/2 1/3 1 cloud Wind 0 1 0• Markov assumption: the weather on the t+1’th day depends only on the t’th day.• We have observed the weather in a week: Markov Chain rain wind cloud rain wind Day: 0 1 2 3 4 28/03/2011 Markov models 35
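As a quick check of the weather example, the probability of the observed week can be computed directly from the transition table. A sketch: the slide gives no initial distribution, so we condition on day 0 being rain.

```python
# Weather transition matrix from the slide: P[prev][cur] = p(cur | prev).
P = {
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

week = ["rain", "wind", "cloud", "rain", "wind"]  # the observed days 0..4
p = 1.0
for prev, cur in zip(week, week[1:]):
    p *= P[prev][cur]
print(p)  # 0.5 * 1.0 * (1/3) * 0.5 = 1/12
```

Note that wind deterministically leads to cloud under this matrix, which is why the middle factor is exactly 1.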
36. 36. Contents• Introduction• Markov Chain• Hidden Markov Models – Independent assumptions – Formal definition – Forward algorithm – Viterbi algorithm – Baum-Welch algorithm• Markov Random Field 28/03/2011 Markov models 36
37. 37. Modeling pairs of sequences• In many applications, we have to model pair of sequences• Examples: – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...) – Speech recognition (map acoustic sequences to sequences of words) – Computational biology (recover gene boundaries in DNA sequences) – Video tracking (estimate the underlying model states from the observation sequences) – And many others... 28/03/2011 Markov models 37
38. 38. Probabilistic models for sequence pairs• We have two sequences of random variables: X1, X2, ..., Xm and S1, S2, ..., Sm• Intuitively, in a practical system, each Xi corresponds to an observation and each Si corresponds to a state that generated the observation.• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}• How do we model the joint distribution: p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm ) 28/03/2011 Markov models 38
39. 39. Hidden Markov Models (HMMs)• In HMMs, we assume that p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., Sm = sm ) m m = p ( S1 = s1 ) ∏ p ( S j = s j | S j −1 = s j −1 ) ∏ p ( X j = x j | S j = s j ) j =2 j =1• These are often called the independence assumptions in HMMs• We will derive them in the next slides 28/03/2011 Markov models 39
40. 40. Independence Assumptions in HMMs [1] p ( ABC ) = p ( A | BC ) p ( BC ) = p ( A | BC ) p ( B | C ) p ( C )• By the chain rule, the following equality is exact: p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ,..., S m = sm ) × p ( X 1 = x1 ,..., X m = xm | S1 = s1 ,..., S m = sm )• Assumption 1: the state sequence forms a Markov chain m p ( S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ) ∏ p ( S j = s j | S j −1 = s j −1 ) j =2 28/03/2011 Markov models 40
41. 41. Independence Assumptions in HMMs [2]• By the chain rule, the following equality is exact: p ( X 1 = x1 ,..., X m = xm | S1 = s1 ,..., S m = sm ) m = ∏ p ( X j = x j | S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 ) j =1• Assumption 2: each observation depends only on the underlying state p ( X j = x j | S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 ) = p( X j = xj | S j = sj )• These two assumptions are often called the independence assumptions in HMMs 28/03/2011 Markov models 41
42. 42. The Model form for HMMs• The model takes the following form: m m p ( x1 ,.., xm , s1 ,..., sm ;θ ) = π ( s1 ) ∏ t ( s j | s j −1 ) ∏ e ( x j | s j ) j =2 j =1• Parameters in the model: – Initial probabilities π ( s ) for s ∈ {1, 2,..., k } – Transition probabilities t ( s | s′ ) for s, s′ ∈ {1, 2,..., k } – Emission probabilities e ( x | s ) for s ∈ {1, 2,..., k } and x ∈ {1, 2,.., o} 28/03/2011 Markov models 42
43. 43. 6 components of HMMs start• Discrete timesteps: 1, 2, ...• Finite state space: {si} (N states) π1 π2 π3• Events {xi} (M events) t31 t11 t12 t23 π• Vector of initial probabilities {πi} s1 s2 s3 t21 t32 Π = {πi } = { p(q1 = si) }• Matrix of transition probabilities e13 e11 e23 e33 e31 T = {Tij} = { p(qt+1=sj|qt=si) } e22• Matrix of emission probabilities x1 x2 x3 E = {Eij} = { p(ot=xj|qt=si) } The observations at continuous timesteps form an observation sequence {o1, o2, ..., ot}, where oi ∈ {x1, x2, ..., xM} 28/03/2011 Markov models 43
44. 44. 6 components of HMMs start• Discrete timesteps: 1, 2, ...• Finite state space: {si} (N states) π1 π2 π3• Events {xi} (M events) t31 t11 t12 t23 π• Vector of initial probabilities {πi} s1 s2 s3 t21 t32 Π = {πi } = { p(q1 = si) }• Matrix of transition probabilities e13 e11 e23 e33 e31 T = {Tij} = { p(qt+1=sj|qt=si) } e22• Matrix of emission probabilities x1 x2 x3 E = {Eij} = { p(ot=xj|qt=si) } Constraints: The observations at continuous timesteps form an observation sequence N N M ∑ πi = 1 ∑ ∑ {o1, o2, ..., ot}, where oi ∈ {x1Tij 2=..., xM} Eij = 1 i =1 j =1 ,x , 1 j =1 28/03/2011 Markov models 44
45. 45. 6 components of HMMs start• Given a specific HMM and an observation sequence, the π1 π2 π3 corresponding sequence of states t31 t11 is generally not deterministic t12 t23• Example: s1 t21 s2 t32 s3 Given the observation sequence: e13 e11 e23 e33 {x1, x3, x3, x2} e31 e22 The corresponding states can be any of following sequences: x1 x2 x3 {s1, s2, s1, s2} {s1, s2, s3, s2} {s1, s1, s1, s2} ... 28/03/2011 Markov models 45
46. 46. Here’s an HMM 0.2 0.5 0.5 0.6 s1 0.4 s2 0.8 s3 0.3 0.7 0.9 0.8 0.2 0.1 x1 x2 x3 T s1 s2 s3 E x1 x2 x3 π s1 s2 s3 s1 0.5 0.5 0 s1 0.3 0 0.7 0.3 0.3 0.4 s2 0.4 0 0.6 s2 0 0.1 0.9 s3 0.2 0.8 0 s3 0.2 0 0.828/03/2011 Markov models 46
47. 47. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 0.3 - 0.3 - 0.4π s1 s2 s3 randomply choice between S1, S2, S3 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 o1s2 0.4 0 0.6 s2 0 0.1 0.9 q2 o2s3 0.2 0.8 0 s3 0.2 0 0.8 q3 o3 28/03/2011 Markov models 47
48. 48. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 0.2 - 0.8π s1 s2 s3 choice between X1 and X3 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1s2 0.4 0 0.6 s2 0 0.1 0.9 q2 o2s3 0.2 0.8 0 s3 0.2 0 0.8 q3 o3 28/03/2011 Markov models 48
49. 49. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 Go to S2 withπ s1 s2 s3 probability 0.8 or S1 with prob. 0.2 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1 X3s2 0.4 0 0.6 s2 0 0.1 0.9 q2 o2s3 0.2 0.8 0 s3 0.2 0 0.8 q3 o3 28/03/2011 Markov models 49
50. 50. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 0.3 - 0.7π s1 s2 s3 choice between X1 and X3 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1 X3s2 0.4 0 0.6 s2 0 0.1 0.9 q2 S1 o2s3 0.2 0.8 0 s3 0.2 0 0.8 q3 o3 28/03/2011 Markov models 50
51. 51. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 Go to S2 withπ s1 s2 s3 probability 0.5 or S1 with prob. 0.5 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1 X3s2 0.4 0 0.6 s2 0 0.1 0.9 q2 S1 o2 X1s3 0.2 0.8 0 s3 0.2 0 0.8 q3 o3 28/03/2011 Markov models 51
52. 52. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 0.3 - 0.7π s1 s2 s3 choice between X1 and X3 0.3 0.3 0.4T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1 X3s2 0.4 0 0.6 s2 0 0.1 0.9 q2 S1 o2 X1s3 0.2 0.8 0 s3 0.2 0 0.8 q3 S1 o3 28/03/2011 Markov models 52
53. 53. Here’s a HMM 0.20.5 • Start randomly in state 1, 2 0.5 0.6 s1 s2 s3 or 3. 0.4 0.8 • Choose a output at each 0.3 0.7 0.9 state in random. 0.2 0.8 0.1 • Let’s generate a sequence of observations: x1 x2 x3 We got a sequence of states andπ s1 s2 s3 corresponding 0.3 0.3 0.4 observations!T s1 s2 s3 E x1 x2 x3s1 0.5 0.5 0 s1 0.3 0 0.7 q1 S3 o1 X3s2 0.4 0 0.6 s2 0 0.1 0.9 q2 S1 o2 X1s3 0.2 0.8 0 s3 0.2 0 0.8 q3 S1 o3 X3 28/03/2011 Markov models 53
54. 54. Three famous HMM tasks• Given an HMM Φ = (T, E, π). Three famous HMM tasks are:• Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O)• Most likely explanation (inference) – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)• Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 54
55. 55. Three famous HMM tasks• Given an HMM Φ = (T, E, π). Three famous HMM tasks are:• Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) Calculating the probability of• Most likely explanation (inference) observing the sequence O over all possible state sequences. – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)• Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 55
56. 56. Three famous HMM tasks• Given an HMM Φ = (T, E, π). Three famous HMM tasks are:• Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) Calculating the best• Most likely explanation (inference) corresponding state sequence, given an observation – Given: Φ, the observation O = {o1, o2,..., ot} sequence. – Goal: Q* = argmaxQ p(Q|O)• Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 56
57. 57. Three famous HMM tasks• Given an HMM Φ = (T, E, π). Three famous HMM tasks are:• Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} Given an (or a set of) – Goal: p(O|Φ), or equivalently p(st = Si|O) observation sequence and• Most likely explanation (inference) corresponding state sequence, – Given: Φ, the observation O = {o1, o2,..., ot} estimate the Transition matrix, – Goal: Q* = argmaxQ p(Q|O) Emission matrix and initial probabilities of the HMM• Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 57
58. 58. Three famous HMM tasks Problem Algorithm Complexity State estimation Forward O(TN2) Calculating: p(O|Φ) Inference Viterbi decoding O(TN2) Calculating: Q*= argmaxQp(Q|O) Learning Baum-Welch (EM) O(TN2) Calculating: Φ* = argmaxΦp(O|Φ) T: number of timesteps N: number of states28/03/2011 Markov models 58
59. 59. State estimation problem• Given: Φ = (T, E, π), observation O = {o1, o2,..., ot}• Goal: What is p(o1o2...ot) ?• We can do this in a slow, stupid way – As shown in the next slide... 28/03/2011 Markov models 59
60. 60. Here’s a HMM0.5 0.2 0.5 0.6 • What is p(O) = p(o1o2o3) s1 0.4 s2 0.8 s3 = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? 0.3 0.7 0.9 • Slow, stupid way: 0.2 0.8 0.1 p (O ) = ∑ p ( OQ ) x1 x2 x3 Q∈paths of length 3 = ∑ Q∈paths of length 3 Q∈ p (O | Q ) p (Q ) • How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q? 28/03/2011 Markov models 60
61. 61. Here’s a HMM0.5 0.2 0.5 0.6 • What is p(O) = p(o1o2o3) s1 0.4 s2 0.8 s3 = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? 0.3 0.7 0.9 • Slow, stupid way: 0.2 0.8 0.1 p (O ) = ∑ p ( OQ ) x1 x2 x3 Q∈paths of length 3 π s1 s2 s3 = ∑ Q∈paths of length 3 Q∈ p (O | Q ) p (Q ) 0.3 0.3 0.4 p(Q) = p(q1q2q3) • How to compute p(Q) for an = p(q1)p(q2|q1)p(q3|q2,q1) (chain) arbitrary path Q? = p(q1)p(q2|q1)p(q3|q2) (why?) • How to compute p(O|Q) for an arbitrary path Q? Example in the case Q=S3S1S1 P(Q) = 0.4 * 0.2 * 0.5 = 0.04 28/03/2011 Markov models 61
62. 62. Here’s a HMM0.5 0.2 0.5 0.6 • What is p(O) = p(o1o2o3) s1 0.4 s2 0.8 s3 = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? 0.3 0.7 0.9 • Slow, stupid way: 0.2 0.8 0.1 p (O ) = ∑ p ( OQ ) x1 x2 x3 Q∈paths of length 3 π s1 s2 s3 = ∑ Q∈paths of length 3 Q∈ p (O | Q ) p (Q ) 0.3 0.3 0.4 p(O|Q) = p(o1o2o3|q1q2q3) • How to compute p(Q) for an = p(o1|q1)p(o2|q1)p(o3|q3) (why?) arbitrary path Q? • How to compute p(O|Q) for an Example in the case Q=S3S1S1 arbitrary path Q? P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1) =0.8 * 0.3 * 0.7 = 0.168 28/03/2011 Markov models 62
63. 63. Here’s a HMM0.5 0.2 0.5 0.6 • What is p(O) = p(o1o2o3) s1 0.4 s2 0.8 s3 = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? 0.3 0.7 0.9 • Slow, stupid way: 0.2 0.8 0.1 p (O ) = ∑ p ( OQ ) x1 x2 x3 Q∈paths of length 3 π s1 s2 s3 = ∑ Q∈paths of length 3 Q∈ p (O | Q ) p (Q ) 0.3 0.3 0.4 p(O|Q) = p(o1o2o3|q1q2q3) • How to compute p(Q) for an p(O) needs 27 p(Q) arbitrary path Q? = p(o1|q1)p(o2|q1)p(o3|q3) (why?) computations and 27 • How to compute p(O|Q) for an p(O|Q) computations. Example in the case Q=S3S1S1 arbitrary path Q? P(O|Q) = p(X3|S3)p(Xsequence3has ) What if the 1|S1) p(X |S1 20 observations? =0.8 * 0.3 * 0.7 = 0.168 So let’s be smarter... 28/03/2011 Markov models 63
64. 64. The Forward algorithm• Given observation o1o2...oT• Forward probabilities: αt(i) = p(o1o2...ot ∧ qt = si | Φ) where 1 ≤ t ≤ T αt(i) = probability that, in a random trial: – We’d have seen the first t observations – We’d have ended up in si as the t’th state visited.• In our example, what is α2(3) ? 28/03/2011 Markov models 64
65. 65. αt(i): easy to define recursively α t ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ ) Π = {π i } = { p ( q1 = si )} α1 ( i ) = p ( o1 ∧ q1 = si ) = p ( q1 = si ) p ( o1 | q1 = si ) { T = {Tij } = p ( qt +1 = s j | qt = si ) } = π i Ei ( o1 ) E = {E } = { p ( o = x ij t j | q = s )} t iα t +1 ( i ) = p ( o1o2 ...ot +1 ∧ qt +1 = si ) N = ∑ p ( o1o2 ...ot ∧ qt = s j ∧ ot +1 ∧ qt +1 = si ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = si | o1o2 ...ot ∧ qt = s j ) p ( o1o2 ...ot ∧ qt = s j ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = si | qt = s j ) α t ( j ) j =1 N = ∑ p ( ot +1 | qt +1 = si ) p ( qt +1 = si | qt = s j ) α t ( j ) j =1 N = ∑T ji Ei ( ot +1 ) α t ( j ) j =1 28/03/2011 Markov models 65
66. 66. In our example 0.5 0.2 αt ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ ) s1 0.5 s2 0.6 s3 0.4 0.8 α1 ( i ) = Ei ( o1 ) π i 0.3 0.7 0.9αt +1 ( i ) = ∑Tji Ei ( ot +1 ) αt ( j ) = Ei ( ot +1 ) ∑Tjiαt ( j ) 0.2 0.1 0.8 j j x1 x2 x3 π s1 s2 s3 0.3 0.3 0.4 We observed: x1x2 α1(1) = 0.3 * 0.3 = 0.09 α2(1) = 0 * (0.09*0 .5+ 0*0.4 + 0.08*0.2) = 0 α1(2) = 0 α2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109 α1(3) = 0.2 * 0.4 = 0.08 α2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0 28/03/2011 Markov models 66
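The recursion α₁(i) = E_i(o₁)π_i, α_{t+1}(i) = E_i(o_{t+1}) Σ_j T_ji α_t(j) translates directly into code. A sketch that reproduces the numbers on this slide for the observed sequence x1 x2 (0-based indices; `forward` is our function name):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def forward(pi, T, E, obs):
    """alpha[t][i] = p(o_1 .. o_{t+1} and state i at that step), 0-based t."""
    N = len(pi)
    alpha = [[pi[i] * E[i][obs[0]] for i in range(N)]]       # base case
    for o in obs[1:]:                                        # recursive case
        prev = alpha[-1]
        alpha.append([E[i][o] * sum(T[j][i] * prev[j] for j in range(N))
                      for i in range(N)])
    return alpha

alpha = forward(pi, T, E, [0, 1])   # we observed x1 x2
print(alpha[0])                     # ≈ [0.09, 0.0, 0.08], as on the slide
print(alpha[1][1])                  # ≈ 0.0109, the slide's alpha_2(2)
```

Each timestep only needs the previous column of the trellis, which is where the O(TN²) cost on the later slides comes from.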
67. 67. Forward probabilities - Trellis Ns4s3s2s1 1 2 3 4 5 6 T 28/03/2011 Markov models 67
68. 68. Forward probabilities - Trellis N α1 (4)s4 α1 (3) α2 (3) α6 (3)s3 α1 (2) α3 (2) α5 (2)s2 α1 (1) α4 (1)s1 1 2 3 4 5 6 T 28/03/2011 Markov models 68
69. 69. Forward probabilities - Trellis N α1 ( i ) = Ei ( o1 ) π i α1 (4)s4 α1 (3) α2 (3)s3 α1 (2)s2 α1 (1)s1 1 2 3 4 5 6 T 28/03/2011 Markov models 69
70. 70. Forward probabilities - Trellis N αt +1 ( i ) = Ei ( ot +1 ) ∑Tjiαt ( j ) α1 (4) js4 α1 (3) α2 (3)s3 α1 (2)s2 α1 (1)s1 1 2 3 4 5 6 T 28/03/2011 Markov models 70
71. 71. Forward probabilities• So, we can cheaply compute: αt ( i ) = p ( o1o2 ...ot ∧ qt = si )• How can we cheaply compute: p ( o1 o 2 ...o t )• How can we cheaply compute: p ( q t = s i | o1 o 2 ...o t ) 28/03/2011 Markov models 71
72. 72. Forward probabilities• So, we can cheaply compute: αt ( i ) = p ( o1o2 ...ot ∧ qt = si )• How can we cheaply compute: p ( o1 o 2 ...o t ) = ∑ α (i ) i t• How can we cheaply compute: αt ( i ) p ( q t = s i | o1 o 2 ...o t ) = ∑α t ( j ) j Look back the trellis... 28/03/2011 Markov models 72
73. 73. State estimation problem• State estimation is solved: N p ( O | Φ ) = p ( o1o2 … ot ) = ∑ α t ( i ) i =1• Can we utilize the elegant trellis to solve the Inference problem? – Given an observation sequence O, find the best state sequence Q Q = arg max p ( Q | O ) * Q 28/03/2011 Markov models 73
74. 74. Inference problem• Given: Φ = (T, E, π), observation O = {o1, o2,..., ot}• Goal: Find Q * = arg max p ( Q | O ) Q = arg max p ( q1q2 … qt | o1o2 … ot ) q1q2 … qt• Practical problems: – Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance? – Video tracking s1 s2 s3 – POS Tagging 28/03/2011 x1 Markov models x2 x3 74
75. 75. Inference problem• We can do this in a slow, stupid way: Q * = arg max p ( Q | O ) Q p (O | Q ) p (Q ) = arg max Q p (O ) = arg max p ( O | Q ) p ( Q ) Q = arg max p ( o1o2 … ot | Q ) p ( Q ) Q• But it’s better if we can find another way to compute the most probable path (MPP)... 28/03/2011 Markov models 75
76. 76. Efficient MPP computation• We are going to compute the following variables: δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot ) q1q2 …qt −1• δt(i) is the probability of the best path of length t-1 which ends up in si and emits o1...ot.• Define: mppt(i) = that path so: δt(i) = p(mppt(i)) 28/03/2011 Markov models 76
77. 77. Viterbi algorithm δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 … ot ) q1q2 …qt −1mppt ( i ) = arg max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot ) q1q2 …qt −1 δ1 ( i ) = max p ( q1 = si ∧ o1 ) one choice = π i Ei ( o1 ) = α1 ( i ) N δ (4) 1 s4 δ 1 (3) s3 δ 2 (3) δ 1 (2) s2 s1 δ 1 (1) 1 2 3 4 5 6 T 28/03/2011 Markov models 77
78. 78. Viterbi algorithm time t time t + 1 • The most probable path whose last two states are si sj is the most probable path to si, s1 followed by ... the transition si sj. si sj • The prob of that path will be: ... ... δt(i) × p(si sj ∧ ot+1) = δt(i)TijEj(ot+1) • So, the previous state at time t is: i* = arg max δ t ( i ) Tij E j ( ot +1 ) i 28/03/2011 Markov models 78
79. 79. Viterbi algorithm• Summary: δ t +1 ( j ) = δ t ( i* ) Tij E j ( ot +1 ) δ1 ( i ) = π i Ei ( o1 ) = α1 ( i ) mppt +1 ( j ) = mppt i* s j ( ) i* = arg max δ t ( i ) Tij E j ( ot +1 ) i N δ (4) 1 s4 δ 1 (3) s3 δ 2 (3) δ 1 (2) s2 s1 δ 1 (1) 1 2 3 4 5 6 T 28/03/2011 Markov models 79
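Putting the recurrence and the back-pointers i* together, here is a compact sketch of Viterbi decoding (our function name; the model is the one from the earlier slides, with 0-based indices):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def viterbi(pi, T, E, obs):
    """Return (most probable 0-based state path, its joint prob with obs)."""
    N = len(pi)
    delta = [pi[i] * E[i][obs[0]] for i in range(N)]   # delta_1(i)
    back = []                                          # back[t][j] = best predecessor
    for o in obs[1:]:
        ptrs = [max(range(N), key=lambda i: delta[i] * T[i][j])
                for j in range(N)]
        delta = [delta[ptrs[j]] * T[ptrs[j]][j] * E[j][o] for j in range(N)]
        back.append(ptrs)
    q = max(range(N), key=lambda i: delta[i])          # best final state
    path = [q]
    for ptrs in reversed(back):                        # follow back-pointers
        q = ptrs[q]
        path.append(q)
    path.reverse()
    return path, max(delta)

path, p = viterbi(pi, T, E, [2, 0, 2])   # observed X3 X1 X3
print(path, p)  # [1, 2, 1] ≈ 0.023328, i.e. S2 S3 S2
```

The trellis work is identical in shape to the forward pass, with max replacing the sum, which is why both run in O(TN²).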
80. 80. What’s Viterbi used for? • Speech RecognitionChong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large VocabularyContinuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008. 28/03/2011 Markov models 80
81. 81. Training HMMs• Given: large sequence of observation o1o2...oT and number of states N.• Goal: Estimation of parameters Φ = 〈T, E, π〉• That is, how to design an HMM.• We will infer the model from a large amount of data o1o2...oT with a big “T”. 28/03/2011 Markov models 81
82. 82. Training HMMs• Remember, we have just computed p(o1o2...oT | Φ)• Now, we have some observations and we want to infer Φ from them.• So, we could use: – MAX LIKELIHOOD: Φ = arg max p ( o1 … oT | Φ ) Φ – BAYES: Compute p ( Φ | o1 … oT ) then take E [ Φ ] or max p ( Φ | o1 … oT ) Φ 28/03/2011 Markov models 82
83. 83. Max likelihood for HMMs• Forward probability: the probability of producing o1...ot while ending up in state si α1 ( i ) = Ei ( o1 ) π i αt ( i ) = p ( o1o2 ...ot ∧ qt = si ) α t +1 ( i ) = Ei ( ot +1 ) ∑ T jiα t ( j ) j• Backward probability: the probability of producing ot+1...oT given that at time t, we are at state si βt ( i ) = p ( ot +1ot +2 ...oT | qt = si ) 28/03/2011 Markov models 83
84. 84. Max likelihood for HMMs - Backward• Backward probability: easy to define recursively βt ( i ) = p ( ot +1ot +2 ...oT | qt = si ) βT ( i ) = 1 N βT ( i ) = 1 βt ( i ) = ∑ βt +1 ( j ) Tij E j ( ot +1 ) N j =1 βt ( i ) = ∑ p ( ot +1 ∧ ot +2 ...oT ∧ qt +1 = s j | qt = si ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | ot +1 ∧ qt +1 = s j ∧ qt = si ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | qt +1 = s j ) j =1 N = ∑ βt +1 ( j ) Tij E j ( ot +1 ) j =1 28/03/2011 Markov models 84
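The backward recursion mirrors the forward one, running right to left. A sketch, with the sanity check p(O) = Σ_i π_i E_i(o₁) β₁(i), which must agree with the forward computation (`backward` is our function name):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def backward(T, E, obs):
    """beta[t][i] = p(obs[t+1:] | state i at step t); beta[-1] is all ones."""
    N = len(T)
    beta = [[1.0] * N]                       # base case: beta_T(i) = 1
    for o in reversed(obs[1:]):              # recursive case, right to left
        beta.insert(0, [sum(T[i][j] * E[j][o] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

obs = [2, 0, 2]                              # X3 X1 X3
beta = backward(T, E, obs)
pO = sum(pi[i] * E[i][obs[0]] * beta[0][i] for i in range(3))
print(pO)  # ≈ 0.094344, the same value the forward pass gives
```

Having both α and β for every cell of the trellis is exactly what the expected-count formulas on the next slides consume.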
85. 85. Max likelihood for HMMs• The probability of traversing a certain arc at time t given o1o2...oT: ε ij ( t ) = p ( qt = si ∧ qt +1 = s j | o1o2 …oT ) = p ( qt = si ∧ qt +1 = s j ∧ o1o2 …oT ) / p ( o1o2 …oT ) α t ( i ) Tij E j ( ot +1 ) βt +1 ( j ) ε ij ( t ) = N ∑ α t ( i ) βt ( i ) i =1 28/03/2011 Markov models 85
86. 86. Max likelihood for HMMs
• The probability of being in state si at time t, given o1o2...oT (for t < T):
γi(t) = p(qt = si | o1o2...oT)
= Σj=1..N p(qt = si ∧ qt+1 = sj | o1o2...oT)
= Σj=1..N εij(t)
28/03/2011 Markov models 86
87. 87. Max likelihood for HMMs
• Sum over the time index:
– Expected # of transitions from state i to j in o1o2...oT:
Σt=1..T−1 εij(t)
– Expected # of transitions from state i in o1o2...oT:
Σt=1..T−1 γi(t) = Σt=1..T−1 Σj=1..N εij(t) = Σj=1..N Σt=1..T−1 εij(t)
28/03/2011 Markov models 87
88. 88. Update parameters
Π = {πi} = {p(q1 = si)}
T = {Tij} = {p(qt+1 = sj | qt = si)}
E = {Eik} = {p(ot = xk | qt = si)}
• π̂i = expected frequency in state i at time t = 1 = γi(1)
• T̂ij = (expected # of transitions from state i to j) / (expected # of transitions from state i)
= Σt=1..T−1 εij(t) / Σt=1..T−1 γi(t) = Σt=1..T−1 εij(t) / Σj=1..N Σt=1..T−1 εij(t)
• Êik = (expected # of times in state i with xk observed) / (expected # of times in state i)
= Σt=1..T−1 δ(ot, xk) γi(t) / Σt=1..T−1 γi(t) = Σj=1..N Σt=1..T−1 δ(ot, xk) εij(t) / Σj=1..N Σt=1..T−1 εij(t)
28/03/2011 Markov models 88
89. 89. The inner loop of Forward-Backward
Given an input sequence:
1. Calculate the forward probability:
– Base case: α1(i) = Ei(o1) πi
– Recursive case: αt+1(i) = Ei(ot+1) Σj Tji αt(j)
2. Calculate the backward probability:
– Base case: βT(i) = 1
– Recursive case: βt(i) = Σj=1..N βt+1(j) Tij Ej(ot+1)
3. Calculate the expected counts:
εij(t) = αt(i) Tij Ej(ot+1) βt+1(j) / Σi=1..N αt(i) βt(i)
4. Update the parameters:
Tij = Σt=1..T−1 εij(t) / Σj=1..N Σt=1..T−1 εij(t)
Eik = Σj=1..N Σt=1..T−1 δ(ot, xk) εij(t) / Σj=1..N Σt=1..T−1 εij(t)
28/03/2011 Markov models 89
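The four steps above can be sketched end-to-end as one Baum-Welch iteration on a single sequence. This is a toy sketch (made-up parameters, no log/scaling tricks, which a real implementation needs for long sequences):

```python
import numpy as np

def baum_welch_step(pi, T, E, obs):
    """One EM iteration; returns updated (pi, T, E) and p(obs | current model)."""
    L, N, K = len(obs), len(pi), E.shape[1]
    # Steps 1-2: forward and backward passes
    alpha = np.zeros((L, N)); beta = np.ones((L, N))
    alpha[0] = pi * E[:, obs[0]]
    for t in range(L - 1):
        alpha[t + 1] = E[:, obs[t + 1]] * (alpha[t] @ T)
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # Step 3: eps[t, i, j] = p(q_t = s_i, q_{t+1} = s_j | obs)
    eps = np.zeros((L - 1, N, N))
    for t in range(L - 1):
        eps[t] = alpha[t][:, None] * T * E[:, obs[t + 1]] * beta[t + 1]
    eps /= p_obs
    gamma = eps.sum(axis=2)            # gamma[t, i], for t = 1..T-1 as on the slide
    # Step 4: re-estimate the parameters from the expected counts
    new_pi = gamma[0]
    new_T = eps.sum(axis=0) / gamma.sum(axis=0)[:, None]
    new_E = np.zeros((N, K))
    for k in range(K):
        mask = (np.array(obs[:-1]) == k)
        new_E[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_T, new_E, p_obs

# Toy run: the likelihood of the data should never decrease across iterations
pi  = np.array([0.5, 0.5])
T   = np.array([[0.7, 0.3], [0.4, 0.6]])
E   = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1, 0, 1, 1, 0, 0]
liks = []
for _ in range(5):
    pi, T, E, p = baum_welch_step(pi, T, E, obs)
    liks.append(p)
print(all(y >= x - 1e-12 for x, y in zip(liks, liks[1:])))
```

Note that, following the slide, the emission counts here sum over t = 1..T−1 only; many references include t = T as well.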
90. 90. Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
• If we knew the quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
we could compute the max likelihood estimate of Φ = 〈T, E, Π〉
• Also known (for the HMM case) as the Baum-Welch algorithm.
28/03/2011 Markov models 90
91. 91. EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:
p(o1o2...oT | Φ̂) ≥ p(o1o2...oT | Φ)
• However, the algorithm is not guaranteed to reach the global maximum.
28/03/2011 Markov models 91
92. 92. EM for HMM
• Bad News
– There are lots of local optima.
• Good News
– The local optima are usually adequate models of the data.
• Notice
– EM does not estimate the number of states. That must be given (tradeoffs).
– Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in the initial estimate Φ(0).
– Easy extension of everything seen today: HMMs with real-valued outputs.
28/03/2011 Markov models 92
93. 93. Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
28/03/2011 Markov models 93
94. 94. Example: Image segmentation
• Observations: pixel values
• Hidden variables: the class of each pixel
• It’s reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr.... the relationships are in 2D!
28/03/2011 Markov models 94
95. 95. MRF as a 2D generalization of MC
• Array of observations: X = {xij}, 0 ≤ i < Nx, 0 ≤ j < Ny
• Classes/States: S = {sij}, sij ∈ {1, ..., M}
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(X | S) p(S) is maximized.
28/03/2011 Markov models 95
96. 96. 2D context-dependent classification
• Assumptions:
– The values of the elements in S are mutually dependent.
– The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood Nij is defined so that:
– sij ∉ Nij: the (i, j) element does not belong to its own set of neighbors.
– sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl, then skl is also a neighbor of sij.
28/03/2011 Markov models 96
97. 97. 2D context-dependent classification
• The Markov property for the 2D case:
p(sij | Sij) = p(sij | Nij)
where Sij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
28/03/2011 Markov models 97
98. 98. 2D context-dependent classification
• The Markov property for the 2D case:
p(sij | Sij) = p(sij | Nij)
where Sij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
We are gonna see an application of MRF for Image Segmentation and Restoration.
28/03/2011 Markov models 98
99. 99. MRF for Image Segmentation
• Clique: a set of pixels that are all neighbors of one another (w.r.t. the type of neighborhood)
28/03/2011 Markov models 99
100. 100. MRF for Image Segmentation
• Dual lattice
• Line process
(figures omitted)
28/03/2011 Markov models 100
101. 101. MRF for Image Segmentation
• Gibbs distribution:
π(s) = (1/Z) exp(−U(s)/T)
– Z: normalizing constant
– T: parameter
• It turns out that the Gibbs distribution implies an MRF ([Geman 84])
28/03/2011 Markov models 101
102. 102. MRF for Image Segmentation
• A Gibbs conditional probability is of the form:
p(sij | Nij) = (1/Z) exp(−(1/T) Σk Fk(Ck(i, j)))
– Ck(i, j): the cliques of the pixel (i, j)
– Fk: some functions, e.g.
(1/T) (−sij (α1 + α2 (si−1,j + si+1,j) + α2 (si,j−1 + si,j+1)))
28/03/2011 Markov models 102
103. 103. MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is
p(S) = (1/Z) exp(−(1/T) Σi,j Σk Fk(Ck(i, j)))
– The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
28/03/2011 Markov models 103
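[Geman 84] maximize p(X | S) p(S) by stochastic relaxation (simulated annealing). As a simpler illustration of the same objective, here is a toy sketch using Besag’s ICM (iterated conditional modes): greedily re-label each pixel to minimize its local energy. The Gaussian likelihood, Ising-style smoothness prior, and all numbers are assumptions for this example, not from the slides:

```python
import numpy as np

def icm_segment(X, means, sigma, beta=1.0, iters=5):
    """Toy ICM for M-class MRF segmentation: greedily maximize p(X|S)p(S).
    Likelihood: Gaussian per class; prior: penalty beta per disagreeing 4-neighbour."""
    S = np.argmin([(X - m) ** 2 for m in means], axis=0)   # init: nearest class mean
    H, W = X.shape
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                costs = []
                for s, m in enumerate(means):
                    data = (X[i, j] - m) ** 2 / (2 * sigma ** 2)   # -log p(x|s) up to const
                    disagree = sum(S[a, b] != s
                                   for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                                   if 0 <= a < H and 0 <= b < W)
                    costs.append(data + beta * disagree)           # -log p(s|N) up to const
                S[i, j] = int(np.argmin(costs))
    return S

# Noisy two-region toy image (hypothetical data)
rng = np.random.default_rng(0)
truth = np.zeros((8, 8), dtype=int); truth[:, 4:] = 1
X = truth + rng.normal(0, 0.3, truth.shape)
S = icm_segment(X, means=[0.0, 1.0], sigma=0.3)
print((S == truth).mean())
```

Unlike simulated annealing, ICM only finds a local optimum, but it converges in a handful of sweeps and already shows how the smoothness prior cleans up isolated noisy labels.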
104. 104. More on Markov models...
• MRF does not stop there... Here are some related models:
– Conditional random fields (CRF)
– Graphical models
– ...
• Markov Chains and HMMs do not stop there either:
– Markov chains of order m
– Continuous-time Markov chains
– Real-valued observations
– ...
28/03/2011 Markov models 104
105. 105. What you should know
• Markov property, Markov Chain
• HMM:
– Defining and computing αt(i)
– Viterbi algorithm
– Outline of the EM algorithm for HMM
• Markov Random Field
– And an application in Image Segmentation
– [Geman 84] for more information.
28/03/2011 Markov models 105
106. 106. Q&A
28/03/2011 Markov models 106
107. 107. References
• L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, “Hidden Markov Models”, http://www.autonlab.org/tutorials/
• S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.
28/03/2011 Markov models 107