Probabilistic Models for Sequence Data
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Random Sequence Data
โ€ข ๐‘‹1, ๐‘‹2, โ€ฆ , ๐‘‹๐‘‡
โ€ข Subscript denotes steps in โ€œtimeโ€
โ€ข Many applications
โ€ข NLP, โ€œArrivalโ€ data, Weather, Stocks, Biology, โ€ฆ
โ€ข Naรฏve models
โ€ข Complete independence โ€“ too simple
โ€ข Complete pairwise dependence โ€“ too many parameters
Markov Model
โ€ข ๐‘ƒ(๐‘‹1: ๐‘‡) = ๐‘ƒ(๐‘‹1)(๐‘ƒ๐‘‹2|๐‘‹1) โ€ฆ ๐‘ƒ(๐‘‹๐‘‡โˆ’1|๐‘‹๐‘‡)
• Markov (conditional independence) assumption
• $X_{t+1} \perp X_{t-i} \mid X_t$ for all $i$
• Future conditionally independent of past given present
• Still too many parameters
• Homogeneous / stationary: $P(X_{t+1} \mid X_t)$ is the same for all $t$
๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡โˆ’1 ๐‘‹๐‘‡
State Transition Matrix
• For discrete $X_i$
• State transition matrix $A$
• $A_{ij} = P(X_{t+1} = j \mid X_t = i)$
• Stochastic matrix (each row sums to 1)
• $n$-step matrix $A(n)$
• $A_{ij}(n) = P(X_{t+n} = j \mid X_t = i)$
• Chapman-Kolmogorov equation
• $A_{ij}(m + n) = \sum_k A_{ik}(m)\, A_{kj}(n)$
• $A(m + n) = A(m)\, A(n)$
• $A(n) = A^n$
MLE for a Markov Chain
• Data: multiple sequences
• $D = \{x_1, \ldots, x_N\}$, where $x_i = (x_{i1}, \ldots, x_{iT_i})$
• Formulate the likelihood
$$\prod_i p(x_{i1}) \prod_j p(x_{ij} \mid x_{i,j-1}) = \prod_j \pi_j^{N_j^1} \prod_j \prod_k A_{jk}^{N_{jk}}$$
where $N_j^1$, $N_{jk}$ denote …
• Maximize w.r.t. the parameters
$$\pi_j = \frac{N_j^1}{\sum_{j'} N_{j'}^1} \qquad A_{jk} = \frac{N_{jk}}{\sum_{k'} N_{jk'}}$$
Digression: Stationary Distribution
• Does a Markov chain ever stabilize, i.e. reach $\pi_{t+1} = \pi_t$?
• Only if $A$ satisfies particular properties
• Ergodic Markov chains
• Without ergodic Markov chains our daily lives would come to a standstill …
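A small numerical illustration (again with a made-up ergodic chain): repeatedly applying $A$ to any initial distribution settles at a $\pi$ with $\pi = \pi A$:

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],           # made-up ergodic chain
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

pi = np.array([1.0, 0.0, 0.0])           # arbitrary initial distribution
for _ in range(1000):                    # pi_{t+1} = pi_t A
    pi = pi @ A

assert np.allclose(pi, pi @ A)           # stabilized: pi = pi A
```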
Hidden Markov Model
• Sequence over discrete states
• States generate emissions
• Emissions are observed, but states are hidden
Motivations
• States have physical "meaning"
• Speech recognition
• Speaker diarization
• Part-of-speech tagging
• …
• Long-range dependencies, unlike a plain Markov model, with only a few parameters
Hidden Markov Model
• Conditional independence assumptions
$Z_{t+1} \perp Z_{t-1} \mid Z_t$
$X_t \perp Z_{t'}, X_{t''} \mid Z_t$ (for $t' \neq t$, $t'' \neq t$)
• Likelihood
$$P(Z, X) = P(Z_1)\, P(X_1 \mid Z_1) \prod_{t=2}^{T} P(Z_t \mid Z_{t-1})\, P(X_t \mid Z_t)$$
Hidden Markov Model: Generative Process
1. Sample an initial state: $Z_1 \sim p(z_1)$
2. Sample an initial output from the initial state: $X_1 \sim p(x_1 \mid z_1)$
3. Repeat for $t = 2, \ldots, T$:
4. Sample the current state given the previous state: $Z_t \sim p(z_t \mid z_{t-1})$
5. Sample the current output given the current state: $X_t \sim p(x_t \mid z_t)$
• Compare with i.i.d. generative mixture models
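A minimal sketch of this generative process for discrete emissions; the parameter names `pi`, `A`, `B` and their shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, A, B, T):
    """Sample (z_1..T, x_1..T) from a discrete-emission HMM.

    pi[k]   = p(z_1 = k);  A[j, k] = p(z_t = k | z_{t-1} = j);  B[k, v] = p(x_t = v | z_t = k).
    """
    z = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    z[0] = rng.choice(len(pi), p=pi)                  # 1. initial state
    x[0] = rng.choice(B.shape[1], p=B[z[0]])          # 2. initial emission
    for t in range(1, T):                             # 3. repeat
        z[t] = rng.choice(A.shape[1], p=A[z[t - 1]])  # 4. transition
        x[t] = rng.choice(B.shape[1], p=B[z[t]])      # 5. emission
    return z, x
```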
Inference Problems
• Filtering: $P(z_t \mid x_{1:t})$
• Smoothing: $P(z_{t-l} \mid x_{1:t})$ for $l = 1, 2, \ldots$
• Prediction: $P(x_{t+h} \mid x_{1:t})$ for $h = 1, 2, \ldots$
• MAP estimation / Viterbi decoding: $\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})$
Inference: Challenge
• Naïve summation does not work
$$p(z_t \mid x) = \frac{\sum_{z_{-t}} p(z_1, z_2, \ldots, z_t, \ldots, z_T, x_1, x_2, \ldots, x_T)}{\sum_{z_{1:T}} p(z_1, z_2, \ldots, z_t, \ldots, z_T, x_1, x_2, \ldots, x_T)}$$
• What is the cost? ($O(K^T)$ terms for $K$ state values)
• Solution: identify and reuse shared computations
Forward Algorithm
Define ๐›ผ๐‘ก ๐‘— โ‰ก ๐‘ ๐‘ง๐‘ก = ๐‘— ๐‘ฅ1:๐‘ก
= ๐‘ ๐‘ง๐‘ก = ๐‘— ๐‘ฅ๐‘ก, ๐‘ฅ1:๐‘กโˆ’1
โˆ ๐‘ ๐‘ฅ๐‘ก ๐‘ง๐‘ก = ๐‘— ๐‘ ๐‘ง๐‘ก = ๐‘— ๐‘ฅ1:๐‘กโˆ’1 (Derivation? Normalization?)
= ๐‘ ๐‘ฅ๐‘ก ๐‘ง๐‘ก = ๐‘—
๐‘–
๐‘ ๐‘ง๐‘ก = ๐‘— ๐‘ง๐‘กโˆ’1 = ๐‘– ๐‘(๐‘ง๐‘กโˆ’1 = ๐‘–|๐‘ฅ1:๐‘กโˆ’1)
= ๐œ“๐‘ก(๐‘—)
๐‘–
๐ด(๐‘—, ๐‘–)๐›ผ๐‘กโˆ’1(๐‘–)
โ€ข Simple recursive algorithm
โ€ข Base case: ๐›ผ0 ๐‘— = ๐œ“0 ๐‘— ๐œ‹๐‘—
โ€ข Complexity ๐‘‚(๐พ2
๐‘‡) where K is #state values
โ€ข Compare with iid models
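A sketch of the forward pass under the conventions above (rows of `A` index the previous state; `psi[t, j]` holds $p(x_t \mid z_t = j)$, precomputed for the observed sequence); per-step normalization is used to avoid underflow:

```python
import numpy as np

def forward(pi, A, psi):
    """Forward pass: alpha[t, j] = p(z_t = j | x_{1:t}).

    pi[j]     = p(z_1 = j)
    A[i, j]   = p(z_t = j | z_{t-1} = i)
    psi[t, j] = p(x_t | z_t = j), precomputed for the observed x_{1:T}
    """
    T, K = psi.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * psi[0]
    alpha[0] /= alpha[0].sum()                  # base case, normalized
    for t in range(1, T):
        alpha[t] = psi[t] * (alpha[t - 1] @ A)  # psi_t(j) * sum_i alpha_{t-1}(i) A(i, j)
        alpha[t] /= alpha[t].sum()              # normalize: O(K^2) work per step
    return alpha
```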
Backward Algorithm
Define ๐›ฝ๐‘ก ๐‘— โ‰ก ๐‘ ๐‘ฅ๐‘ก+1:๐‘‡ ๐‘ง๐‘ก = ๐‘—
=
๐‘–
๐‘ ๐‘ง๐‘ก+1 = ๐‘–, ๐‘ฅ๐‘ก+1, ๐‘ฅ๐‘ก+2:๐‘‡ |๐‘ง๐‘ก = ๐‘—)
=
๐‘–
๐‘ ๐‘ฅ๐‘ก+2:๐‘‡ ๐‘ง๐‘ก+1 = ๐‘– ๐‘(๐‘ง๐‘ก+1 = ๐‘–, ๐‘ฅ๐‘ก+1|๐‘ง๐‘ก = ๐‘—)
=
๐‘–
๐‘ ๐‘ฅ๐‘ก+2:๐‘‡ ๐‘ง๐‘ก+1 = ๐‘– ๐‘ ๐‘ฅ๐‘ก+1 ๐‘ง๐‘ก+1 = ๐‘– ๐‘ ๐‘ง๐‘ก+1 = ๐‘– ๐‘ง๐‘ก = ๐‘—
=
๐‘–
๐›ฝ๐‘ก+1 ๐‘– ๐œ“๐‘ก+1 ๐‘– ๐ด(๐‘—, ๐‘–)
โ€ข Simple recursive algorithm
โ€ข Base case?
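A matching sketch of the backward pass (same conventions as the forward sketch; rows are rescaled only for numerical stability):

```python
import numpy as np

def backward(A, psi):
    """Backward pass: beta[t, j] proportional to p(x_{t+1:T} | z_t = j).

    Rows are rescaled to avoid underflow, which leaves the smoothed marginals unchanged.
    """
    T, K = psi.shape
    beta = np.ones((T, K))                           # base case: beta_T(j) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (psi[t + 1] * beta[t + 1])     # sum_i A(j, i) psi_{t+1}(i) beta_{t+1}(i)
        beta[t] /= beta[t].sum()
    return beta
```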
Forward Backward Algorithm
Define ๐›พ๐‘ก ๐‘— = ๐‘ ๐‘ง๐‘ก = ๐‘— ๐‘ฅ1:๐‘‡
โˆ ๐›ผ๐‘ก ๐‘— ๐›ฝ๐‘ก(๐‘—)
โ€ข Overall algorithm
โ€ข Compute forward pass
โ€ข Compute backward pass
โ€ข Combine to get ๐›พ๐‘ก ๐‘—
โ€ข Sum product algorithm
โ€ข Belief propagation / message passing algorithm
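Combining the two passes gives the smoothed marginals; this sketch reuses the `forward()` and `backward()` functions from the sketches above:

```python
import numpy as np

def forward_backward(pi, A, psi):
    """Smoothing: gamma[t, j] = p(z_t = j | x_{1:T})."""
    alpha = forward(pi, A, psi)                   # forward pass
    beta = backward(A, psi)                       # backward pass
    gamma = alpha * beta                          # combine
    gamma /= gamma.sum(axis=1, keepdims=True)     # normalize per time step
    return gamma
```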
Viterbi Algorithm
• Most likely sequence of states: $z^* = \arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})$
• Not the same as the sequence of most likely states:
$\big(\arg\max_{z_1} p(z_1 \mid x_{1:T}),\ \arg\max_{z_2} p(z_2 \mid x_{1:T}),\ \ldots,\ \arg\max_{z_T} p(z_T \mid x_{1:T})\big)$
• Forward-backward + additional bookkeeping
• Track a "trellis" of states
• Max-product algorithm
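A sketch of Viterbi in log space, with a `back` array playing the role of the trellis bookkeeping (same conventions as the forward sketch):

```python
import numpy as np

def viterbi(pi, A, psi):
    """argmax_z p(z_{1:T} | x_{1:T}) in log space (same conventions as forward())."""
    T, K = psi.shape
    log_A, log_psi = np.log(A), np.log(psi)
    delta = np.zeros((T, K))                    # best log-score of any path ending in j at t
    back = np.zeros((T, K), dtype=int)          # trellis bookkeeping: best predecessor
    delta[0] = np.log(pi) + log_psi[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: best path ... i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_psi[t]
    z = np.zeros(T, dtype=int)
    z[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # backtrack through the trellis
        z[t] = back[t + 1, z[t + 1]]
    return z
```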
Parameter Estimation
• Straightforward for fully observed data
• Homework
EM for HMMs (Baum-Welch Algorithm)
• Formulate the complete-data loglikelihood
$$\ell(\Theta) = \sum_k N_k^1 \log \pi_k + \sum_j \sum_{j'} N_{jj'} \log A_{jj'} + \sum_i \sum_t \sum_k \delta(z_{it}, k) \log p(x_{it} \mid \phi_k)$$
• Formulate its expectation given the current parameters
$$Q(\Theta, \theta^{old}) = \sum_k E[N_k^1] \log \pi_k + \sum_j \sum_{j'} E[N_{jj'}] \log A_{jj'} + \sum_i \sum_t \sum_k p(z_{it} = k \mid x_i, \theta^{old}) \log p(x_{it} \mid \phi_k)$$
E-step
• Compute the current expectations
$E[N_k^1] = \sum_i p(z_{i1} = k \mid x_i, \theta^{old})$ (use the backward algorithm)
$E[N_{jk}] = \sum_i \sum_t p(z_{i,t-1} = j, z_{i,t} = k \mid x_i, \theta^{old})$ (use an extension of forward-backward; HW)
$E[N_k] = \sum_i \sum_t p(z_{i,t} = k \mid x_i, \theta^{old})$ (use forward-backward)
• Compare with i.i.d. models
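A sketch of these expected counts for a single sequence, built on the `forward()` and `backward()` sketches above; summing the returned quantities over sequences gives $E[N_k^1]$, $E[N_{jk}]$, $E[N_k]$:

```python
import numpy as np

def expected_counts(pi, A, psi):
    """Expected sufficient statistics for one observed sequence.

    Returns gamma[t, k] = p(z_t = k | x_{1:T}) and
    xi_sum[j, k] = sum_t p(z_{t-1} = j, z_t = k | x_{1:T}).
    """
    alpha, beta = forward(pi, A, psi), backward(A, psi)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    T, K = psi.shape
    xi_sum = np.zeros((K, K))
    for t in range(1, T):
        xi = alpha[t - 1][:, None] * A * (psi[t] * beta[t])[None, :]  # two-slice marginal
        xi_sum += xi / xi.sum()
    return gamma, xi_sum
```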
M-step
• Maximize the expected complete-data loglikelihood using the current expected sufficient statistics
$$A_{jk} = \frac{E[N_{jk}]}{\sum_{k'} E[N_{jk'}]} \qquad \pi_k = \frac{E[N_k^1]}{N}$$
• Emission parameter estimates depend on the emission distribution
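A minimal sketch of these two updates from already-computed expected counts (emission updates omitted, since they depend on the emission family):

```python
import numpy as np

def m_step(E_N1, E_Njk, N):
    """M-step for pi and A from expected counts (emission updates omitted).

    E_N1[k]     = E[N_k^1], expected number of sequences starting in state k
    E_Njk[j, k] = E[N_jk], expected number of j -> k transitions
    N           = number of sequences
    """
    pi = E_N1 / N
    A = E_Njk / E_Njk.sum(axis=1, keepdims=True)
    return pi, A
```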
Choosing number of hidden states
• Similar to choosing the number of components in a mixture model
• One possibility: cross-validation
• Computationally more costly
Many, many generalizations …
• Workhorse of signal processing (e.g. speech) for decades
• Continuous data
• Long-range dependencies
• Multiple hidden layers
• Handling inputs
• …
Linear Chain Conditional Random Fields
• Discriminative Markov model (recall NB vs. LR)
$$p(z \mid x) = \frac{1}{Z(x)} \prod_t \exp\Big(\sum_k \theta_k f_k(z_t, z_{t-1}, x_t)\Big),$$
where $Z(x) = \sum_z \prod_t \exp\big(\sum_k \theta_k f_k(z_t, z_{t-1}, x_t)\big)$
• Forward-backward algorithm for inference
• Gradient descent instead of EM for parameter estimation
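A tiny sketch of the unnormalized score inside this product; the feature function and weights are hypothetical placeholders, and computing $Z(x)$ would reuse a forward-style recursion:

```python
import numpy as np

def crf_log_score(theta, feats, z, x):
    """Unnormalized log-score sum_t sum_k theta_k f_k(z_t, z_{t-1}, x_t) of a labeling z.

    feats(z_t, z_prev, x_t) returns the K feature values; z_prev is None at t = 1.
    Subtracting log Z(x) would give log p(z | x).
    """
    score = 0.0
    for t in range(len(x)):
        z_prev = z[t - 1] if t > 0 else None
        score += float(np.dot(theta, feats(z[t], z_prev, x[t])))
    return score
```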
State Space Models
• HMM with continuous hidden states
• Linear dynamical systems
• Conditional distributions are linear-Gaussian
• Mathematically tractable inference
• Kalman filter
• Widely used in time-series analysis and forecasting, object tracking, robotics, etc.
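A minimal scalar Kalman-filter sketch, the linear-Gaussian analogue of the HMM forward pass; all parameters and observations below are made up:

```python
# Model: z_t = a z_{t-1} + noise (var q);   x_t = c z_t + noise (var r)
a, c, q, r = 1.0, 1.0, 0.1, 0.5              # made-up scalar parameters

def kalman_step(mu, P, x):
    """One predict + update step; (mu, P) parameterize p(z_t | x_{1:t}) = N(mu, P)."""
    mu_pred, P_pred = a * mu, a * P * a + q   # predict p(z_t | x_{1:t-1})
    K = P_pred * c / (c * P_pred * c + r)     # Kalman gain
    mu_new = mu_pred + K * (x - c * mu_pred)  # correct with observation x_t
    P_new = (1 - K * c) * P_pred
    return mu_new, P_new

mu, P = 0.0, 1.0                              # belief before seeing any data
for x in [0.9, 1.1, 1.4]:                     # a few fake observations
    mu, P = kalman_step(mu, P, x)
```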
Probabilistic Graphical Models
• Used for modelling conditional independencies in joint distributions over a large number of RVs
• Directed Graphical Models / Bayes Nets
• Undirected Graphical Models / Random Fields
• Inference is computationally hard in general
• Parameter estimation with hidden variables is harder
• Probability theory + graph theory