Conditional Random Fields



1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
   John Lafferty, Andrew McCallum, Fernando Pereira
   Speaker: Shu-Ying Li
2. Outline
   - Introduction
   - Conditional Random Fields
   - Parameter Estimation for CRFs
   - Experiments
   - Conclusions
3. Introduction
   Sequence segmenting and labeling
   - Goal: mark up sequences with content tags.
   Generative models
   - Hidden Markov models (HMMs) and stochastic grammars.
   - Assign a joint probability to paired observation and label sequences.
   - Parameters are typically trained to maximize the joint likelihood of training examples.
   [Figure: HMM graphical model: states S(t-1), S(t), S(t+1) with observations O(t), O(t+1)]
4. Introduction (cont.)
   Conditional models
   - Model the conditional probability P(label sequence y | observation sequence x) rather than the joint P(y, x).
   - Allow arbitrary, non-independent features of the observation sequence x.
   - The probability of a transition between labels may depend on past and future observations.
   Maximum Entropy Markov Models (MEMMs)
   [Figure: MEMM graphical model: states S(t-1), S(t), S(t+1) with observations O(t-1), O(t), O(t+1)]
5. Introduction (cont.)
   The label bias problem
   - Bias toward states with fewer outgoing transitions.
   Pr(1 and 2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
   Pr(1 and 2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
   Since state 1 has only one outgoing transition, per-state normalization forces
   Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so
   Pr(1 and 2 | ro) = Pr(1 and 2 | ri),
   but it should be Pr(1 and 2 | ro) < Pr(1 and 2 | ri)!
6. Introduction (cont.)
   Solving the label bias problem
   - Change the state-transition structure of the model.
   - Start with a fully-connected model and let the training procedure figure out a good structure.
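The effect behind the label bias problem can be reproduced in a few lines. A minimal sketch, assuming a hypothetical state with a single outgoing transition (a stand-in for state 1 in the ro/ri automaton above); the scores are made up for illustration:

```python
# Hypothetical illustration of the label bias problem: in an MEMM,
# transition scores are normalized per state (locally), so a state with
# a single successor ignores the observation entirely.
import math

def local_softmax(scores):
    """Per-state (local) normalization over successor scores, as in an MEMM."""
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# State 1 can only go to state 2.  Even when the observation strongly
# contradicts that transition (score -5.0), local normalization still
# assigns it probability 1, so the observation has no effect.
for obs, score in (("o", -5.0), ("i", +5.0)):
    p = local_softmax({2: score})
    print(f"Pr(2 | 1, {obs}) = {p[2]}")  # 1.0 in both cases
```

A CRF instead normalizes once over whole label sequences, so an unlikely observation can still shift probability mass away from the path passing through state 1.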
7. Conditional Random Fields
   Random field
   - Let G = (Y, E) be a graph where each vertex Y_v is a random variable. If P(Y_v | all other Y) = P(Y_v | neighbors(Y_v)), then Y is a random field.
   Example (chain):
   - P(Y_5 | all other Y) = P(Y_5 | Y_4, Y_6)
8. Conditional Random Fields
   - If P(Y_v | X, all other Y) = P(Y_v | X, neighbors(Y_v)), then (X, Y) is a conditional random field.
   - X: observations, Y: labels.
   - Example: P(Y_3 | X, all other Y) = P(Y_3 | X, Y_2, Y_4)
   X = X_1, ..., X_{n-1}, X_n
9. Conditional Random Fields
   Conditional distribution [2]
     p(y | x) = (1/Z(x)) exp( Σ_i Σ_j λ_j t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k s_k(y_i, x, i) )
   - t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence.
   - s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence.
   - λ_j and μ_k are parameters to be estimated from training data.
   Conditional distribution [1]
     p_θ(y | x) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) )
   - x: data sequence; y: label sequence
   - v: vertex from the vertex set V; e: edge from the edge set E over V
   - f_k: Boolean edge feature; g_k: Boolean vertex feature
   - k: index over features
   - λ_k and μ_k are parameters to be estimated
   - y|_e is the set of components of y defined by edge e
   - y|_v is the set of components of y defined by vertex v
   [Figure: chain-structured CRF: labels Y(t-1), Y(t), Y(t+1) with observations X(t-1), X(t), X(t+1)]
10. Conditional Random Fields
    Conditional distribution
    - CRFs use a normalization Z(x) for the conditional distribution.
    - Z(x) is a normalization over the data sequence x.
    [1]:  p(y | x) = (1/Z(x)) exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) )
    [2]:  p(y | x) = (1/Z(x)) exp( Σ_{i=1}^{n} Σ_j λ_j f_j(y_{i-1}, y_i, x, i) )
    where each f_j(y_{i-1}, y_i, x, i) is either a state function s(y_{i-1}, y_i, x, i) or a transition function t(y_{i-1}, y_i, x, i).
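The chain distribution can be computed by brute-force enumeration on a toy example. A minimal sketch, assuming a two-label alphabet, one made-up transition feature, one made-up state feature, and arbitrary weights (none of these come from the paper):

```python
# Brute-force sketch of the chain CRF distribution p(y | x) with unified
# features f_j(y_{i-1}, y_i, x, i).  All features and weights are
# illustrative assumptions.
import itertools
import math

LABELS = ["A", "B"]

def f_trans(y_prev, y, x, i):      # transition-style feature t(y_{i-1}, y_i, x, i)
    return 1.0 if y_prev == y else 0.0

def f_state(y_prev, y, x, i):      # state-style feature s(y_i, x, i)
    return 1.0 if y == x[i] else 0.0

FEATURES = [f_trans, f_state]
WEIGHTS = [0.5, 2.0]               # λ_j, chosen arbitrarily

def score(y, x):
    """Σ_i Σ_j λ_j f_j(y_{i-1}, y_i, x, i), with y_0 = 'start'."""
    ys = ["start"] + list(y)
    return sum(w * f(ys[i], ys[i + 1], x, i)
               for i in range(len(x))
               for f, w in zip(FEATURES, WEIGHTS))

def p(y, x):
    """p(y | x) = exp(score) / Z(x), with Z(x) summed over all sequences."""
    z = sum(math.exp(score(yy, x))
            for yy in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z
```

Because Z(x) sums over all |LABELS|^n sequences, this only works for tiny chains; the matrix form on the next slides replaces the exponential sum with a polynomial-time computation.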
11. Conditional Random Fields
    For a chain-structured CRF, we add special start and end states, y_0 and y_{n+1}, with labels start and end respectively.
    - Let 𝒴 be the alphabet from which labels are drawn; y' and y are labels drawn from this alphabet.
    - Define a set of n+1 matrices {M_i(x) | i = 1, ..., n+1}, where each M_i(x) is a matrix with elements of the form
        M_i(y', y | x) = exp( Σ_j λ_j f_j(y', y, x, i) )
12. Conditional Random Fields
    The normalization function Z(x) is the (start, end) entry of the product of these matrices.
    The conditional probability of label sequence y is:
    [1]:  p(y | x) = (1/Z(x)) Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x)
    [2]:  Z(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, end}
    where y_0 = start and y_{n+1} = end.
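The matrix identity can be checked numerically on a toy chain. A minimal sketch, assuming a two-label alphabet and made-up log-potentials standing in for Σ_j λ_j f_j(y', y, x, i); the matrix product is carried out as repeated vector-matrix multiplications starting from the start row:

```python
# Sketch: Z(x) as the (start, end) entry of M_1(x) M_2(x) ... M_{n+1}(x),
# checked against a brute-force sum over all label sequences.  Labels and
# potentials are illustrative assumptions.
import itertools
import math

LABELS = ["A", "B"]

def phi(y_prev, y, i):
    """Stands in for Σ_j λ_j f_j(y', y, x, i); values are made up."""
    return (0.5 if y_prev == y else -0.2) + (0.1 * i if y == "A" else 0.0)

def Z_bruteforce(n):
    """Sum over all label sequences y_1..y_n, with y_0=start, y_{n+1}=end."""
    total = 0.0
    for ys in itertools.product(LABELS, repeat=n):
        path = ["start"] + list(ys) + ["end"]
        total += math.exp(sum(phi(path[i], path[i + 1], i + 1)
                              for i in range(n + 1)))
    return total

def Z_matrix(n):
    """(start, end) entry of the product of the n+1 matrices M_i(x)."""
    states = ["start"] + LABELS + ["end"]
    # Row vector selecting 'start', multiplied through M_1 .. M_{n+1}.
    v = {s: 1.0 if s == "start" else 0.0 for s in states}
    for i in range(1, n + 2):
        w = {s: 0.0 for s in states}
        for yp in states:
            if yp == "end" or v[yp] == 0.0:
                continue  # no transitions out of 'end'
            for y in states:
                if y == "start":
                    continue  # no transitions back into 'start'
                w[y] += v[yp] * math.exp(phi(yp, y, i))
        v = w
    return v["end"]
```

The matrix route costs O(n·|𝒴|²) instead of O(|𝒴|^n), which is the whole point of the construction.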
13. Parameter Estimation for CRFs
    Problem definition: determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...).
    Goal: maximize the log-likelihood objective function
    [1]:  O(θ) = Σ_{x,y} p̃(x, y) log p_θ(y | x)
    where p̃(x, y) is the empirical distribution of the training data.
    This function is concave, guaranteeing convergence to the global maximum.
    At the maximum, the empirical and model feature expectations are equal:
    [2]:  E_p̃[f_k] = E_p[f_k]
    where E_p[·] denotes expectation with respect to distribution p.
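The gradient of this objective is the familiar maximum-entropy form E_p̃[f_k] − E_p[f_k], which can be verified numerically on a toy problem. A minimal sketch; the training pairs, features, and weights below are all invented for illustration:

```python
# Numerical check on a made-up toy CRF that dO/dλ_k = E_p̃[f_k] - E_p[f_k].
# All data and features are illustrative assumptions.
import itertools
import math

LABELS = ["A", "B"]
TRAIN = [(("A", "B"), ("A", "B")), (("B", "B"), ("B", "B"))]  # (x, y) pairs

def feats(y, x):
    """Two toy features: # of positions with y_i == x_i, # with y_{i-1} == y_i."""
    match = sum(1.0 for yi, xi in zip(y, x) if yi == xi)
    same = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [match, same]

def log_p(y, x, w):
    def s(yy):
        return sum(wk * fk for wk, fk in zip(w, feats(yy, x)))
    z = sum(math.exp(s(yy)) for yy in itertools.product(LABELS, repeat=len(x)))
    return s(y) - math.log(z)

def loglik(w):
    """O(θ): log-likelihood summed over the (empirical) training pairs."""
    return sum(log_p(y, x, w) for x, y in TRAIN)

def grad(w):
    """E_p̃[f_k] - E_p[f_k], summed over the training set."""
    g = [0.0, 0.0]
    for x, y in TRAIN:
        emp = feats(y, x)
        model = [0.0, 0.0]
        for yy in itertools.product(LABELS, repeat=len(x)):
            pyy = math.exp(log_p(yy, x, w))
            for k, fk in enumerate(feats(yy, x)):
                model[k] += pyy * fk
        for k in range(2):
            g[k] += emp[k] - model[k]
    return g
```

Comparing `grad` against finite differences of `loglik` confirms the expectation-matching condition: the gradient vanishes exactly when empirical and model expectations agree.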
14. Parameter Estimation for CRFs
    Iterative scaling algorithms [1]
    - Update the weights as λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k for appropriately chosen δλ_k and δμ_k.
    - δλ_k for edge feature f_k is the solution of
        E_p̃[f_k] = Σ_{x,y} p̃(x) p(y | x) f_k(x, y) exp(δλ_k T(x, y))
      where T(x, y) is the total feature count of the pair (x, y).
    - Efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because T(x, y) is a global property of (x, y), and dynamic programming will sum over sequences with potentially varying T.
    Dynamic programming [2]
15. Parameter Estimation for CRFs
    For each index i = 0, ..., n+1, we define forward vectors α_i(x) and backward vectors β_i(x):
    [1]:  α_0(y | x) = 1 if y = start, else 0;    α_i(x) = α_{i-1}(x) M_i(x)
    [2]:  β_{n+1}(y | x) = 1 if y = end, else 0;  β_i(x)^T = M_{i+1}(x) β_{i+1}(x)^T
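These recurrences can be sketched directly and checked against the invariant α_i(x) β_i(x)^T = Z(x) for every position i. A minimal sketch; the state set, sequence length, and log-potentials below are illustrative assumptions:

```python
# Sketch of the forward/backward recurrences α_i(x) = α_{i-1}(x) M_i(x)
# and β_i(x)^T = M_{i+1}(x) β_{i+1}(x)^T on a toy chain.  Potentials are
# made up; transitions into 'start' and out of 'end' are disallowed.
import math

STATES = ["start", "A", "B", "end"]
N = 4  # sequence length n

def M(i):
    """M_i(y', y | x) with illustrative log-potentials."""
    m = {}
    for yp in STATES:
        for y in STATES:
            if y == "start" or yp == "end":
                m[yp, y] = 0.0
            else:
                m[yp, y] = math.exp(0.2 * i if yp == y else -0.3)
    return m

def forward():
    alpha = [{y: 1.0 if y == "start" else 0.0 for y in STATES}]
    for i in range(1, N + 2):
        mi = M(i)
        alpha.append({y: sum(alpha[-1][yp] * mi[yp, y] for yp in STATES)
                      for y in STATES})
    return alpha

def backward():
    beta = {N + 1: {y: 1.0 if y == "end" else 0.0 for y in STATES}}
    for i in range(N, -1, -1):
        mi = M(i + 1)
        beta[i] = {yp: sum(mi[yp, y] * beta[i + 1][y] for y in STATES)
                   for yp in STATES}
    return beta
```

With these vectors, Z(x) = α_{n+1}(end | x) = β_0(start | x), and the per-position products α_{i-1}(y' | x) M_i(y', y | x) β_i(y | x) / Z(x) give the edge marginals needed for the feature expectations.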
16. Parameter Estimation for CRFs
    To deal with this, two algorithms are proposed:
    - Algorithm S uses a "slack feature".
    - Algorithm T keeps track of partial T totals.
    Algorithm S [1]
    - Define the slack feature
        s(x, y) = S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x)
      where S is a constant chosen so that s(x^{(i)}, y) ≥ 0 for all y and all observation vectors x^{(i)} in the training set, thus making T(x, y) = S.
    - Feature s is "global": it does not correspond to any particular edge or vertex.
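The slack feature's bookkeeping is simple to sketch: it tops up every pair's total feature count to the same constant S. The feature counts below are hypothetical:

```python
# Sketch of Algorithm S's slack feature: s(x, y) = S - (total feature count),
# so every pair has the same total T(x, y) = S.  Counts are made up.
def total_count(feature_counts):
    """Total feature count T(x, y) before adding the slack feature."""
    return sum(feature_counts)

def slack(feature_counts, S):
    """Slack value padding the total up to the constant S."""
    return S - total_count(feature_counts)

# Hypothetical training pairs with differing totals:
counts = [[3, 1, 2], [1, 1, 1], [4, 0, 2]]
S = max(total_count(c) for c in counts)  # guarantees slack >= 0 on the training set
totals = [total_count(c) + slack(c, S) for c in counts]  # all equal S
```

With T(x, y) constant, the exponential term exp(δλ_k T(x, y)) factors out of the iterative-scaling equation, which is what makes the closed-form update on the next slide possible.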
17. Parameter Estimation for CRFs
    Algorithm S [1]
      δλ_k = (1/S) log( E_p̃[f_k] / E_p[f_k] ),    δμ_k = (1/S) log( E_p̃[g_k] / E_p[g_k] )
    where the model expectations are computed with the forward and backward vectors, e.g.
      E_p[f_k] = Σ_x p̃(x) Σ_{i=1}^{n+1} Σ_{y',y} f_k(e_i, y|_{e_i} = (y', y), x) α_{i-1}(y' | x) M_i(y', y | x) β_i(y | x) / Z(x)
18. Parameter Estimation for CRFs
    Algorithm S [1]
    - The constant S can be quite large, since in practice it is proportional to the length of the longest training observation sequence.
    - The algorithm may therefore converge slowly, taking very small steps toward the maximum in each iteration.
19. Parameter Estimation for CRFs
    Algorithm T [1]
    - Accumulates feature expectations into counters indexed by T(x).
    - Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
    β_k and γ_k are the unique positive roots of the polynomial equations
      Σ_t a_{k,t} β_k^t = E_p̃[f_k],    Σ_t b_{k,t} γ_k^t = E_p̃[g_k]
    which can be easily computed by Newton's method.
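The root-finding step can be sketched with a generic Newton iteration; the polynomial coefficients and target below are made up, not values produced by Algorithm T:

```python
# Sketch of solving sum_t a_t * b**t = target for its positive root by
# Newton's method, as in Algorithm T's update.  Coefficients are made up.
def newton_root(coeffs, target, b0=1.0, tol=1e-12, max_iter=100):
    """Find b > 0 with sum(a_t * b**t) = target; coeffs maps t -> a_t >= 0."""
    b = b0
    for _ in range(max_iter):
        f = sum(a * b ** t for t, a in coeffs.items()) - target
        df = sum(t * a * b ** (t - 1) for t, a in coeffs.items() if t > 0)
        step = f / df  # df > 0 for b > 0 when coefficients are non-negative
        b -= step
        if abs(step) < tol:
            break
    return b

# Example: a_1 = 2, a_2 = 1  ->  2b + b^2 = 8  ->  positive root b = 2.
root = newton_root({1: 2.0, 2: 1.0}, 8.0)
```

Because the coefficients a_{k,t} are non-negative, the polynomial is increasing and convex for b > 0, so the positive root is unique and Newton's method converges quickly, giving δλ_k = log β_k.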
20. Experiments
    Two sets of experiments with synthetic data highlight the differences between CRFs and MEMMs:
    - Modeling the label bias problem
    - Modeling mixed-order sources
    Modeling the label bias problem
    - 2,000 training and 500 test samples generated by an HMM.
    - CRF error rate: 4.6%
    - MEMM error rate: 42%
    - CRFs solve the label bias problem.
21. Experiments
    Modeling mixed-order sources
    - CRFs converge in 500 iterations.
    - MEMMs converge in 100 iterations.
    [Figure: MEMM vs. HMM error-rate scatter plot]
22. Experiments
    [Figure: CRF vs. MEMM error-rate scatter plot]
23. Experiments
    [Figure: CRF vs. HMM error-rate scatter plot]
    - Each open square represents a data set with α < 1/2, and a solid square indicates a data set with α ≥ 1/2.
    - When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.
24. Experiments
    POS tagging experiments
    - First-order HMM, MEMM, and CRF models.
    - Data set: Penn Treebank.
    - 50%-50% train-test split.
    - The optimal MEMM parameter vector is used as a starting point for training the corresponding CRF, to accelerate convergence.
25. Conclusions
    - CRFs are discriminatively trained models for sequence segmentation and labeling.
    - They combine arbitrary, overlapping, and agglomerative observation features from both the past and the future.
    - Training and decoding are efficient, based on dynamic programming.
    - Parameter estimation is guaranteed to find the global optimum.
26. References
    [1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
    [2] Hanna M. Wallach. Conditional Random Fields: An Introduction. University of Pennsylvania CIS Technical Report MS-CIS-04-21.
    Reference slides by Rongkun Shen.