Max Entropy


Published on

Published in: Technology, Business
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Max Entropy

  1. 1. Maximum Entropy Model LING 572 Fei Xia 02/07-02/09/06
  2. 2. History <ul><li>The concept of Maximum Entropy can be traced back along multiple threads to Biblical times. </li></ul><ul><li>Introduced to NLP area by Berger et. Al. (1996). </li></ul><ul><li>Used in many NLP tasks: MT, Tagging, Parsing, PP attachment, LM, … </li></ul>
  3. 3. Outline <ul><li>Modeling: Intuition, basic concepts, … </li></ul><ul><li>Parameter training </li></ul><ul><li>Feature selection </li></ul><ul><li>Case study </li></ul>
  4. 4. Reference papers <ul><li>(Ratnaparkhi, 1997) </li></ul><ul><li>(Ratnaparkhi, 1996) </li></ul><ul><li>(Berger et. al., 1996) </li></ul><ul><li>(Klein and Manning, 2003) </li></ul><ul><li> Different notations. </li></ul>
  5. 5. Modeling
  6. 6. The basic idea <ul><li>Goal: estimate p </li></ul><ul><li>Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”). </li></ul>
  7. 7. Setting <ul><li>From training data, collect (a, b) pairs: </li></ul><ul><ul><li>a: thing to be predicted (e.g., a class in a classification problem) </li></ul></ul><ul><ul><li>b: the context </li></ul></ul><ul><ul><li>Ex: POS tagging: </li></ul></ul><ul><ul><ul><li>a=NN </li></ul></ul></ul><ul><ul><ul><li>b=the words in a window and previous two tags </li></ul></ul></ul><ul><li>Learn the prob of each (a, b): p(a, b) </li></ul>
  8. 8. Features in POS tagging (Ratnaparkhi, 1996) context (a.k.a. history) allowable classes
  9. 9. Maximum Entropy <ul><li>Why maximum entropy? </li></ul><ul><li>Maximize entropy = Minimize commitment </li></ul><ul><li>Model all that is known and assume nothing about what is unknown. </li></ul><ul><ul><li>Model all that is known: satisfy a set of constraints that must hold </li></ul></ul><ul><ul><li>Assume nothing about what is unknown: </li></ul></ul><ul><ul><li>choose the most “uniform” distribution </li></ul></ul><ul><ul><li> choose the one with maximum entropy </li></ul></ul>
  10. 10. Ex1: Coin-flip example (Klein & Manning 2003) <ul><li>Toss a coin: p(H)=p1, p(T)=p2. </li></ul><ul><li>Constraint: p1 + p2 = 1 </li></ul><ul><li>Question: what’s your estimation of p=(p1, p2)? </li></ul><ul><li>Answer: choose the p that maximizes H(p) </li></ul>p1 H p1=0.3
  11. 11. Coin-flip example (cont) p1 p2 H p1 + p2 = 1 p1+p2=1.0, p1=0.3
  12. 12. Ex2: An MT example (Berger et. al., 1996) Possible translation for the word “in” is: Constraint: Intuitive answer:
  13. 13. An MT example (cont) Constraints: Intuitive answer:
  14. 14. An MT example (cont) Constraints: Intuitive answer: ??
  15. 15. Ex3: POS tagging (Klein and Manning, 2003)
  16. 16. Ex3 (cont)
  17. 17. Ex4: overlapping features (Klein and Manning, 2003)
  18. 18. Modeling the problem <ul><li>Objective function: H(p) </li></ul><ul><li>Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). </li></ul><ul><li>Question: How to represent constraints? </li></ul>
  19. 19. Features <ul><li>Feature (a.k.a. feature function, Indicator function) is a binary-valued function on events: </li></ul><ul><li>A: the set of possible classes (e.g., tags in POS tagging) </li></ul><ul><li>B: space of contexts (e.g., neighboring words/ tags in POS tagging) </li></ul><ul><li>Ex: </li></ul>
  20. 20. Some notations Finite training sample of events: Observed probability of x in S: The model p’s probability of x: Model expectation of : Observed expectation of : The j th feature: (empirical count of )
  21. 21. Constraints <ul><li>Model’s feature expectation = observed feature expectation </li></ul><ul><li>How to calculate ? </li></ul>
  22. 22. Training data  observed events
  23. 23. Restating the problem The task: find p* s.t. where Objective function: -H(p) Constraints: Add a feature
  24. 24. Questions <ul><li>Is P empty? </li></ul><ul><li>Does p* exist? </li></ul><ul><li>Is p* unique? </li></ul><ul><li>What is the form of p*? </li></ul><ul><li>How to find p*? </li></ul>
  25. 25. What is the form of p*? (Ratnaparkhi, 1997) Theorem: if then Furthermore, p* is unique.
  26. 26. Using Lagrangian multipliers Minimize A(p):
  27. 27. Two equivalent forms
  28. 28. Relation to Maximum Likelihood The log-likelihood of the empirical distribution as predicted by a model q is defined as Theorem: if then Furthermore, p* is unique.
  29. 29. Summary (so far) The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample Goal: find p* in P, which maximizes H(p). It can be proved that when p* exists it is unique.
  30. 30. Summary (cont) <ul><li>Adding constraints (features): </li></ul><ul><li>(Klein and Manning, 2003) </li></ul><ul><ul><li>Lower maximum entropy </li></ul></ul><ul><ul><li>Raise maximum likelihood of data </li></ul></ul><ul><ul><li>Bring the distribution further from uniform </li></ul></ul><ul><ul><li>Bring the distribution closer to data </li></ul></ul>
  31. 31. Parameter estimation
  32. 32. Algorithms <ul><li>Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) </li></ul><ul><li>Improved Iterative Scaling (IIS): (Della Pietra et al., 1995) </li></ul>
  33. 33. GIS: setup <ul><li>Requirements for running GIS: </li></ul><ul><li>Obey form of model and constraints: </li></ul><ul><li>An additional constraint: </li></ul>Let Add a new feature f k+1 :
  34. 34. GIS algorithm <ul><li>Compute d j , j=1, …, k+1 </li></ul><ul><li>Initialize (any values, e.g., 0) </li></ul><ul><li>Repeat until converge </li></ul><ul><ul><li>For each j </li></ul></ul><ul><ul><ul><li>Compute </li></ul></ul></ul><ul><ul><ul><li>Update </li></ul></ul></ul>where
  35. 35. Approximation for calculating feature expectation
  36. 36. Properties of GIS <ul><li>L(p (n+1) ) >= L(p (n) ) </li></ul><ul><li>The sequence is guaranteed to converge to p*. </li></ul><ul><li>The converge can be very slow. </li></ul><ul><li>The running time of each iteration is O(NPA): </li></ul><ul><ul><li>N: the training set size </li></ul></ul><ul><ul><li>P: the number of classes </li></ul></ul><ul><ul><li>A: the average number of features that are active for a given event (a, b). </li></ul></ul>
  37. 37. IIS algorithm <ul><li>Compute d j , j=1, …, k+1 and </li></ul><ul><li>Initialize (any values, e.g., 0) </li></ul><ul><li>Repeat until converge </li></ul><ul><ul><li>For each j </li></ul></ul><ul><ul><ul><li>Let be the solution to </li></ul></ul></ul><ul><ul><ul><li>Update </li></ul></ul></ul>
  38. 38. Calculating If Then GIS is the same as IIS Else must be calcuated numerically.
  39. 39. Feature selection
  40. 40. Feature selection <ul><li>Throw in many features and let the machine select the weights </li></ul><ul><ul><li>Manually specify feature templates </li></ul></ul><ul><li>Problem: too many features </li></ul><ul><li>An alternative: greedy algorithm </li></ul><ul><ul><li>Start with an empty set S </li></ul></ul><ul><ul><li>Add a feature at each iteration </li></ul></ul>
  41. 41. Notation The gain in the log-likelihood of the training data: After adding a feature: With the feature set S:
  42. 42. Feature selection algorithm (Berger et al., 1996) <ul><li>Start with S being empty; thus p s is uniform. </li></ul><ul><li>Repeat until the gain is small enough </li></ul><ul><ul><li>For each candidate feature f </li></ul></ul><ul><ul><ul><li>Computer the model using IIS </li></ul></ul></ul><ul><ul><ul><li>Calculate the log-likelihood gain </li></ul></ul></ul><ul><ul><li>Choose the feature with maximal gain, and add it to S </li></ul></ul> Problem: too expensive
  43. 43. Approximating gains (Berger et. al., 1996) <ul><li>Instead of recalculating all the weights, calculate only the weight of the new feature. </li></ul>
  44. 44. Training a MaxEnt Model <ul><li>Scenario #1: </li></ul><ul><li>Define features templates </li></ul><ul><li>Create the feature set </li></ul><ul><li>Determine the optimum feature weights via GIS or IIS </li></ul><ul><li>Scenario #2: </li></ul><ul><li>Define feature templates </li></ul><ul><li>Create candidate feature set S </li></ul><ul><li>At every iteration, choose the feature from S (with max gain) and determine its weight (or choose top-n features and their weights). </li></ul>
  45. 45. Case study
  46. 46. POS tagging (Ratnaparkhi, 1996) <ul><li>Notation variation: </li></ul><ul><ul><li>f j (a, b): a: class, b: context </li></ul></ul><ul><ul><li>f j (h i , t i ): h: history for i th word, t: tag for i th word </li></ul></ul><ul><li>History: </li></ul><ul><li>Training data: </li></ul><ul><ul><li>Treat it as a list of (h i , t i ) pairs. </li></ul></ul><ul><ul><li>How many pairs are there? </li></ul></ul>
  47. 47. Using a MaxEnt Model <ul><li>Modeling: </li></ul><ul><li>Training: </li></ul><ul><ul><li>Define features templates </li></ul></ul><ul><ul><li>Create the feature set </li></ul></ul><ul><ul><li>Determine the optimum feature weights via GIS or IIS </li></ul></ul><ul><li>Decoding: </li></ul>
  48. 48. Modeling
  49. 49. Training step 1: define feature templates History h i Tag t i
  50. 50. Step 2: Create feature set <ul><li>Collect all the features from the training data </li></ul><ul><li>Throw away features that appear less than 10 times </li></ul>
  51. 51. Step 3: determine the feature weights <ul><li>GIS </li></ul><ul><li>Training time: </li></ul><ul><ul><li>Each iteration: O(NTA): </li></ul></ul><ul><ul><ul><li>N: the training set size </li></ul></ul></ul><ul><ul><ul><li>T: the number of allowable tags </li></ul></ul></ul><ul><ul><ul><li>A: average number of features that are active for a (h, t). </li></ul></ul></ul><ul><ul><li>About 24 hours on an IBM RS/6000 Model 380. </li></ul></ul><ul><li>How many features? </li></ul>
  52. 52. Decoding: Beam search <ul><li>Generate tags for w 1 , find top N, set s 1j accordingly, j=1, 2, …, N </li></ul><ul><li>For i=2 to n (n is the sentence length) </li></ul><ul><ul><li>For j=1 to N </li></ul></ul><ul><ul><ul><li>Generate tags for wi, given s (i-1)j as previous tag context </li></ul></ul></ul><ul><ul><ul><li>Append each tag to s (i-1)j to make a new sequence. </li></ul></ul></ul><ul><ul><li>Find N highest prob sequences generated above, and set s ij accordingly, j=1, …, N </li></ul></ul><ul><li>Return highest prob sequence s n1 . </li></ul>
  53. 53. Beam search
  54. 54. Viterbi search
  55. 55. Decoding (cont) <ul><li>Tags for words: </li></ul><ul><ul><li>Known words: use tag dictionary </li></ul></ul><ul><ul><li>Unknown words: try all possible tags </li></ul></ul><ul><li>Ex: “time flies like an arrow” </li></ul><ul><li>Running time: O(NTAB) </li></ul><ul><ul><li>N: sentence length </li></ul></ul><ul><ul><li>B: beam size </li></ul></ul><ul><ul><li>T: tagset size </li></ul></ul><ul><ul><li>A: average number of features that are active for a given event </li></ul></ul>
  56. 56. Experiment results
  57. 57. Comparison with other learners <ul><li>HMM: MaxEnt uses more context </li></ul><ul><li>SDT: MaxEnt does not split data </li></ul><ul><li>TBL: MaxEnt is statistical and it provides probability distributions. </li></ul>
  58. 58. MaxEnt Summary <ul><li>Concept: choose the p* that maximizes entropy while satisfying all the constraints. </li></ul><ul><li>Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data. </li></ul><ul><li>Training: GIS or IIS, which can be slow. </li></ul><ul><li>MaxEnt handles overlapping features well. </li></ul><ul><li>In general, MaxEnt achieves good performances on many NLP tasks. </li></ul>
  59. 59. Additional slides
  60. 60. Ex4 (cont) ??