  1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
     Saif Mohammad, Univ. of Toronto, http://www.cs.toronto.edu/~smm
     Ted Pedersen, Univ. of Minnesota, Duluth, http://www.d.umn.edu/~tpederse
  2. Word Sense Disambiguation
     - Harry cast a bewitching spell.
     - We understand the target word "spell" in this context to mean charm or incantation,
       - not "reading out letter by letter" or "a period of time".
     - Automatically identifying the intended sense of a word based on its context is hard!
       - Best accuracies are often around 65%-75%.
  3. WSD as Classification
     - Learn a model for a given target word from a corpus of manually sense-tagged training examples.
     - The model assigns the target word a sense based on the context in which it occurs.
       - Context is represented by a feature set.
     - Evaluate the model on a held-out test set (a minimal sketch of this setup follows below).
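The experiments in this deck use Weka's J48 decision tree learner; purely as an illustrative analogue (not the authors' code), the same train/evaluate setup can be sketched in Python with scikit-learn. The feature names and sense labels below are made-up toy data:

```python
# Minimal sketch of WSD-as-classification, assuming scikit-learn is available.
# The real experiments used Weka's J48; DecisionTreeClassifier is only an analogue.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy sense-tagged instances for the target word "spell":
# each instance is a bag of binary context features plus a manually assigned sense.
train_features = [
    {"unigram=bewitching": 1, "P1=NN": 1},
    {"unigram=rainy": 1, "P-1=DT": 1},
]
train_senses = ["incantation", "period_of_time"]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_features)

clf = DecisionTreeClassifier()          # learn one tree per target word
clf.fit(X_train, train_senses)

# Evaluate on a held-out test instance represented the same way.
test_features = [{"unigram=bewitching": 1}]
print(clf.predict(vec.transform(test_features)))   # predicted sense for the test instance
```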
  4. Motivations
     - Lexical features do "reasonably well" at supervised WSD...
       - Duluth systems in Senseval-2
       - Pedersen, NAACL-2001
     - POS features do "reasonably well" too.
     - Are they complementary or redundant?
       - Complementary? Find the simplest ways to represent instances and combine results to improve performance.
       - Redundant? We can reduce the feature space without affecting performance.
  5. Decision Trees
     - A decision tree assigns a sense to an instance by asking a series of questions.
     - Questions correspond to features of the instance and depend on previous answers.
     - In the tree...
       - The topmost node is called the root.
       - Each internal node corresponds to a feature.
       - Each value of a feature has a branch.
       - Each path terminates in a sense (a leaf); a toy traversal is sketched below.
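As an illustration of this traversal (hypothetical features and senses, not the authors' code), a tiny decision tree can be encoded as nested dictionaries and walked until a leaf is reached:

```python
# Toy decision tree over hypothetical binary features.
# Internal nodes ask about a feature; leaves are senses. Purely illustrative.
tree = {
    "feature": "Feature 1",
    "branches": {
        0: {"sense": "SENSE 1"},                       # leaf
        1: {"feature": "Feature 4",
            "branches": {0: {"sense": "SENSE 2"},
                         1: {"sense": "SENSE 3"}}},
    },
}

def classify(node, instance):
    """Follow branches according to the instance's feature values until a leaf."""
    while "sense" not in node:
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node["sense"]

print(classify(tree, {"Feature 1": 1, "Feature 4": 0}))   # -> SENSE 2
```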
  6. WSD Tree
     [Diagram: an example decision tree whose internal nodes ask binary (0/1) questions
      about Feature 1 through Feature 4, with each path terminating in a leaf labelled
      SENSE 1 through SENSE 4.]
  7. Why Decision Trees?
     - Many kinds of features can contribute to WSD performance.
     - Many learning algorithms result in comparable classifiers when given the same set of features.
     - A learned decision tree captures interactions among features.
     - Many implementations are available.
       - Weka J48
  8. Lexical Features
     - Surface form
       - The observed form of the target word.
     - Unigrams and bigrams
       - One- and two-word sequences in the context (a small extraction sketch follows below).
     - Ngram Statistics Package
       - http://www.d.umn.edu/~tpederse/nsp.html
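The deck extracts these features with the (Perl-based) Ngram Statistics Package; as a hedged, simplified stand-in only, turning one instance's context into binary surface-form, unigram, and bigram features might look like this in Python:

```python
# Rough sketch of lexical feature extraction; the paper used the Ngram Statistics
# Package, this is only a simplified Python stand-in.
def lexical_features(context_words, target_surface_form):
    """Binary surface-form / unigram / bigram features for one instance."""
    features = {f"surface={target_surface_form}": 1}
    for word in context_words:
        features[f"unigram={word}"] = 1
    for first, second in zip(context_words, context_words[1:]):
        features[f"bigram={first}_{second}"] = 1
    return features

print(lexical_features(["harry", "cast", "a", "bewitching", "spell"], "spell"))
```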
  9. POS Features
     - The part of speech of surrounding words can indicate a different sense:
       - Why did Jack turn/VB against/IN his/PRP$ team/NN?
       - Why did Jack turn/VB left/NN at/IN the/DT crossing?
     - Individual word POS features: P-2, P-1, P0, P1, P2
       (the POS of the two words before the target, the target word itself, and the two words after).
     - Used individually and in combination (see the sketch below).
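For illustration only (not the authors' code), extracting the P-2 ... P2 features from an already POS-tagged sentence can be sketched as:

```python
# Sketch: POS-window features P-2 .. P2 around the target word,
# assuming the sentence has already been POS tagged (e.g. by the Brill tagger).
def pos_window_features(tagged_sentence, target_index, window=2):
    """tagged_sentence is a list of (word, tag) pairs; target_index locates the target."""
    features = {}
    for offset in range(-window, window + 1):
        i = target_index + offset
        if 0 <= i < len(tagged_sentence):
            features[f"P{offset}"] = tagged_sentence[i][1]
    return features

sentence = [("why", "WRB"), ("did", "VBD"), ("jack", "NNP"),
            ("turn", "VB"), ("against", "IN"), ("his", "PRP$"), ("team", "NN")]
print(pos_window_features(sentence, target_index=3))
# -> {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'IN', 'P2': 'PRP$'}
```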
  10. Part of Speech Tagging
     - Brill Tagger
       - Open source
       - Easy to understand
     - Guaranteed Pre-Tagging
       - Target words are tagged manually, and the tagger is guaranteed not to change those tags.
       - Implemented in BrillPatch.
  11. Parse Features
     - Head word of the target phrase
       - the hard work, the hard surface
     - Head word of the parent phrase
       - fasten the line, cross the line
     - Target and parent phrase POS
       - noun phrase, verb phrase, ...
     - Used individually and in combination.
     - Obtained via the Collins Parser (a rough extraction sketch follows below).
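The deck obtains these features from Collins Parser output; purely as an illustration, the sketch below pulls the target phrase, its parent phrase, and their (naively chosen) head words out of an already-built constituency parse using NLTK's Tree. The "rightmost leaf" head heuristic is an assumption of this sketch, standing in for proper head-finding rules:

```python
# Sketch only: parse features from an existing constituency parse (nltk.tree.Tree).
# The experiments used the Collins Parser; the "rightmost leaf" head heuristic below
# is a naive stand-in for real head-percolation rules.
from nltk.tree import Tree

def naive_head(phrase):
    return phrase.leaves()[-1]          # e.g. "the hard work" -> "work"

def parse_features(parse, target_leaf_index):
    leaf_pos = parse.leaf_treeposition(target_leaf_index)
    phrase_pos = leaf_pos[:-2]          # phrase above the target's POS tag
    parent_pos = phrase_pos[:-1]        # phrase above that
    phrase, parent = parse[phrase_pos], parse[parent_pos]
    return {
        "head": naive_head(phrase),
        "parent_head": naive_head(parent),
        "phrase_pos": phrase.label(),
        "parent_phrase_pos": parent.label(),
    }

t = Tree.fromstring("(S (NP (DT the) (JJ hard) (NN work)) (VP (VBZ pays)))")
print(parse_features(t, target_leaf_index=1))   # target word: "hard"
# -> {'head': 'work', 'parent_head': 'pays', 'phrase_pos': 'NP', 'parent_phrase_pos': 'S'}
```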
  12. Experiments
     - How accurate are simple classifiers based on a single feature type?
     - How complementary or redundant are lexical and syntactic features?
     - Is it possible (in theory at least) to combine just a few very simple classifiers and achieve near state-of-the-art results?
  13. Experiments
     - Learn a decision tree based on a single feature type.
       - Surface form, unigrams, bigrams, POS, parse features, ...
     - Combine pairs of these trees via a simple ensemble technique.
       - A weighted vote.
  14. Sense-Tagged Data
     - Senseval-2 data
       - 4,328 test instances; 8,611 training instances
       - 73 nouns, verbs and adjectives
     - Senseval-1 data
       - 8,512 test instances; 13,276 training instances
       - 35 nouns, verbs and adjectives
     - line, hard, interest, serve data
       - 4,149; 4,337; 4,378 and 2,476 instances
       - 50,000 sense-tagged instances in all!
  15. Lexical Features

     Accuracy of decision trees learned from each lexical feature type:

     | Feature      | Sval-2 | Sval-1 | line  | hard  | serve | interest |
     |--------------|--------|--------|-------|-------|-------|----------|
     | Majority     | 47.7%  | 56.3%  | 54.3% | 81.5% | 42.2% | 54.9%    |
     | Surface Form | 49.3%  | 62.9%  | 54.3% | 81.5% | 44.2% | 64.0%    |
     | Unigram      | 55.3%  | 66.9%  | 74.5% | 83.4% | 73.3% | 75.7%    |
     | Bigram       | 55.1%  | 66.9%  | 72.9% | 89.5% | 72.1% | 79.9%    |
  16. POS Features

     Accuracy of decision trees learned from each single-position POS feature:

     | Feature  | Sval-2 | Sval-1 | line  | hard  | serve | interest |
     |----------|--------|--------|-------|-------|-------|----------|
     | Majority | 47.7%  | 56.3%  | 54.3% | 81.5% | 42.2% | 54.9%    |
     | P-2      | 47.1%  | 57.5%  | 54.9% | 81.6% | 60.3% | 56.0%    |
     | P-1      | 49.6%  | 59.2%  | 56.2% | 82.1% | 60.2% | 62.7%    |
     | P0       | 49.9%  | 60.3%  | 54.3% | 81.6% | 58.0% | 64.0%    |
     | P1       | 53.1%  | 63.9%  | 54.2% | 81.6% | 73.0% | 65.3%    |
     | P2       | 48.9%  | 59.9%  | 54.3% | 81.7% | 75.7% | 62.3%    |
  17. Combining POS Features

     | Features             | Sval-2 | Sval-1 | line  | hard  | serve | interest |
     |----------------------|--------|--------|-------|-------|-------|----------|
     | Majority             | 47.7%  | 56.3%  | 54.3% | 81.5% | 42.2% | 54.9%    |
     | P0, P1               | 54.3%  | 66.7%  | 54.1% | 81.9% | 60.2% | 70.5%    |
     | P-1, P0, P1          | 54.6%  | 68.0%  | 60.4% | 84.8% | 73.0% | 78.8%    |
     | P-2, P-1, P0, P1, P2 | 54.6%  | 67.8%  | 62.3% | 86.2% | 75.7% | 80.6%    |
  18. Parse Features

     | Feature           | Sval-2 | Sval-1 | line  | hard  | serve | interest |
     |-------------------|--------|--------|-------|-------|-------|----------|
     | Majority          | 47.7%  | 56.3%  | 54.3% | 81.5% | 42.2% | 54.9%    |
     | Head Word         | 51.7%  | 64.3%  | 54.7% | 87.8% | 47.4% | 69.1%    |
     | Parent Word       | 50.0%  | 60.6%  | 59.8% | 84.5% | 57.2% | 67.8%    |
     | Phrase POS        | 52.9%  | 58.5%  | 54.3% | 81.5% | 41.4% | 54.9%    |
     | Parent Phrase POS | 52.7%  | 57.9%  | 54.3% | 81.7% | 41.6% | 54.9%    |
  19. Discussion
     - Lexical and syntactic features perform comparably.
     - Do they get the same instances right?
     - Are there instances disambiguated by one feature set but not by the other?
       - To what extent are the individual feature sets complementary?
  20. Measures
     - Baseline Ensemble: the accuracy of a hypothetical ensemble that predicts the sense correctly only if both individual feature sets do so.
     - Optimal Ensemble: the accuracy of a hypothetical ensemble that predicts the sense correctly if either of the individual feature sets does so.
     - (Both measures are sketched in code below.)
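As a small illustration (not from the paper's code), given per-instance correctness flags for two single-feature classifiers, the two measures can be computed like this:

```python
# Sketch: baseline and optimal ensemble accuracy from per-instance correctness flags.
def baseline_and_optimal(correct_a, correct_b):
    """correct_a / correct_b are parallel lists of booleans, one per test instance."""
    n = len(correct_a)
    baseline = sum(a and b for a, b in zip(correct_a, correct_b)) / n   # both right
    optimal  = sum(a or b for a, b in zip(correct_a, correct_b)) / n    # either right
    return baseline, optimal

# Toy example: classifier A is right on instances 1-3, B on instances 2-4 (of 5).
print(baseline_and_optimal([True, True, True, False, False],
                           [False, True, True, True, False]))   # -> (0.4, 0.8)
```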
  21. Our Ensemble Approach
     - We use a weighted-vote ensemble to decide the sense of a target word.
     - For a given test instance, it takes the output of two classifiers (one lexical and one syntactic), sums the probabilities associated with each possible sense, and picks the sense with the highest total (see the sketch below).
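A minimal sketch of that combination step (illustrative only; the sense labels and probabilities are made up):

```python
# Sketch: weighted-vote combination of two classifiers' sense distributions.
def combine(lexical_probs, syntactic_probs):
    """Each argument maps sense -> probability; return the sense with the largest sum."""
    senses = set(lexical_probs) | set(syntactic_probs)
    totals = {s: lexical_probs.get(s, 0.0) + syntactic_probs.get(s, 0.0) for s in senses}
    return max(totals, key=totals.get)

print(combine({"incantation": 0.6, "period_of_time": 0.4},
              {"incantation": 0.3, "period_of_time": 0.7}))   # -> period_of_time
```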
  22. Best Combinations

     Set 1 and Set 2 list the two combined classifiers with their individual accuracies; the majority baseline is shown with each data set.

     | Data (Majority)  | Set 1          | Set 2              | Base  | Ours  | Optimal | Best  |
     |------------------|----------------|--------------------|-------|-------|---------|-------|
     | Sval-2 (47.7%)   | Unigrams 55.3% | P-1, P0, P1 55.3%  | 43.6% | 57.0% | 67.9%   | 66.7% |
     | Sval-1 (56.3%)   | Unigrams 66.9% | P-1, P0, P1 68.0%  | 57.6% | 71.1% | 78.0%   | 81.1% |
     | line (54.3%)     | Unigrams 74.5% | P-1, P0, P1 60.4%  | 55.1% | 74.2% | 82.0%   | 88.0% |
     | hard (81.5%)     | Bigrams 89.5%  | Head, Parent 87.7% | 86.1% | 88.9% | 91.3%   | 83.0% |
     | serve (42.2%)    | Unigrams 73.3% | P-1, P0, P1 73.0%  | 58.4% | 81.6% | 89.9%   | 83.0% |
     | interest (54.9%) | Bigrams 79.9%  | P-1, P0, P1 78.8%  | 67.6% | 83.2% | 90.1%   | 89.0% |
  23. Conclusions
     - There is a reasonable amount of complementarity across lexical and syntactic features.
     - Simple lexical and part-of-speech features can be combined to achieve state-of-the-art results.
     - Future work: how best to capitalize on the complementarity?
  24. Senseval-3
     - Approximately 8,000 training and 4,000 test instances.
       - English lexical sample task.
     - Training data collected via Open Mind Word Expert.
     - Comparative results unveiled at the ACL workshop!
  25. Software and Data
     - SyntaLex: WSD using lexical and syntactic features.
     - posSenseval: POS tags data in the Senseval-2 format using the Brill Tagger.
     - parseSenseval: parses output from the Brill Tagger using the Collins Parser.
     - BrillPatch: supports Guaranteed Pre-Tagging.
     - Packages to convert the line, hard, serve and interest data to the Senseval-1 and Senseval-2 data formats.
     - http://www.d.umn.edu/~tpederse/code.html
     - http://www.d.umn.edu/~tpederse/data.html
  26. Individual Word POS: Senseval-1

     | Feature  | All   | Nouns | Verbs | Adj.  |
     |----------|-------|-------|-------|-------|
     | Majority | 56.3% | 57.2% | 56.9% | 64.3% |
     | P-2      | 57.5% | 58.2% | 58.6% | 64.0% |
     | P-1      | 59.2% | 62.2% | 58.2% | 64.3% |
     | P0       | 60.3% | 62.5% | 58.2% | 64.3% |
     | P1       | 63.9% | 65.4% | 64.4% | 66.2% |
     | P2       | 59.9% | 60.0% | 60.8% | 65.2% |
  27. Individual Word POS: Senseval-2

     | Feature  | All   | Nouns | Verbs | Adj.  |
     |----------|-------|-------|-------|-------|
     | Majority | 47.7% | 51.0% | 39.7% | 59.0% |
     | P-2      | 47.1% | 51.9% | 38.0% | 57.9% |
     | P-1      | 49.6% | 55.2% | 40.2% | 59.0% |
     | P0       | 49.9% | 55.7% | 40.6% | 58.2% |
     | P1       | 53.1% | 53.8% | 49.1% | 61.0% |
     | P2       | 48.9% | 50.2% | 43.2% | 59.4% |
  28. Parse Features: Senseval-1

     | Feature       | All   | Nouns | Verbs | Adj.  |
     |---------------|-------|-------|-------|-------|
     | Majority      | 56.3% | 57.2% | 56.9% | 64.3% |
     | Head Word     | 64.3% | 70.9% | 59.8% | 66.9% |
     | Parent Word   | 60.6% | 62.6% | 60.3% | 65.8% |
     | Phrase        | 58.5% | 57.5% | 57.2% | 66.2% |
     | Parent Phrase | 57.9% | 58.1% | 58.3% | 66.2% |
  29. Parse Features: Senseval-2

     | Feature       | All   | Nouns | Verbs | Adj.  |
     |---------------|-------|-------|-------|-------|
     | Majority      | 47.7% | 51.0% | 39.7% | 59.0% |
     | Head          | 51.7% | 58.5% | 39.8% | 64.0% |
     | Parent        | 50.0% | 56.1% | 40.1% | 59.3% |
     | Phrase        | 48.3% | 51.7% | 40.3% | 59.5% |
     | Parent Phrase | 48.5% | 53.0% | 39.1% | 60.3% |
