Slide Notes
  • 1. In the case of a neural network, for example, the learned model is largely uninterpretable.
  • 1. Bigrams and unigrams: both (interest rate) and (rate) suggest the financial sense of interest.
  • Notice the different tag sets to the right of turn. P0, P-2, etc. have analogous meanings. By combination I mean one tree whose nodes may be any of the different POS features: P0, P1, P-2, and so on.
  • If we know the POS of certain words, pre-tagging those words can improve the overall quality of POS tagging by the automatic tagger. Note that we are no longer confident of the quality of tagging around the target word when it is mistagged. We found many such mistaggings of the head words in the Sval-1 and Sval-2 data (5% of head words had radical mistags, and 20% had mistags in all, radical and subtle). So we decided to find out why this was happening and, hopefully, do something about it.
  • We wanted to utilize guaranteed pre-tagging for higher quality parsing. Head and parent words are marked in red, and all four of them suggest a particular sense of hard and line. The hard work: the "not easy, difficult" sense. The hard surface: the "not soft, physical" sense. Fasten the line: the "cord" sense. Cross the line: the "division" sense.
  • Sval-1 (2-24) and Sval-2 (2-32) data were created such that target words with varying numbers of senses are represented; Sval-1 is annotated with senses from HECTOR, Sval-2 from WordNet. 2. The interest data was created by Bruce and Wiebe from the Penn Treebank and WSJ (ACL/DCI version) and annotated with 6 senses from LDOCE. 3. The serve data was created by Leacock and Chodorow from the WSJ (1987-89) and the APHB corpus and annotated with four senses from WordNet. 4. The hard data was created by Leacock and Chodorow from the SJM corpus and annotated with three senses from WordNet. 5. The line data was created by Leacock et al. from the WSJ (1987-89) and the APHB corpus and annotated with 6 senses from WordNet.
  • The surface form does not do much better than the baseline. Unigrams and bigrams both do quite well (especially considering they are simple lexical features that are easily captured).
  • A simple combination of POS features does almost as well as unigrams and bigrams. Note the much smaller number of features used compared to unigrams and bigrams. P0,P1 was found to be the most potent combination for Sval-1 and Sval-2. Larger context was found to be much more helpful for the line, hard, serve and interest data than for the Sval data; we think this is because of the much larger amounts of training data.
  • The optimal ensemble is the upper bound on the accuracy achievable by an ensemble technique. One tree with all features might yield even better results, but we cannot say much about that and it is beyond the scope of this work.
  • Note the reasonable amount of redundancy (Base); that was expected. Note that the simple ensemble does slightly better than the individual features, though for the line and hard data it does worse (we are not sure why); this suggests that a more powerful ensemble technique is desirable. Note the large amount of complementarity suggested by the optimal ensemble values, which are around the best results achieved so far. A combination of simple lexical and syntactic features can give results close to the state of the art.
  • We see improvements over the baseline (not much is expected, since we are using just individual POS tags). Interestingly, P1 is found to be best (we found this in all the data). The breakdown into individual parts of speech shows that verbs and adjectives do best with P1: verb-object relations are in effect being captured. Nouns are helped by POS tags on either side, since both subject-verb and verb-object relations matter.
  • 1. Similar results to those for Sval-1.
  • The head word is found to be best. Verbs are usually heads themselves, and hence the head feature is not very useful for them. The parent word is found to do reasonably well.
  • 1. Similar results to the previous slide.

Transcript

  • 1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation. Saif Mohammad (Univ. of Toronto, http://www.cs.toronto.edu/~smm) and Ted Pedersen (Univ. of Minnesota, Duluth, http://www.d.umn.edu/~tpederse)
  • 2. Word Sense Disambiguation
    • Harry cast a bewitching spell
    • We understand the target word spell in this context to mean charm or incantation
      • not “reading out letter by letter” or “a period of time”
    • Automatically identifying the intended sense of a word based on its context is hard!
      • Best accuracies often around 65%-75%
  • 3. WSD as Classification
    • Learn model for a given target word from a corpus of manually sense tagged training examples
    • The model assigns target word a sense based on the context in which it occurs
      • Context represented by feature set
    • Evaluate model on held out test set
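
A minimal sketch of this train-and-evaluate loop in Python, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the Weka J48 learner mentioned later in the talk; the feature dictionaries and senses below are made-up toy examples, not the Senseval data.

```python
# Toy supervised WSD pipeline: each instance is a bag of context features,
# the label is the manually assigned sense of the target word "spell".
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

train = [
    ({"uni_cast": 1, "uni_bewitching": 1}, "incantation"),
    ({"uni_dry": 1, "uni_weather": 1},     "period_of_time"),
    ({"uni_letter": 1, "uni_aloud": 1},    "spell_out"),
    ({"uni_magic": 1, "uni_cast": 1},      "incantation"),
]
test = [
    ({"uni_cast": 1, "uni_wand": 1},    "incantation"),
    ({"uni_cold": 1, "uni_weather": 1}, "period_of_time"),
]

vec = DictVectorizer()                                    # context -> feature vector
X_train = vec.fit_transform([f for f, _ in train])
y_train = [sense for _, sense in train]

model = DecisionTreeClassifier().fit(X_train, y_train)    # learn the model

X_test = vec.transform([f for f, _ in test])              # held out test set
y_test = [sense for _, sense in test]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
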
  • 4. Motivations
    • Lexical features do “reasonably well” at supervised WSD…
      • Duluth systems in Senseval-2
      • Pedersen NAACL-2001
    • POS features do “reasonably well” too
    • Complementary or redundant?
      • Complementary? Find the simplest ways to represent instances and combine results to improve performance
      • Redundant? We can reduce feature space without affecting performance
  • 5. Decision Trees
    • Assigns sense to an instance by asking a series of questions
    • Questions correspond to features of the instance and depend on previous answer
    • In the tree…
      • Top most node is called the root
      • Each node corresponds to a feature
      • Each value of a feature has a branch
      • Each path terminates in a sense (a leaf)
  • 6. WSD Tree [slide diagram: an example decision tree whose internal nodes test binary features (Feature 1 through Feature 4), whose branches correspond to feature values 0 and 1, and whose leaves assign senses (Sense 1 through Sense 4)]
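
A minimal hand-coded version of such a tree, assuming binary (0/1) feature values; the feature names and senses are placeholders mirroring the diagram above, not a tree learned from data.

```python
# Each internal node asks about one binary feature; each leaf holds a sense.
TREE = {
    "feature": "Feature 1",
    "branches": {
        0: {"sense": "SENSE 1"},
        1: {
            "feature": "Feature 4",
            "branches": {
                0: {"sense": "SENSE 2"},
                1: {"sense": "SENSE 3"},
            },
        },
    },
}

def classify(instance, node=TREE):
    """Follow branches according to the instance's feature values until a leaf."""
    while "sense" not in node:
        value = instance.get(node["feature"], 0)   # answer to the node's question
        node = node["branches"][value]             # take the matching branch
    return node["sense"]

print(classify({"Feature 1": 1, "Feature 4": 0}))  # -> SENSE 2
```
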
  • 7. Why Decision Trees?
    • Many kinds of features can contribute to WSD performance
    • Many learning algorithms result in comparable classifiers when given the same set of features
    • A learned decision tree captures interactions among features
    • Many implementations available
      • Weka J48
  • 8. Lexical Features
    • Surface form
      • Observed form of target word
    • Unigrams and Bigrams
      • One and two word sequences
    • Ngram Statistics Package
      • http://www.d.umn.edu/~tpederse/nsp.html
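
A sketch of how binary unigram and bigram features could be read off a context window, assuming the features are simple presence flags; in the actual systems the Ngram Statistics Package selects which ngrams become features, and the vocabularies below are illustrative only.

```python
# Represent an instance by which unigrams / bigrams from a chosen feature
# vocabulary occur anywhere in its context (binary presence features).
def lexical_features(context_tokens, unigram_vocab, bigram_vocab):
    feats = {}
    tokens = [t.lower() for t in context_tokens]
    for tok in tokens:                                  # unigram features
        if tok in unigram_vocab:
            feats["uni=" + tok] = 1
    for first, second in zip(tokens, tokens[1:]):       # bigram features
        if (first, second) in bigram_vocab:
            feats["bi=" + first + "_" + second] = 1
    return feats

context = "the bank raised the interest rate last week".split()
print(lexical_features(context, {"rate", "bank"}, {("interest", "rate")}))
# {'uni=bank': 1, 'uni=rate': 1, 'bi=interest_rate': 1}
```
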
  • 9. POS Features
    • Surrounding POS tags indicate different senses:
    • Why did Jack turn/VB against/IN his/PRP$ team/NN
    • Why did Jack turn/VB left/NN at/IN the/DT crossing
    • Individual word POS: P-2, P-1, P0, P1, P2
    • Used individually and in combination
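
A sketch of extracting the individual POS features P-2 through P2, assuming the sentence is already a list of (word, tag) pairs and the index of the target word is known; the function name is illustrative.

```python
# P0 is the tag of the target word itself; P-2, P-1, P1, P2 are the tags of the
# two words to its left and right (None where the window runs off the sentence).
def pos_features(tagged, target_index):
    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        i = target_index + offset
        feats["P" + str(offset)] = tagged[i][1] if 0 <= i < len(tagged) else None
    return feats

tagged = [("Why", "WRB"), ("did", "VBD"), ("Jack", "NNP"),
          ("turn", "VB"), ("against", "IN"), ("his", "PRP$"), ("team", "NN")]
print(pos_features(tagged, target_index=3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'IN', 'P2': 'PRP$'}
```
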
  • 10. Part of Speech Tagging
    • Brill Tagger
      • Open Source
      • Easy to Understand
    • Guaranteed Pre-Tagging
      • Manually tag target words
      • Implemented in BrillPatch
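
A toy illustration of the guaranteed pre-tagging idea, assuming a Brill-style tagger that rewrites an initial tag sequence with contextual rules: the manually supplied tag of the target word can influence its neighbours but is itself never overwritten. This is a conceptual sketch only, not the actual BrillPatch code, and the single rule is hypothetical.

```python
# A Brill-style contextual rule: change "from" to "to" when the previous tag
# matches "prev". Rules are never allowed to fire on the pre-tagged position.
RULES = [{"from": "NN", "to": "VB", "prev": "TO"}]    # hypothetical rule

def tag_with_pretag(initial_tags, pretag_index, pretag):
    tags = list(initial_tags)
    tags[pretag_index] = pretag              # manually supplied, guaranteed tag
    for rule in RULES:
        for i in range(1, len(tags)):
            if i == pretag_index:
                continue                     # never change the pre-tagged word
            if tags[i] == rule["from"] and tags[i - 1] == rule["prev"]:
                tags[i] = rule["to"]
    return tags

# "to line the boxes": pre-tag "line" (index 1) as VB; the rule skips that
# position, but the guaranteed tag still serves as context for its neighbours.
print(tag_with_pretag(["TO", "NN", "DT", "NNS"], pretag_index=1, pretag="VB"))
```
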
  • 11. Parse Features
    • Head word of the target phrase
      • the hard work, the hard surface
    • Head word of the parent phrase
      • fasten the line, cross the line
    • Target and parent phrase POS
      • noun phrase, verb phrase…
    • Used individually and in combination
    • Obtained via Collins Parser
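
A sketch of pulling the head and parent features out of a constituency parse, assuming an nltk.Tree built from bracketed parser output and a deliberately crude head rule (rightmost noun or verb in the phrase); the Collins parser output used in the talk would be processed with proper head-finding rules instead.

```python
from nltk import Tree

# Crude head rule: rightmost token tagged as a noun or verb. Real head finding
# (e.g. Collins-style head tables) is considerably more involved.
def head_word(phrase):
    words = phrase.pos()                        # list of (word, tag) pairs
    for word, tag in reversed(words):
        if tag.startswith(("NN", "VB")):
            return word
    return words[-1][0]

def parse_features(tree, target):
    for pos in tree.treepositions("leaves"):
        if tree[pos] == target:
            phrase = tree[pos[:-2]]             # smallest phrase above the target
            parent = tree[pos[:-3]] if len(pos) > 2 else phrase
            return {"head": head_word(phrase),
                    "parent_head": head_word(parent),
                    "phrase_pos": phrase.label(),
                    "parent_phrase_pos": parent.label()}

tree = Tree.fromstring("(S (NP (DT the) (JJ hard) (NN work)) (VP (VBZ pays)))")
print(parse_features(tree, "hard"))
# {'head': 'work', 'parent_head': 'pays', 'phrase_pos': 'NP', 'parent_phrase_pos': 'S'}
```
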
  • 12. Experiments
    • How accurate are simple classifiers based on a single feature type?
    • How complementary or redundant are lexical and syntactic features?
    • Is it possible (in theory at least) to combine just a few very simple classifiers and achieve near state of the art results?
  • 13. Experiments
    • Learn a decision tree based on a single feature type
      • Surface, Unigram, Bigram, POS, Parse, …
    • Combine pairs of these trees via a simple ensemble technique
      • Weighted vote
  • 14. Sense-Tagged Data
    • Senseval-2 data
      • 4328 test instances, 8611 training instances
      • 73 nouns, verbs and adjectives.
    • Senseval-1 data
      • 8512 test instances, 13276 training instances
      • 35 nouns, verbs and adjectives.
    • line, hard, interest, serve data
      • 4149, 4337, 4378 and 2476 instances
      • 50,000 sense-tagged instances in all!
  • 15. Lexical Features
    • Accuracy by data set (Sval-2 / Sval-1 / line / hard / serve / interest):
    • Majority: 47.7% / 56.3% / 54.3% / 81.5% / 42.2% / 54.9%
    • Surface Form: 49.3% / 62.9% / 54.3% / 81.5% / 44.2% / 64.0%
    • Unigram: 55.3% / 66.9% / 74.5% / 83.4% / 73.3% / 75.7%
    • Bigram: 55.1% / 66.9% / 72.9% / 89.5% / 72.1% / 79.9%
  • 16. POS Features
    • Accuracy by data set (Sval-2 / Sval-1 / line / hard / serve / interest):
    • Majority: 47.7% / 56.3% / 54.3% / 81.5% / 42.2% / 54.9%
    • P-2: 47.1% / 57.5% / 54.9% / 81.6% / 60.3% / 56.0%
    • P-1: 49.6% / 59.2% / 56.2% / 82.1% / 60.2% / 62.7%
    • P0: 49.9% / 60.3% / 54.3% / 81.6% / 58.0% / 64.0%
    • P1: 53.1% / 63.9% / 54.2% / 81.6% / 73.0% / 65.3%
    • P2: 48.9% / 59.9% / 54.3% / 81.7% / 75.7% / 62.3%
  • 17. Combining POS Features
    • Accuracy by data set (Sval-2 / Sval-1 / line / hard / serve / interest):
    • Majority: 47.7% / 56.3% / 54.3% / 81.5% / 42.2% / 54.9%
    • P0, P1: 54.3% / 66.7% / 54.1% / 81.9% / 60.2% / 70.5%
    • P-1, P0, P1: 54.6% / 68.0% / 60.4% / 84.8% / 73.0% / 78.8%
    • P-2, P-1, P0, P1, P2: 54.6% / 67.8% / 62.3% / 86.2% / 75.7% / 80.6%
  • 18. Parse Features
    • Accuracy by data set (Sval-2 / Sval-1 / line / hard / serve / interest):
    • Majority: 47.7% / 56.3% / 54.3% / 81.5% / 42.2% / 54.9%
    • Head Word: 51.7% / 64.3% / 54.7% / 87.8% / 47.4% / 69.1%
    • Parent Word: 50.0% / 60.6% / 59.8% / 84.5% / 57.2% / 67.8%
    • Phrase POS: 52.9% / 58.5% / 54.3% / 81.5% / 41.4% / 54.9%
    • Parent Phrase POS: 52.7% / 57.9% / 54.3% / 81.7% / 41.6% / 54.9%
  • 19. Discussion
    • Lexical and syntactic features perform comparably.
    • Do they get the same instances right?
    • Are there instances disambiguated by one feature set and not by the other?
      • How much are the individual feature sets complementary?
  • 20. Measures
    • Baseline Ensemble : accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so.
    • Optimal Ensemble : accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets do so.
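
A sketch of how these two hypothetical ensembles can be scored from per-instance predictions, assuming the two classifiers' outputs and the gold senses are aligned lists; the example values are made up.

```python
# Baseline ensemble: correct only when BOTH classifiers are right (redundancy).
# Optimal ensemble: correct when EITHER is right (upper bound for any ensemble).
def baseline_ensemble(pred_a, pred_b, gold):
    return sum(a == g and b == g for a, b, g in zip(pred_a, pred_b, gold)) / len(gold)

def optimal_ensemble(pred_a, pred_b, gold):
    return sum(a == g or b == g for a, b, g in zip(pred_a, pred_b, gold)) / len(gold)

gold      = ["s1", "s2", "s1", "s3"]
lexical   = ["s1", "s2", "s2", "s2"]   # lexical classifier's predictions
syntactic = ["s1", "s1", "s1", "s2"]   # syntactic classifier's predictions
print(baseline_ensemble(lexical, syntactic, gold))   # 0.25
print(optimal_ensemble(lexical, syntactic, gold))    # 0.75
```
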
  • 21. Our Ensemble Approach
    • We use a weighted vote ensemble to decide the sense of a target word
    • For a given test instance, it takes the outputs of two classifiers (one lexical and one syntactic) and sums the probabilities associated with each possible sense; the sense with the highest combined score is chosen (see the sketch below)
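
A sketch of the weighted vote described above, assuming each classifier returns a probability distribution over the target word's senses; the sense names and probabilities below are made up for illustration.

```python
# Sum the two classifiers' per-sense probabilities and choose the sense with
# the largest combined score (the summed probabilities act as vote weights).
def weighted_vote(lexical_probs, syntactic_probs):
    senses = set(lexical_probs) | set(syntactic_probs)
    combined = {s: lexical_probs.get(s, 0.0) + syntactic_probs.get(s, 0.0)
                for s in senses}
    return max(combined, key=combined.get)

lexical_probs   = {"cord": 0.55, "division": 0.30, "text": 0.15}
syntactic_probs = {"cord": 0.20, "division": 0.60, "text": 0.20}
print(weighted_vote(lexical_probs, syntactic_probs))   # -> division
```
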
  • 22. Best Combinations
    • Sval-2 (majority 47.7%): Set 1 Unigrams 55.3%, Set 2 P-1,P0,P1 55.3%, Base 43.6%, Ours 57.0%, Optimal 67.9%, Best 66.7%
    • Sval-1 (majority 56.3%): Set 1 Unigrams 66.9%, Set 2 P-1,P0,P1 68.0%, Base 57.6%, Ours 71.1%, Optimal 78.0%, Best 81.1%
    • line (majority 54.3%): Set 1 Unigrams 74.5%, Set 2 P-1,P0,P1 60.4%, Base 55.1%, Ours 74.2%, Optimal 82.0%, Best 88.0%
    • hard (majority 81.5%): Set 1 Bigrams 89.5%, Set 2 Head+Parent 87.7%, Base 86.1%, Ours 88.9%, Optimal 91.3%, Best 83.0%
    • serve (majority 42.2%): Set 1 Unigrams 73.3%, Set 2 P-1,P0,P1 73.0%, Base 58.4%, Ours 81.6%, Optimal 89.9%, Best 83.0%
    • interest (majority 54.9%): Set 1 Bigrams 79.9%, Set 2 P-1,P0,P1 78.8%, Base 67.6%, Ours 83.2%, Optimal 90.1%, Best 89.0%
  • 23. Conclusions
    • Reasonable amount of complementarity across lexical and syntactic features.
    • Simple lexical and part of speech features can be combined to achieve state of the art results.
    • Future Work : How best to capitalize on the complementarity?
  • 24. Senseval-3
    • Approx. 8000 training and 4000 test instances.
      • English lexical sample task.
    • Training data collected via Open Mind Word Expert.
    • Comparative results unveiled at ACL workshop!
  • 25. Software and Data
    • SyntaLex : WSD using lexical and syntactic features.
    • posSenseval : POS tag data in Senseval-2 format using Brill Tagger.
    • parseSenseval : parse output from Brill Tagger using Collins Parser.
    • BrillPatch : Supports Guaranteed Pre-Tagging.
    • Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats.
    • http://www.d.umn.edu/~tpederse/code.html
    • http://www.d.umn.edu/~tpederse/data.html
  • 26. Individual Word POS: Senseval-1
    • Accuracy by part of speech (All / Nouns / Verbs / Adj.):
    • Majority: 56.3% / 57.2% / 56.9% / 64.3%
    • P-2: 57.5% / 58.2% / 58.6% / 64.0%
    • P-1: 59.2% / 62.2% / 58.2% / 64.3%
    • P0: 60.3% / 62.5% / 58.2% / 64.3%
    • P1: 63.9% / 65.4% / 64.4% / 66.2%
    • P2: 59.9% / 60.0% / 60.8% / 65.2%
  • 27. Individual Word POS: Senseval-2
    • Accuracy by part of speech (All / Nouns / Verbs / Adj.):
    • Majority: 47.7% / 51.0% / 39.7% / 59.0%
    • P-2: 47.1% / 51.9% / 38.0% / 57.9%
    • P-1: 49.6% / 55.2% / 40.2% / 59.0%
    • P0: 49.9% / 55.7% / 40.6% / 58.2%
    • P1: 53.1% / 53.8% / 49.1% / 61.0%
    • P2: 48.9% / 50.2% / 43.2% / 59.4%
  • 28. Parse Features: Senseval-1
    • Accuracy by part of speech (All / Nouns / Verbs / Adj.):
    • Majority: 56.3% / 57.2% / 56.9% / 64.3%
    • Head Word: 64.3% / 70.9% / 59.8% / 66.9%
    • Parent Word: 60.6% / 62.6% / 60.3% / 65.8%
    • Phrase: 58.5% / 57.5% / 57.2% / 66.2%
    • Parent Phrase: 57.9% / 58.1% / 58.3% / 66.2%
  • 29. Parse Features: Senseval-2
    • Accuracy by part of speech (All / Nouns / Verbs / Adj.):
    • Majority: 47.7% / 51.0% / 39.7% / 59.0%
    • Head: 51.7% / 58.5% / 39.8% / 64.0%
    • Parent: 50.0% / 56.1% / 40.1% / 59.3%
    • Phrase: 48.3% / 51.7% / 40.3% / 59.5%
    • Parent Phrase: 48.5% / 53.0% / 39.1% / 60.3%