  • In the case of a neural network, for example, the learned model is quite meaningless.
  • The bigram (interest rate) and the unigram (rate) suggest the financial sense of interest.
  • If we know the POS of certain words, pre-tagging those words can improve the overall quality of POS tagging by the automatic tagger. Note that when the target word is mistagged, we can no longer be confident of the quality of tagging around it. We found many such mistaggings of the head words in the Sval-1 and Sval-2 data: 5% of head words had radical mistags, and 20% had mistags of some kind (radical or subtle). So we decided to find out why this was happening and, hopefully, do something about it.
  • Notice the different tag sets to the right of turn. P0, P-2 and so on have similar meanings. By combination I mean one tree whose nodes may be any of the different POS features: P0, P1, P-2 and so on.
  • We wanted to use guaranteed pre-tagging to obtain higher-quality parses. Head and parent words are marked in red, and all four of them suggest a particular sense of hard and line. "The hard work": the not easy, difficult sense. "The hard surface": the not soft, physical sense. Fasten and cross are the parents of the noun phrase "the line". "Fasten the line": the cord sense. "Cross the line": the division sense.
  • Sense-tagged data:
    • 1. Sval-1 (2-24) and Sval-2 (2-32) data were created so that target words with varying numbers of senses are represented. Sval-1 is annotated with senses from HECTOR, Sval-2 with senses from WordNet.
    • 2. Interest data were created by Bruce and Wiebe from the Penn Treebank and WSJ (ACL/DCI version), and annotated with 6 senses from LDOCE.
    • 3. Serve data were created by Leacock and Chodorow from the WSJ (1987-89) and APHB corpus, and annotated with four senses from WordNet.
    • 4. Hard data were created by Leacock and Chodorow from the SJM corpus, and annotated with three senses from WordNet.
    • 5. Line data were created by Leacock et al. from the WSJ (1987-89) and APHB corpus, and annotated with 6 senses from WordNet.
  • Surface form does not do much better than the baseline. Unigrams and bigrams both do well (especially considering that they are lexical features and easily captured).
  • We have improvements over the baseline (much is not expected, as we are using just individual POS features). Interestingly, P1 is found to be best (we found this in all data). The breakdown by part of speech of the target word shows that verbs and adjectives do best with P1: the verb-object relation is in effect being captured. Nouns are helped by POS tags on either side: subject-verb and verb-object relations (hence both sides help).
  • Results are similar to those for Sval-1.
  • A simple combination of POS features does almost as well as unigrams and bigrams. Note the much lower number of features used compared to unigrams and bigrams. P0, P1 is found to be the most potent combination for Sval-1 and Sval-2. Larger context is found to be much more helpful for the line, hard, serve and interest data than for the Sval data. We think this is because of the much larger amounts of training data.
  • Head is found to be best. Verbs are usually heads themselves, and hence the head feature is not very useful for them. Parent is found to do reasonably well.
  • Results are similar to those on the last slide.
  • Optimal ensemble is the upper bound on the accuracy achievable by an ensemble technique. One tree with all features may yield even better results, but we cannot say much about that and it is beyond the scope of this work.
  • Note the reasonable amount of redundancy (Base); that was expected. Note that the simple ensemble does slightly better than the individual features; in the case of the line and hard data it does worse (not sure why). This suggests that a powerful ensemble technique is desirable. Note the large amounts of complementarity, as suggested by the optimal ensemble values, which are around the best achieved so far. A combination of simple lexical and syntactic features can give results close to the state of the art.
  • Transcript

    • 1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
      • Saif Mohammad, University of Toronto, http://www.cs.toronto.edu/~smm
      • Ted Pedersen, University of Minnesota, http://www.d.umn.edu/~tpederse
    • 2. Word Sense Disambiguation
      • Harry cast a bewitching spell
      • Humans immediately understand spell to mean a charm or incantation.
        • Reading out letter by letter, or a period of time?
          • Words with multiple senses – polysemy, ambiguity!
        • Utilize background knowledge and context.
      • Machines lack background knowledge.
        • Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem.
        • Best accuracies in recent international event, around 65%.
    • 3. Why do we need WSD?
      • Information Retrieval
        • Query: cricket bat
          • Documents pertaining to the insect and the mammal are irrelevant.
      • Machine Translation
        • Consider English to Hindi translation.
          • head to sar (upper part of the body) or adhyaksh (leader)?
      • Machine-human interaction
        • Instructions to machines.
          • Interactive home system: turn on the lights
          • Domestic Android: get the door
      • Applications are widespread and will affect our way of life.
    • 4. Terminology
      • Harry cast a bewitching spell
      • Target word – the word whose intended sense is to be identified.
        • spell
      • Context – the sentence housing the target word and possibly, 1 or 2 sentences around it.
        • Harry cast a bewitching spell
      • Instance – target word along with its context.
      • WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses.
    • 5. Corpus-Based Supervised Machine Learning
      • A computer program is said to learn from experience … if its performance at tasks … improves with experience .
      • - Mitchell
      • Task : Word Sense Disambiguation of given test instances.
      • Performance : Ratio of instances correctly disambiguated to the total test instances – accuracy.
      • Experience : Manually created instances such that target words are marked with intended sense – training instances.
        • Harry cast a bewitching spell / incantation
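The performance measure on this slide, and the majority-sense baseline that the later result tables compare against, can be sketched as follows. This is a minimal illustration only: the function names and the toy sense labels are assumptions made for the example, not part of the authors' system.

```python
from collections import Counter


def majority_sense(training_senses):
    """Return the most frequent sense among the training instances."""
    return Counter(training_senses).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Ratio of correctly disambiguated instances to all test instances."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data for the target word "spell".
train = ["incantation", "incantation", "period", "incantation", "letters"]
gold = ["incantation", "period", "incantation", "incantation"]

# The majority baseline assigns the most frequent training sense everywhere.
baseline = majority_sense(train)
baseline_accuracy = accuracy([baseline] * len(gold), gold)
```

Any classifier worth using should beat `baseline_accuracy` on the same test instances.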
    • 6. Decision Trees
      • A kind of classifier.
        • Assigns a class by asking a series of questions.
        • Questions correspond to features of the instance.
        • Question asked depends on answer to previous question.
      • Inverted tree structure.
        • Interconnected nodes.
          • Top most node is called the root.
        • Each node corresponds to a question / feature.
        • Each possible value of feature has corresponding branch.
        • Leaves terminate every path from root.
          • Each leaf is associated with a class.
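The structure described above (internal nodes asking about one feature, one branch per feature value, a class at every leaf) can be sketched in a few lines. The feature names and senses below are hypothetical, and the tree is written by hand rather than learned; it only illustrates how classification walks from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    sense: str  # class assigned when a path ends here


@dataclass
class Node:
    feature: str                 # question asked at this node
    branches: Dict[int, "Tree"]  # one branch per possible feature value


Tree = Union[Leaf, Node]


def classify(tree, instance):
    """Follow the branch matching each answer until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[instance[tree.feature]]
    return tree.sense


# Hand-built toy tree for "spell"; the features are assumed binary flags.
toy_tree = Node("context_has_cast", {
    1: Leaf("incantation"),
    0: Node("context_has_dry", {
        1: Leaf("period"),
        0: Leaf("letters"),
    }),
})

sense = classify(toy_tree, {"context_has_cast": 0, "context_has_dry": 1})
```

The question asked second depends on the answer to the first, which is the property the slide highlights.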
    • 7. WSD Tree
      • [Figure: a binary decision tree for WSD. Internal nodes test Feature 1 to Feature 4, branches are labelled 0 and 1, and each leaf assigns one of SENSE 1 to SENSE 4.]
    • 8. Choice of Learning Algorithm
      • Why use decision trees for WSD ?
        • It has drawbacks – training data fragmentation
        • What about other learning algorithms such as neural networks?
      • Context is a rich source of discrete features.
      • The learned model is likely to be meaningful.
        • May provide insight into the interaction of features.
      • Pedersen [2001]*: Choosing the right features is of greater significance than the learning algorithm itself.
      • * "A Decision Tree of Bigrams is an Accurate Predictor of Word Sense", T. Pedersen, in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2-7, 2001, Pittsburgh, PA.
    • 9. Lexical Features
      • Surface form
        • A word we observe in text.
        • Case(n)
          • 1. Object of investigation 2. frame or covering 3. A weird person
          • Surface forms : case , cases , casing
          • An occurrence of casing suggests sense 2.
      • Unigrams and Bigrams
        • One word and two word sequences in text.
        • The interest rate is low
        • Unigrams: the, interest, rate, is, low
        • Bigrams: the interest, interest rate, rate is, is low
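The unigram and bigram lists on this slide can be reproduced with a short sketch. Lowercasing and whitespace tokenization are simplifying assumptions, and `has_bigram` is a hypothetical helper showing how a bigram could become a binary feature for a classifier.

```python
def unigrams(tokens):
    """One-word sequences: simply the tokens themselves."""
    return list(tokens)


def bigrams(tokens):
    """Two-word sequences of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def has_bigram(tokens, pair):
    """Binary feature: 1 if the given bigram occurs in the context, else 0."""
    return int(pair in bigrams(tokens))


tokens = "The interest rate is low".lower().split()

context_unigrams = unigrams(tokens)
context_bigrams = bigrams(tokens)
financial_cue = has_bigram(tokens, ("interest", "rate"))
```

A feature such as `financial_cue` is the kind of discrete question a decision tree node can ask.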
    • 10. Part of Speech Tagging
      • Brill Tagger – most widely used tool.
        • Accuracy around 95%.
        • Source code available.
        • Easily understood rules.
      • Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging.
        • Brill tagger does not guarantee pre-tagging.
        • A patch to the tagger provided – BrillPatch*.
        • * "Guaranteed Pre-Tagging for the Brill Tagger", Mohammad, S. and Pedersen, T., in Proceedings of Fourth International Conference of Intelligent Systems and Text Processing, February 2003, Mexico.
    • 11. Part of Speech Features
      • A word used in different senses is likely to have different sets of pos tags around it.
      • Why did jack turn/VB against/IN his/PRP$ team/NN
      • Why did jack turn/VB left/NN at/IN the/DT crossing
      • Features used
        • Individual word POS: P-2, P-1, P0, P1, P2
          • P1 = JJ implies that the word to the right of the target word is an adjective.
        • A combination of the above.
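A minimal sketch of extracting the individual POS features P-2 to P2 from a tagged context is given below. The function name, the `(word, tag)` input format and the use of `None` for positions outside the available context are assumptions for illustration. Only the words tagged on the slide are included, so the positions to the left of turn come out as `None`.

```python
def pos_features(tagged, target_index, window=2):
    """Map 'P-2' .. 'P2' to the POS tag at that offset from the target word.

    `tagged` is a list of (word, tag) pairs. Positions that fall outside
    the list are reported as None (an assumption of this sketch).
    """
    features = {}
    for offset in range(-window, window + 1):
        position = target_index + offset
        name = f"P{offset}"  # e.g. "P-2", "P0", "P1"
        if 0 <= position < len(tagged):
            features[name] = tagged[position][1]
        else:
            features[name] = None
    return features


# The two tagged fragments from the slide, with "turn" as the target word.
turn_against = [("turn", "VB"), ("against", "IN"), ("his", "PRP$"), ("team", "NN")]
turn_left = [("turn", "VB"), ("left", "NN"), ("at", "IN"), ("the", "DT")]

features_against = pos_features(turn_against, 0)
features_left = pos_features(turn_left, 0)
```

The two contexts share P0 but differ in P1 and P2, which is the signal these features are meant to expose. A combination of features is then just a subset of this dictionary, for example the values of P0 and P1 together.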
    • 12. Parse Features
      • Collins Parser* used to parse the data.
        • Source code available.
        • Uses part of speech tagged data as input.
      • Head word of a phrase.
        • the hard work, the hard surface
        • Phrase itself: noun phrase, verb phrase and so on.
      • Parent: Head word of the parent phrase.
        • fasten the line, cross the line
        • Parent phrase.
      • * http://www.ai.mit.edu/people/mcollins
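The four parse features (head word, phrase, parent word, parent phrase) can be illustrated on a hand-written tree. This is only a sketch: the nested-tuple tree format, the function names and especially the head rule (first verb for a VP, last child otherwise) are simplifying assumptions, not the head-finding procedure of the Collins Parser.

```python
def is_word(node):
    """A word is (pos_tag, word); a phrase is (label, [children])."""
    return isinstance(node[1], str)


def head_word(node):
    """Assumed toy head rule: first verb child of a VP, else the last child."""
    if is_word(node):
        return node[1]
    label, children = node
    if label == "VP":
        for child in children:
            if is_word(child) and child[0].startswith("VB"):
                return child[1]
    return head_word(children[-1])


def parse_features(phrase, target, parent=None):
    """Find the phrase directly containing `target` and describe it."""
    label, children = phrase
    for child in children:
        if is_word(child):
            if child[1] == target:
                return {
                    "head": head_word(phrase),
                    "phrase": label,
                    "parent": head_word(parent) if parent else None,
                    "parent_phrase": parent[0] if parent else None,
                }
        else:
            found = parse_features(child, target, phrase)
            if found:
                return found
    return None


# Hand-built trees for two of the slide's examples.
fasten = ("VP", [("VB", "fasten"), ("NP", [("DT", "the"), ("NN", "line")])])
hard_work = ("NP", [("DT", "the"), ("JJ", "hard"), ("NN", "work")])

line_features = parse_features(fasten, "line")
hard_features = parse_features(hard_work, "hard")
```

For line the parent word is fasten, and for hard the head word is work, matching the cues the slide points to.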
    • 13. Sample Parse Tree
      • [Figure: parse tree of "Harry cast a bewitching spell": (SENTENCE (NOUN PHRASE Harry/NNP) (VERB PHRASE cast/VBD (NOUN PHRASE a/DT bewitching/JJ spell/NN)))]
    • 14. Sense-Tagged Data
      • Senseval-2 data
        • 4,328 instances of test data and 8,611 instances of training data, ranging over 73 different nouns, verbs and adjectives.
      • Senseval-1 data
        • 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives.
      • line, hard, interest, serve data
        • 4149, 4337, 4378 and 2476 sense-tagged instances with line, hard, serve and interest as the head words.
        • Around 50,000 sense-tagged instances in all!
    • 15. Experiments
    • 16. Lexical: Senseval-1 & Senseval-2
      • Accuracy per data set (columns: Majority | Surface Form | Unigram | Bigram)
        • Sval-1: 56.3% | 62.9% | 66.9% | 66.9%
        • Sval-2: 47.7% | 49.3% | 55.3% | 55.1%
        • line: 54.3% | 54.3% | 74.5% | 72.9%
        • hard: 81.5% | 81.5% | 83.4% | 89.5%
        • serve: 42.2% | 44.2% | 73.3% | 72.1%
        • interest: 54.9% | 64.0% | 75.7% | 79.9%
    • 17. Individual Word POS (Senseval-1)
      • Accuracy per feature (columns: Adj. | Verbs | Nouns | All)
        • P-1: 64.3% | 58.2% | 62.2% | 59.2%
        • P0: 64.3% | 58.2% | 62.5% | 60.3%
        • P1: 66.2% | 64.4% | 65.4% | 63.9%
        • P-2: 64.0% | 58.6% | 58.2% | 57.5%
        • P-2: 65.2% | 60.8% | 60.0% | 59.9%
        • Majority: 64.3% | 56.9% | 57.2% | 56.3%
    • 18. Individual Word POS (Senseval-2)
      • Accuracy per feature (columns: Adj. | Verbs | Nouns | All)
        • P-1: 59.0% | 40.2% | 55.2% | 49.6%
        • P0: 58.2% | 40.6% | 55.7% | 49.9%
        • P1: 61.0% | 49.1% | 53.8% | 53.1%
        • P-2: 57.9% | 38.0% | 51.9% | 47.1%
        • P-2: 59.4% | 43.2% | 50.2% | 48.9%
        • Majority: 59.0% | 39.7% | 51.0% | 47.7%
    • 19. Combining POS Features
      • Accuracy per data set (columns: Majority | P0, P1 | P-1, P0, P1 | P-2, P-1, P0, P1, P2)
        • Sval-1: 56.3% | 66.7% | 68.0% | 67.8%
        • Sval-2: 47.7% | 54.3% | 54.6% | 54.6%
        • line: 54.3% | 54.1% | 60.4% | 62.3%
        • hard: 81.5% | 81.9% | 84.8% | 86.2%
        • serve: 42.2% | 60.2% | 73.0% | 75.7%
        • interest: 54.9% | 70.5% | 78.8% | 80.6%
    • 20. Parse Features (Senseval-1)
      • Accuracy per feature (columns: Adj. | Verbs | Nouns | All)
        • Parent: 65.8% | 60.3% | 62.6% | 60.6%
        • Phrase: 66.2% | 57.2% | 57.5% | 58.5%
        • Par. Phr.: 66.2% | 58.3% | 58.1% | 57.9%
        • Head: 66.9% | 59.8% | 70.9% | 64.3%
        • Majority: 64.3% | 56.9% | 57.2% | 56.3%
    • 21. Parse Features (Senseval-2)
      • Accuracy per feature (columns: Adj. | Verbs | Nouns | All)
        • Parent: 59.3% | 40.1% | 56.1% | 50.0%
        • Phrase: 59.5% | 40.3% | 51.7% | 48.3%
        • Par. Phr.: 60.3% | 39.1% | 53.0% | 48.5%
        • Head: 64.0% | 39.8% | 58.5% | 51.7%
        • Majority: 59.0% | 39.7% | 51.0% | 47.7%
    • 22. Thoughts…
      • Both lexical and syntactic features perform comparably.
      • But do they get the same instances right?
        • How redundant are the individual feature sets?
      • Are there instances correctly disambiguated by one feature set and not by the other?
        • How complementary are the individual feature sets?
        • Is the effort to combine lexical and syntactic features justified?
    • 23. Measures
      • Baseline Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so.
        • Quantifies redundancy amongst feature sets.
      • Optimal Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so.
        • Difference with individual accuracies quantifies complementarity.
      • We used a simple ensemble which sums up the probabilities for each sense by the individual feature sets to decide the intended sense.
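The three measures on this slide can be sketched as follows. The inputs are assumed to be per-instance correctness flags for the two feature sets and, for the simple ensemble, per-sense probability dictionaries from each classifier. The function names and toy numbers are illustrative assumptions, and alphabetical tie-breaking is a choice made only for this sketch.

```python
def baseline_ensemble_accuracy(correct_a, correct_b):
    """Fraction of instances that BOTH feature sets get right (redundancy)."""
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    return both / len(correct_a)


def optimal_ensemble_accuracy(correct_a, correct_b):
    """Fraction of instances that EITHER feature set gets right (upper bound)."""
    either = sum(a or b for a, b in zip(correct_a, correct_b))
    return either / len(correct_a)


def simple_ensemble(probs_a, probs_b):
    """Sum the per-sense probabilities and return the highest-scoring sense."""
    senses = sorted(set(probs_a) | set(probs_b))  # sorted for stable ties
    totals = {s: probs_a.get(s, 0.0) + probs_b.get(s, 0.0) for s in senses}
    return max(senses, key=totals.get)


# Hypothetical correctness flags for four test instances.
lexical_correct = [True, True, False, False]
syntactic_correct = [True, False, True, False]

base = baseline_ensemble_accuracy(lexical_correct, syntactic_correct)
optimal = optimal_ensemble_accuracy(lexical_correct, syntactic_correct)

# Hypothetical sense distributions for one instance of "spell".
chosen = simple_ensemble(
    {"incantation": 0.6, "period": 0.4},
    {"incantation": 0.3, "period": 0.7},
)
```

In this toy case each feature set alone scores 0.5, so the gap between 0.5 and `optimal` is the complementarity the slide refers to.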
    • 24. Best Combinations
      • Columns: Data (majority) | Set 1 (accuracy) | Set 2 (accuracy) | Base | Ens. | Opt. | Best
        • Sval-2 (47.7%) | Unigrams (55.3%) | P-1, P0, P1 (55.3%) | 43.6% | 57.0% | 67.9% | 66.7%
        • Sval-1 (56.3%) | Unigrams (66.9%) | P-1, P0, P1 (68.0%) | 57.6% | 71.1% | 78.0% | 81.1%
        • line (54.3%) | Unigrams (74.5%) | P-1, P0, P1 (60.4%) | 55.1% | 74.2% | 82.0% | 88.0%
        • hard (81.5%) | Bigrams (89.5%) | Head, Par (87.7%) | 86.1% | 88.9% | 91.3% | 83.0%
        • serve (42.2%) | Unigrams (73.3%) | P-1, P0, P1 (73.0%) | 58.4% | 81.6% | 89.9% | 83.0%
        • interest (54.9%) | Bigrams (79.9%) | P-1, P0, P1 (78.8%) | 67.6% | 83.2% | 90.1% | 89.0%
    • 25. Conclusions
      • Significant amount of complementarity across lexical and syntactic features.
      • Combination of the two justified.
      • We show that simple lexical and part of speech features can achieve state of the art results.
      • How best to capitalize on the complementarity still an open issue.
    • 26. Conclusions (continued)
      • Part of speech of word immediately to the right of target word found most useful.
        • Pos of words immediately to the right of target word best for verbs and adjectives.
        • Nouns helped by tags on either side.
        • (P0, P1) found to be most potent in case of small training data per instance (Sval data).
        • Larger POS context size (P-2, P-1, P0, P1, P2) shown to be beneficial when training data per instance is large (line, hard, serve and interest data).
      • Head word of phrase particularly useful for adjectives
        • Nouns helped by both head and parent.
    • 27. Code, Data & Resources
      • SyntaLex : A system to do WSD using lexical and syntactic features. Weka’s decision tree learning algorithm is utilized.
      • posSenseval : part of speech tags any data in Senseval-2 data format. Brill Tagger used.
      • parseSenseval : parses data in a format as output by the Brill Tagger. Output is in Senseval-2 data format with part of speech and parse information as xml tags. Uses Collins Parser.
      • Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats.
      • BrillPatch : Patch to Brill Tagger to employ Guaranteed Pre-Tagging.
      • http://www.d.umn.edu/~tpederse/code.html
      • http://www.d.umn.edu/~tpederse/data.html
    • 28. Senseval-3 (March 1 to April 15, 2004)
      • Around 8000 training and 4000 test instances.
      • Results expected shortly.
      • Thank You
