Advances In Wsd Aaai 2005


Published on

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Long term goal – all words WSD Co-training / self-training are known to improve over basic classifiers Explore their applicability to the problem of WSD
  • Advances In Wsd Aaai 2005

    1. 1. Advances in Word Sense Disambiguation Tutorial at AAAI-2005 July 9, 2005 Rada Mihalcea University of North Texas Ted Pedersen University of Minnesota, Duluth
    2. 2. Goal of the Tutorial <ul><li>Introduce the problem of word sense disambiguation (WSD), focusing on the range of formulations and approaches currently practiced. </li></ul><ul><li>Accessible to anyone with an interest in NLP or AI. </li></ul><ul><li>Persuade you to work on word sense disambiguation </li></ul><ul><ul><li>It’s an interesting problem </li></ul></ul><ul><ul><li>Lots of good work already done, still more to do </li></ul></ul><ul><ul><li>There is infrastructure to help you get started </li></ul></ul><ul><li>Persuade you to use word sense disambiguation in your text applications. </li></ul>
    3. 3. Outline of Tutorial <ul><li>Introduction (Ted) </li></ul><ul><li>Methodolodgy (Rada) </li></ul><ul><li>Knowledge Intensive Methods (Rada) </li></ul><ul><li>Supervised Approaches (Ted) </li></ul><ul><li>Minimally Supervised Approaches (Rada) / BREAK </li></ul><ul><li>Unsupervised Learning (Ted) </li></ul><ul><li>How to Get Started (Rada) </li></ul><ul><li>Conclusion (Ted) </li></ul>
    4. 4. Part 1: Introduction
    5. 5. Outline <ul><li>Definitions </li></ul><ul><li>Ambiguity for Humans and Computers </li></ul><ul><li>Very Brief Historical Overview </li></ul><ul><li>Theoretical Connections </li></ul><ul><li>Practical Applications </li></ul>
    6. 6. Definitions <ul><li>Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. </li></ul><ul><ul><li>Sense Inventory usually comes from a dictionary or thesaurus. </li></ul></ul><ul><ul><li>Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches </li></ul></ul><ul><li>Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. </li></ul><ul><ul><li>Unsupervised techniques </li></ul></ul>
    7. 7. Outline <ul><li>Definitions </li></ul><ul><li>Ambiguity for Humans and Computers </li></ul><ul><li>Very Brief Historical Overview </li></ul><ul><li>Theoretical Connections </li></ul><ul><li>Practical Applications </li></ul>
    8. 8. Computers versus Humans <ul><li>Polysemy – most words have many possible meanings. </li></ul><ul><li>A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human… </li></ul><ul><li>Ambiguity is rarely a problem for humans in their day to day communication, except in extreme cases… </li></ul>
    9. 9. Ambiguity for Humans - Newspaper Headlines! <ul><li>DRUNK GETS NINE YEARS IN VIOLIN CASE </li></ul><ul><li>FARMER BILL DIES IN HOUSE </li></ul><ul><li>PROSTITUTES APPEAL TO POPE </li></ul><ul><li>STOLEN PAINTING FOUND BY TREE </li></ul><ul><li>RED TAPE HOLDS UP NEW BRIDGE </li></ul><ul><li>DEER KILL 300,000 </li></ul><ul><li>RESIDENTS CAN DROP OFF TREES </li></ul><ul><li>INCLUDE CHILDREN WHEN BAKING COOKIES </li></ul><ul><li>MINERS REFUSE TO WORK AFTER DEATH </li></ul>
    10. 10. Ambiguity for a Computer <ul><li>The fisherman jumped off the bank and into the water. </li></ul><ul><li>The bank down the street was robbed! </li></ul><ul><li>Back in the day, we had an entire bank of computers devoted to this problem. </li></ul><ul><li>The bank in that road is entirely too steep and is really dangerous. </li></ul><ul><li>The plane took a bank to the left, and then headed off towards the mountains. </li></ul>
    11. 11. Outline <ul><li>Definitions </li></ul><ul><li>Ambiguity for Humans and Computers </li></ul><ul><li>Very Brief Historical Overview </li></ul><ul><li>Theoretical Connections </li></ul><ul><li>Practical Applications </li></ul>
    12. 12. Early Days of WSD <ul><li>Noted as problem for Machine Translation (Weaver, 1949) </li></ul><ul><ul><li>A word can often only be translated if you know the specific sense intended (A bill in English could be a pico or a cuenta in Spanish) </li></ul></ul><ul><li>Bar-Hillel (1960) posed the following: </li></ul><ul><ul><li>Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy. </li></ul></ul><ul><ul><li>Is “pen” a writing instrument or an enclosure where children play? </li></ul></ul><ul><li>… declared it unsolvable, left the field of MT! </li></ul>
    13. 13. Since then… <ul><li>1970s - 1980s </li></ul><ul><ul><li>Rule based systems </li></ul></ul><ul><ul><li>Rely on hand crafted knowledge sources </li></ul></ul><ul><li>1990s </li></ul><ul><ul><li>Corpus based approaches </li></ul></ul><ul><ul><li>Dependence on sense tagged text </li></ul></ul><ul><ul><li>(Ide and Veronis, 1998) overview history from early days to 1998. </li></ul></ul><ul><li>2000s </li></ul><ul><ul><li>Hybrid Systems </li></ul></ul><ul><ul><li>Minimizing or eliminating use of sense tagged text </li></ul></ul><ul><ul><li>Taking advantage of the Web </li></ul></ul>
    14. 14. Outline <ul><li>Definitions </li></ul><ul><li>Ambiguity for Humans and Computers </li></ul><ul><li>Very Brief Historical Overview </li></ul><ul><li>Interdisciplinary Connections </li></ul><ul><li>Practical Applications </li></ul>
    15. 15. Interdisciplinary Connections <ul><li>Cognitive Science & Psychology </li></ul><ul><ul><li>Quillian (1968), Collins and Loftus (1975) : spreading activation </li></ul></ul><ul><ul><ul><li>Hirst (1987) developed marker passing model </li></ul></ul></ul><ul><li>Linguistics </li></ul><ul><ul><li>Fodor & Katz (1963) : selectional preferences </li></ul></ul><ul><ul><ul><li>Resnik (1993) pursued statistically </li></ul></ul></ul><ul><li>Philosophy of Language </li></ul><ul><ul><li>Wittgenstein (1958): meaning as use </li></ul></ul><ul><ul><li>“ For a large class of cases-though not for all-in which we employ the word &quot;meaning&quot; it can be defined thus: the meaning of a word is its use in the language.” </li></ul></ul>
    16. 16. Outline <ul><li>Definitions </li></ul><ul><li>Ambiguity for Humans and Computers </li></ul><ul><li>Very Brief Historical Overview </li></ul><ul><li>Theoretical Connections </li></ul><ul><li>Practical Applications </li></ul>
    17. 17. Practical Applications <ul><li>Machine Translation </li></ul><ul><ul><li>Translate “bill” from English to Spanish </li></ul></ul><ul><ul><ul><li>Is it a “pico” or a “cuenta”? </li></ul></ul></ul><ul><ul><ul><li>Is it a bird jaw or an invoice? </li></ul></ul></ul><ul><li>Information Retrieval </li></ul><ul><ul><li>Find all Web Pages about “cricket” </li></ul></ul><ul><ul><ul><li>The sport or the insect? </li></ul></ul></ul><ul><li>Question Answering </li></ul><ul><ul><li>What is George Miller’s position on gun control? </li></ul></ul><ul><ul><ul><li>The psychologist or US congressman? </li></ul></ul></ul><ul><li>Knowledge Acquisition </li></ul><ul><ul><li>Add to KB: Herb Bergson is the mayor of Duluth. </li></ul></ul><ul><ul><ul><li>Minnesota or Georgia? </li></ul></ul></ul>
    18. 18. References <ul><li>(Bar-Hillel, 1960) The Present Status of Automatic Translations of Languages. In Advances in Computers. Volume 1. Alt, F. (editor). Academic Press, New York, NY. pp 91-163. </li></ul><ul><li>(Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82) pp. 407-428. </li></ul><ul><li>(Fodor and Katz, 1963) The structure of semantic theory. Language (39). pp 170-210. </li></ul><ul><li>(Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press. </li></ul><ul><li>(Ide and Véronis, 1998)Word Sense Disambiguation: The State of the Art. . Computational Linguistics (24 ) pp 1-40. </li></ul><ul><li>(Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA. pp. 227-270. </li></ul><ul><li>(Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation. University of Pennsylvania. </li></ul><ul><li>(Weaver, 1949): Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors) The MIT Press, Cambridge, Mass. pp. 15-23. </li></ul><ul><li>(Wittgenstein, 1958) Philosophical Investigations, 3 rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York. </li></ul>
    19. 19. Part 2: Methodology
    20. 20. Outline <ul><li>General considerations </li></ul><ul><li>All-words disambiguation </li></ul><ul><li>Targeted-words disambiguation </li></ul><ul><li>Word sense discrimination, sense discovery </li></ul><ul><li>Evaluation (granularity, scoring) </li></ul>
    21. 21. Overview of the Problem <ul><li>Many words have several meanings (homonymy / polysemy) </li></ul><ul><li>D etermine which sense of a word is used in a specific sentence </li></ul><ul><li>Note: </li></ul><ul><ul><li>often, the different senses of a word are closely related </li></ul></ul><ul><ul><ul><li>Ex: title - right of legal ownership </li></ul></ul></ul><ul><ul><ul><li> - document that is evidence of the legal ownership, </li></ul></ul></ul><ul><ul><li>sometimes, several senses can be “activated” in a single context (co-activation) </li></ul></ul><ul><ul><ul><li>Ex: “This could bring competition to the trade” </li></ul></ul></ul><ul><ul><ul><ul><li>competition : - the act of competing </li></ul></ul></ul></ul><ul><ul><ul><ul><li>- the people who are competing </li></ul></ul></ul></ul><ul><ul><li>Ex: “ chair ” – furniture or person </li></ul></ul><ul><ul><li>Ex: “ child ” – young person or human offspring </li></ul></ul>
    22. 22. Word Senses <ul><li>The meaning of a word in a given context </li></ul><ul><li>Word sense representations </li></ul><ul><ul><li>With respect to a dictionary </li></ul></ul><ul><ul><ul><li>chair = a seat for one person, with a support for the back; &quot;he put his coat over the back of the chair and sat down&quot; </li></ul></ul></ul><ul><ul><ul><li>chair = the position of professor; &quot;he was awarded an endowed chair in economics&quot; </li></ul></ul></ul><ul><ul><li>With respect to the translation in a second language </li></ul></ul><ul><ul><ul><li>chair = chaise </li></ul></ul></ul><ul><ul><ul><li>chair = directeur </li></ul></ul></ul><ul><ul><li>With respect to the context where it occurs (discrimination) </li></ul></ul><ul><ul><ul><li>“ Sit on a chair ” “Take a seat on this chair ” </li></ul></ul></ul><ul><ul><ul><li>“ The chair of the Math Department” “The chair of the meeting” </li></ul></ul></ul>
    23. 23. Approaches to Word Sense Disambiguation <ul><li>Knowledge-Based Disambiguation </li></ul><ul><ul><li>use of external lexical resources such as dictionaries and thesauri </li></ul></ul><ul><ul><li>discourse properties </li></ul></ul><ul><li>Supervised Disambiguation </li></ul><ul><ul><li>based on a labeled training set </li></ul></ul><ul><ul><li>the learning system has: </li></ul></ul><ul><ul><ul><li>a training set of feature-encoded inputs AND </li></ul></ul></ul><ul><ul><ul><li>their appropriate sense label (category) </li></ul></ul></ul><ul><li>Unsupervised Disambiguation </li></ul><ul><ul><li>based on unlabeled corpora </li></ul></ul><ul><ul><li>The learning system has: </li></ul></ul><ul><ul><ul><li>a training set of feature-encoded inputs BUT </li></ul></ul></ul><ul><ul><ul><li>NOT their appropriate sense label (category) </li></ul></ul></ul>
    24. 24. All Words Word Sense Disambiguation <ul><li>Attempt to disambiguate all open-class words in a text </li></ul><ul><ul><li>“He put his suit over the back of the chair ” </li></ul></ul><ul><li>Knowledge-based approaches </li></ul><ul><li>Use information from dictionaries </li></ul><ul><ul><li>Definitions / Examples for each meaning </li></ul></ul><ul><ul><ul><li>Find similarity between definitions and current context </li></ul></ul></ul><ul><li>Position in a semantic network </li></ul><ul><ul><ul><li>Find that “ table ” is closer to “ chair/furniture ” than to “ chair/person ” </li></ul></ul></ul><ul><li>Use discourse properties </li></ul><ul><ul><ul><li>A word exhibits the same sense in a discourse / in a collocation </li></ul></ul></ul>
    25. 25. All Words Word Sense Disambiguation <ul><li>Minimally supervised approaches </li></ul><ul><ul><li>Learn to disambiguate words using small annotated corpora </li></ul></ul><ul><ul><li>E.g. SemCor – corpus where all open class words are disambiguated </li></ul></ul><ul><ul><ul><li>200,000 running words </li></ul></ul></ul><ul><li>Most frequent sense </li></ul>
    26. 26. Targeted Word Sense Disambiguation <ul><li>Disambiguate one target word </li></ul><ul><ul><ul><li>“ Take a seat on this chair ” </li></ul></ul></ul><ul><ul><ul><li>“ The chair of the Math Department” </li></ul></ul></ul><ul><li>WSD is viewed as a typical classification problem </li></ul><ul><ul><li>use machine learning techniques to train a system </li></ul></ul><ul><li>Training: </li></ul><ul><ul><li>Corpus of occurrences of the target word, each occurrence annotated with appropriate sense </li></ul></ul><ul><ul><li>Build feature vectors: </li></ul></ul><ul><ul><ul><li>a vector of relevant linguistic features that represents the context (ex: a window of words around the target word) </li></ul></ul></ul><ul><li>Disambiguation: </li></ul><ul><ul><li>Disambiguate the target word in new unseen text </li></ul></ul>
    27. 27. Targeted Word Sense Disambiguation <ul><li>Take a window of n word around the target word </li></ul><ul><li>Encode information about the words around the target word </li></ul><ul><ul><li>typical features include: words, root forms, POS tags, frequency, … </li></ul></ul><ul><ul><ul><li>An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps. </li></ul></ul></ul><ul><ul><ul><li>Surrounding context (local features) </li></ul></ul></ul><ul><ul><ul><ul><li>[ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ] </li></ul></ul></ul></ul><ul><ul><ul><li>Frequent co-occurring words (topical features) </li></ul></ul></ul><ul><ul><ul><ul><li>[ fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band ] </li></ul></ul></ul></ul><ul><ul><ul><ul><li>[0,0,0,1,0,0,0,0,0,0,1,0] </li></ul></ul></ul></ul><ul><ul><ul><li>Other features: </li></ul></ul></ul><ul><ul><ul><ul><li>[followed by &quot;player&quot;, contains &quot;show&quot; in the sentence,…] </li></ul></ul></ul></ul><ul><ul><ul><ul><li>[yes, no , … ] </li></ul></ul></ul></ul>
    28. 28. Unsupervised Disambiguation <ul><li>Disambiguate word senses: </li></ul><ul><ul><li>without supporting tools such as dictionaries and thesauri </li></ul></ul><ul><ul><li>without a labeled training text </li></ul></ul><ul><li>Without such resources, word senses are not labeled </li></ul><ul><ul><li>We cannot say “ chair/furniture” or “ chair/person” </li></ul></ul><ul><li>We can: </li></ul><ul><ul><li>Cluster/group the contexts of an ambiguous word into a number of groups </li></ul></ul><ul><ul><li>Discriminate between these groups without actually labeling them </li></ul></ul>
    29. 29. Unsupervised Disambiguation <ul><li>Hypothesis: same senses of words will have similar neighboring words </li></ul><ul><li>Disambiguation algorithm </li></ul><ul><ul><li>Identify context vectors corresponding to all occurrences of a particular word </li></ul></ul><ul><ul><li>Partition them into regions of high density </li></ul></ul><ul><ul><li>Assign a sense to each such region </li></ul></ul><ul><ul><ul><li>“Sit on a chair ” </li></ul></ul></ul><ul><ul><ul><li>“Take a seat on this chair ” </li></ul></ul></ul><ul><ul><ul><li>“The chair of the Math Department” </li></ul></ul></ul><ul><ul><ul><li>“The chair of the meeting” </li></ul></ul></ul>
    30. 30. Evaluating Word Sense Disambiguation <ul><li>Metrics: </li></ul><ul><ul><li>Precision = percentage of words that are tagged correctly, out of the words addressed by the system </li></ul></ul><ul><ul><li>Recall = percentage of words that are tagged correctly, out of all words in the test set </li></ul></ul><ul><ul><li>Example </li></ul></ul><ul><ul><ul><li>Test set of 100 words Precision = 50 / 75 = 0.66 </li></ul></ul></ul><ul><ul><ul><li>System attempts 75 words Recall = 50 / 100 = 0.50 </li></ul></ul></ul><ul><ul><ul><li>Words correctly disambiguated 50 </li></ul></ul></ul><ul><li>Special tags are possible: </li></ul><ul><ul><li>Unknown </li></ul></ul><ul><ul><li>Proper noun </li></ul></ul><ul><ul><li>Multiple senses </li></ul></ul><ul><li>Compare to a gold standard </li></ul><ul><ul><li>SEMCOR corpus, SENSEVAL corpus, … </li></ul></ul>
    31. 31. Evaluating Word Sense Disambiguation <ul><li>Difficulty in evaluation: </li></ul><ul><ul><li>Nature of the senses to distinguish has a huge impact on results </li></ul></ul><ul><li>Coarse versus fine-grained sense distinction </li></ul><ul><ul><ul><li>chair = a seat for one person, with a support for the back; &quot;he put his coat over the back of the chair and sat down“ </li></ul></ul></ul><ul><ul><ul><li>chair = the position of professor ; &quot;he was awarded an endowed chair in economics“ </li></ul></ul></ul><ul><ul><ul><li>bank = a financial institution that accepts deposits and channels the money into lending activities; &quot;he cashed a check at the bank&quot;; &quot;that bank holds the mortgage on my home&quot; </li></ul></ul></ul><ul><ul><ul><li>bank = a building in which commercial banking is transacted; &quot;the bank is on the corner of Nassau and Witherspoon“ </li></ul></ul></ul><ul><li>Sense maps </li></ul><ul><ul><li>Cluster similar senses </li></ul></ul><ul><ul><li>Allow for both fine-grained and coarse-grained evaluation </li></ul></ul>
    32. 32. Bounds on Performance <ul><li>Upper and Lower Bounds on Performance: </li></ul><ul><ul><li>Measure of how well an algorithm performs relative to the difficulty of the task. </li></ul></ul><ul><li>Upper Bound: </li></ul><ul><ul><li>Human performance </li></ul></ul><ul><ul><li>Around 97%-99% with few and clearly distinct senses </li></ul></ul><ul><ul><li>Inter-judge agreement: </li></ul></ul><ul><ul><ul><li>With words with clear & distinct senses – 95% and up </li></ul></ul></ul><ul><ul><ul><li>With polysemous words with related senses – 65% – 70% </li></ul></ul></ul><ul><li>Lower Bound (or baseline): </li></ul><ul><ul><li>The assignment of a random sense / the most frequent sense </li></ul></ul><ul><ul><ul><li>90% is excellent for a word with 2 equiprobable senses </li></ul></ul></ul><ul><ul><ul><li>90% is trivial for a word with 2 senses with probability ratios of 9 to 1 </li></ul></ul></ul>
    33. 33. References <ul><li>(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs ACL 1992 . </li></ul><ul><li>(Miller et. al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification . ARPA Workshop 1994. </li></ul><ul><li>(Miller, 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995. </li></ul><ul><li>(Senseval) Senseval evaluation exercises </li></ul>
    34. 34. Part 3: Knowledge-based Methods for Word Sense Disambiguation
    35. 35. Outline <ul><li>Task definition </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><li>Algorithms based on Machine Readable Dictionaries </li></ul><ul><li>Selectional Restrictions </li></ul><ul><li>Measures of Semantic Similarity </li></ul><ul><li>Heuristic-based Methods </li></ul>
    36. 36. Task Definition <ul><li>Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text </li></ul><ul><li>Resources </li></ul><ul><ul><li>Yes </li></ul></ul><ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul></ul><ul><ul><ul><li>Raw corpora </li></ul></ul></ul><ul><ul><li>No </li></ul></ul><ul><ul><ul><li>Manually annotated corpora </li></ul></ul></ul><ul><li>Scope </li></ul><ul><ul><li>All open-class words </li></ul></ul>
    37. 37. Machine Readable Dictionaries <ul><li>In recent years, most dictionaries made available in Machine Readable format (MRD) </li></ul><ul><ul><li>Oxford English Dictionary </li></ul></ul><ul><ul><li>Collins </li></ul></ul><ul><ul><li>Longman Dictionary of Ordinary Contemporary English (LDOCE) </li></ul></ul><ul><li>Thesauruses – add synonymy information </li></ul><ul><ul><li>Roget Thesaurus </li></ul></ul><ul><li>Semantic networks – add more semantic relations </li></ul><ul><ul><li>WordNet </li></ul></ul><ul><ul><li>EuroWordNet </li></ul></ul>
    38. 38. MRD – A Resource for Knowledge-based WSD <ul><li>For each word in the language vocabulary, an MRD provides: </li></ul><ul><ul><li>A list of meanings </li></ul></ul><ul><ul><li>Definitions (for all word meanings) </li></ul></ul><ul><ul><li>Typical usage examples (for most word meanings) </li></ul></ul><ul><li>WordNet definitions/examples for the noun plant </li></ul><ul><ul><li>buildings for carrying on industrial labor; &quot;they built a large plant to manufacture automobiles“ </li></ul></ul><ul><ul><li>a living organism lacking the power of locomotion </li></ul></ul><ul><ul><li>something planted secretly for discovery by another; &quot;the police used a plant to trick the thieves&quot;; &quot;he claimed that the evidence against him was a plant&quot; </li></ul></ul><ul><ul><li>an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience </li></ul></ul>
    39. 39. MRD – A Resource for Knowledge-based WSD <ul><li>A thesaurus adds: </li></ul><ul><ul><li>An explicit synonymy relation between word meanings </li></ul></ul><ul><li>A semantic network adds: </li></ul><ul><ul><li>Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailnment, etc. </li></ul></ul>WordNet synsets for the noun “plant” 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life” {plant, flora, plant life} hypernym: {organism, being} hypomym: {house plant}, {fungus}, … meronym: {plant tissue}, {plant part} holonym: {Plantae, kingdom Plantae, plant kingdom}
    40. 40. Outline <ul><li>Task definition </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><li>Algorithms based on Machine Readable Dictionaries </li></ul><ul><li>Selectional Restrictions </li></ul><ul><li>Measures of Semantic Similarity </li></ul><ul><li>Heuristic-based Methods </li></ul>
    41. 41. Lesk Algorithm <ul><li>(Michael Lesk 1986): Identify senses of words in context using definition overlap </li></ul><ul><li>Algorithm: </li></ul><ul><ul><li>Retrieve from MRD all sense definitions of the words to be disambiguated </li></ul></ul><ul><ul><li>Determine the definition overlap for all possible sense combinations </li></ul></ul><ul><ul><li>Choose senses that lead to highest overlap </li></ul></ul><ul><li>Example: disambiguate PINE CONE </li></ul><ul><li>PINE </li></ul><ul><ul><li>1. kinds of evergreen tree with needle-shaped leaves </li></ul></ul><ul><ul><li>2. waste away through sorrow or illness </li></ul></ul><ul><li>CONE </li></ul><ul><ul><li>1. solid body which narrows to a point </li></ul></ul><ul><ul><li>2. something of this shape whether solid or hollow </li></ul></ul><ul><ul><li>3. fruit of certain evergreen trees </li></ul></ul>Pine#1  Cone#1 = 0 Pine#2  Cone#1 = 0 Pine#1  Cone#2 = 1 Pine#2  Cone#2 = 0 Pine#1  Cone#3 = 2 Pine#2  Cone#3 = 0
    42. 42. Lesk Algorithm for More than Two Words? <ul><li>I saw a man who is 98 years old and can still walk and tell jokes </li></ul><ul><ul><li>nine open class words: see (26), man (11), year (4), old (8), can (5), still (4), walk (10), tell (8), joke (3) </li></ul></ul><ul><li>43,929,600 sense combinations! How to find the optimal sense combination? </li></ul><ul><li>Simulated annealing (Cowie, Guthrie, Guthrie 1992) </li></ul><ul><ul><li>Define a function E = combination of word senses in a given text. </li></ul></ul><ul><ul><li>Find the combination of senses that leads to highest definition overlap ( redundancy) </li></ul></ul><ul><ul><li>1. Start with E = the most frequent sense for each word </li></ul></ul><ul><ul><li>2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E </li></ul></ul><ul><ul><li>3. Stop iterating when there is no change in the configuration of senses </li></ul></ul>
    43. 43. Lesk Algorithm: A Simplified Version <ul><li>Original Lesk definition: measure overlap between sense definitions for all words in context </li></ul><ul><ul><li>Identify simultaneously the correct senses for all words in context </li></ul></ul><ul><li>Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap between sense definitions of a word and current context </li></ul><ul><ul><li>Identify the correct sense for one word at a time </li></ul></ul><ul><li>Search space significantly reduced </li></ul>
    44. 44. Lesk Algorithm: A Simplified Version <ul><li>Example: disambiguate PINE in </li></ul><ul><li>“ Pine cones hanging in a tree” </li></ul><ul><li>PINE </li></ul><ul><ul><li>1. kinds of evergreen tree with needle-shaped leaves </li></ul></ul><ul><ul><li>2. waste away through sorrow or illness </li></ul></ul>Pine#1  Sentence = 1 Pine#2  Sentence = 0 <ul><li>Algorithm for simplified Lesk: </li></ul><ul><ul><li>Retrieve from MRD all sense definitions of the word to be disambiguated </li></ul></ul><ul><ul><li>Determine the overlap between each sense definition and the current context </li></ul></ul><ul><ul><li>Choose the sense that leads to highest overlap </li></ul></ul>
    45. 45. Evaluations of Lesk Algorithm <ul><li>Initial evaluation by M. Lesk </li></ul><ul><ul><li>50-70% on short samples of text manually annotated set, with respect to Oxford Advanced Learner’s Dictionary </li></ul></ul><ul><li>Simulated annealing </li></ul><ul><ul><li>47% on 50 manually annotated sentences </li></ul></ul><ul><li>Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004) </li></ul><ul><ul><li>Original Lesk: 35% </li></ul></ul><ul><ul><li>Simplified Lesk: 47% </li></ul></ul><ul><li>Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004) </li></ul><ul><ul><li>Original Lesk: 42% </li></ul></ul><ul><ul><li>Simplified Lesk: 58% </li></ul></ul>
    46. 46. Outline <ul><li>Task definition </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><li>Algorithms based on Machine Readable Dictionaries </li></ul><ul><li>Selectional Preferences </li></ul><ul><li>Measures of Semantic Similarity </li></ul><ul><li>Heuristic-based Methods </li></ul>
    47. 47. Selectional Preferences <ul><li>A way to constrain the possible meanings of words in a given context </li></ul><ul><li>E.g. “ Wash a dish ” vs. “ Cook a dish ” </li></ul><ul><ul><li>WASH-OBJECT vs. COOK-FOOD </li></ul></ul><ul><li>Capture information about possible relations between semantic classes </li></ul><ul><ul><li>Common sense knowledge </li></ul></ul><ul><li>Alternative terminology </li></ul><ul><ul><li>Selectional Restrictions </li></ul></ul><ul><ul><li>Selectional Preferences </li></ul></ul><ul><ul><li>Selectional Constraints </li></ul></ul>
    48. 48. Acquiring Selectional Preferences <ul><li>From annotated corpora </li></ul><ul><ul><li>Circular relationship with the WSD problem </li></ul></ul><ul><ul><ul><li>Need WSD to build the annotated corpus </li></ul></ul></ul><ul><ul><ul><li>Need selectional preferences to derive WSD </li></ul></ul></ul><ul><li>From raw corpora </li></ul><ul><ul><li>Frequency counts </li></ul></ul><ul><ul><li>Information theory measures </li></ul></ul><ul><ul><li>Class-to-class relations </li></ul></ul>
    49. 49. Preliminaries: Learning Word-to-Word Relations <ul><li>An indication of the semantic fit between two words </li></ul><ul><li>1. Frequency counts </li></ul><ul><ul><li>Pairs of words connected by a syntactic relations </li></ul></ul><ul><li>2. Conditional probabilities </li></ul><ul><ul><li>Condition on one of the words </li></ul></ul>
    50. 50. Learning Selectional Preferences (1) <ul><li>Word-to-class relations (Resnik 1993) </li></ul><ul><ul><li>Quantify the contribution of a semantic class using all the concepts subsumed by that class </li></ul></ul><ul><ul><li>where </li></ul></ul>
    51. 51. Learning Selectional Preferences (2) <ul><li>Determine the contribution of a word sense based on the assumption of equal sense distributions: </li></ul><ul><ul><li>e.g. “plant” has two senses  50% occurrences are sense 1, 50% are sense 2 </li></ul></ul><ul><li>Example: learning restrictions for the verb “ to drink ” </li></ul><ul><ul><li>Find high-scoring verb-object pairs </li></ul></ul><ul><ul><li>Find “prototypical” object classes (high association score) </li></ul></ul>
    52. 52. Learning Selectional Preferences (3) <ul><li>Other algorithms </li></ul><ul><ul><li>Learn class-to-class relations (Agirre and Martinez, 2002) </li></ul></ul><ul><ul><ul><li>E.g.: “ingest food” is a class-to-class relation for “eat chicken” </li></ul></ul></ul><ul><ul><li>Bayesian networks (Ciaramita and Johnson, 2000) </li></ul></ul><ul><ul><li>Tree cut model (Li and Abe, 1998) </li></ul></ul>
    53. 53. Using Selectional Preferences for WSD <ul><li>Algorithm: </li></ul><ul><ul><li>1. Learn a large set of selectional preferences for a given syntactic relation R </li></ul></ul><ul><ul><li>2. Given a pair of words W 1 – W 2 connected by a relation R </li></ul></ul><ul><ul><li>3. Find all selectional preferences W 1 – C (word-to-class) or C 1 – C 2 (class-to-class) that apply </li></ul></ul><ul><ul><li>4. Select the meanings of W 1 and W 2 based on the selected semantic class </li></ul></ul><ul><li>Example: disambiguate coffee in “drink coffee ” </li></ul><ul><ul><li>1. (beverage) a beverage consisting of an infusion of ground coffee beans </li></ul></ul><ul><ul><li>2. (tree) any of several small trees native to the tropical Old World </li></ul></ul><ul><ul><li>3. (color) a medium to dark brown color </li></ul></ul>Given the selectional preference “DRINK BEVERAGE” : coffee#1
    54. 54. Evaluation of Selectional Preferences for WSD <ul><li>Data set </li></ul><ul><ul><li>mainly on verb-object, subject-verb relations extracted from SemCor </li></ul></ul><ul><li>Compare against random baseline </li></ul><ul><li>Results (Agirre and Martinez, 2000) </li></ul><ul><ul><li>Average results on 8 nouns </li></ul></ul><ul><ul><li>Similar figures reported in (Resnik 1997) </li></ul></ul>
    55. 55. Outline <ul><li>Task definition </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><li>Algorithms based on Machine Readable Dictionaries </li></ul><ul><li>Selectional Restrictions </li></ul><ul><li>Measures of Semantic Similarity </li></ul><ul><li>Heuristic-based Methods </li></ul>
    56. 56. Semantic Similarity <ul><li>Words in a discourse must be related in meaning, for the discourse to be coherent (Haliday and Hassan, 1976) </li></ul><ul><li>Use this property for WSD – Identify related meanings for words that share a common context </li></ul><ul><li>Context span: </li></ul><ul><li>1. Local context: semantic similarity between pairs of words </li></ul><ul><li>2. Global context: lexical chains </li></ul>
    57. 57. Semantic Similarity in a Local Context <ul><li>Similarity determined between pairs of concepts, or between a word and its surrounding context </li></ul><ul><li>Relies on similarity metrics on semantic networks </li></ul><ul><ul><li>(Rada et al. 1989) </li></ul></ul>carnivore wild dog wolf bear feline, felid canine, canid fissiped mamal, fissiped dachshund hunting dog hyena dog dingo hyena dog terrier
    58. 58. Semantic Similarity Metrics (1) <ul><li>Input: two concepts (same part of speech) </li></ul><ul><li>Output: similarity measure </li></ul><ul><li>(Leacock and Chodorow 1998) </li></ul><ul><ul><li>E.g. Similarity( wolf , dog ) = 0.60 Similarity( wolf , bear ) = 0.42 </li></ul></ul><ul><li>(Resnik 1995) </li></ul><ul><ul><li>Define information content, P(C) = probability of seeing a concept of type C in a large corpus </li></ul></ul><ul><ul><li>Probability of seeing a concept = probability of seeing instances of that concept </li></ul></ul><ul><ul><li>Determine the contribution of a word sense based on the assumption of equal sense distributions: </li></ul></ul><ul><ul><ul><li>e.g. “plant” has two senses  50% occurrences are sense 1, 50% are sense 2 </li></ul></ul></ul>, D is the taxonomy depth
    59. 59. Semantic Similarity Metrics (2) <ul><li>Similarity using information content </li></ul><ul><ul><li>(Resnik 1995) Define similarity between two concepts (LCS = Least Common Subsumer) </li></ul></ul><ul><ul><li>Alternatives (Jiang and Conrath 1997) </li></ul></ul><ul><li>Other metrics: </li></ul><ul><ul><li>Similarity using information content (Lin 1998) </li></ul></ul><ul><ul><li>Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) </li></ul></ul><ul><ul><li>Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995) </li></ul></ul><ul><ul><li>Adapted Lesk algorithm (Banerjee and Pedersen 2002) </li></ul></ul>
    60. 60. Semantic Similarity Metrics for WSD <ul><li>Disambiguate target words based on similarity with one word to the left and one word to the right </li></ul><ul><ul><li>(Patwardhan, Banerjee, Pedersen 2002) </li></ul></ul><ul><li>Evaluation: </li></ul><ul><ul><li>1,723 ambiguous nouns from Senseval-2 </li></ul></ul><ul><ul><li>Among 5 similarity metrics, (Jiang and Conrath 1997) provide the best precision (39%) </li></ul></ul><ul><li>Example: disambiguate PLANT in “plant with flowers” </li></ul><ul><li>PLANT </li></ul><ul><li>plant, works, industrial plant </li></ul><ul><li>plant, flora, plant life </li></ul><ul><li>Similarity (plant#1, flower) = 0.2 </li></ul><ul><li>Similarity (plant#2, flower) = 1.5 : plant#2 </li></ul>
    61. 61. Semantic Similarity in a Global Context <ul><li>Lexical chains (Hirst and St-Onge 1988), (Haliday and Hassan 1976) </li></ul><ul><li>“ A lexical chain is a s equence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse ” </li></ul><ul><li>Algorithm for finding lexical chains: </li></ul><ul><li>Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. </li></ul><ul><li>For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning. </li></ul><ul><li>If such a chain is found, insert the word in this chain; otherwise, create a new chain. </li></ul>
    62. 62. Semantic Similarity of a Global Context A very long train traveling along the rails with a constant velocity v in a certain direction … train #1: public transport #2: order set of things #3: piece of cloth travel #1 change location #2: undergo transportation rail #1: a barrier # 2: a bar of steel for trains #3: a small bird
    63. 63. Lexical Chains for WSD <ul><li>Identify lexical chains in a text </li></ul><ul><ul><li>Usually target one part of speech at a time </li></ul></ul><ul><li>Identify the meaning of words based on their membership to a lexical chain </li></ul><ul><li>Evaluation: </li></ul><ul><ul><li>(Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09% </li></ul></ul><ul><ul><li>(Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall </li></ul></ul><ul><ul><ul><li>lexical chains “anchored” on monosemous words </li></ul></ul></ul><ul><ul><li>(Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4% </li></ul></ul>
    64. 64. Outline <ul><li>Task definition </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><li>Algorithms based on Machine Readable Dictionaries </li></ul><ul><li>Selectional Restrictions </li></ul><ul><li>Measures of Semantic Similarity </li></ul><ul><li>Heuristic-based Methods </li></ul>
    65. 65. Most Frequent Sense (1) <ul><li>Identify the most often used meaning and use this meaning by default </li></ul><ul><li>Word meanings exhibit a Zipfian distribution </li></ul><ul><ul><li>E.g. distribution of word senses in SemCor </li></ul></ul><ul><ul><li>Example: “plant/flora” is used more often than “plant/factory” </li></ul></ul><ul><ul><li>- annotate any instance of PLANT as “plant/flora” </li></ul></ul>
    66. 66. Most Frequent Sense (2) <ul><li>Method 1: Find the most frequent sense in an annotated corpus </li></ul><ul><li>Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) </li></ul><ul><ul><li>1. Given a word w , find the top k distributionally similar words </li></ul></ul><ul><ul><li>N w = {n 1 , n 2 , …, n k }, with associated similarity scores {dss(w,n 1 ), dss(w,n 2 ), … dss(w,n k )} </li></ul></ul><ul><ul><li>2. For each sense ws i of w, identify the similarity with the words n j , using the sense of n j that maximizes this score </li></ul></ul><ul><ul><li>3. Rank senses ws i of w based on the total similarity score </li></ul></ul>
    67. 67. Most Frequent Sense(3) <ul><li>Word senses </li></ul><ul><ul><li>pipe #1 = tobacco pipe </li></ul></ul><ul><ul><li>pipe #2 = tube of metal or plastic </li></ul></ul><ul><li>Distributional similar words </li></ul><ul><ul><li>N = { tube, cable, wire, tank, hole, cylinder, fitting, tap , …} </li></ul></ul><ul><li>For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity) </li></ul><ul><ul><li>pipe#1 – tube (#3) = 0.3 </li></ul></ul><ul><ul><li>pipe#2 – tube (#1) = 0.6 </li></ul></ul><ul><li>Compute score for each sense pipe#i </li></ul><ul><ul><li>score ( pipe#1 ) = 0.25 </li></ul></ul><ul><ul><li>score ( pipe#2 ) = 0.73 </li></ul></ul><ul><li>Note : results depend on the corpus used to find distributionally </li></ul><ul><li>similar words => can find domain specific predominant senses </li></ul>
    68. 68. One Sense Per Discourse <ul><li>A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowksy 1992) </li></ul><ul><li>What does this mean? </li></ul><ul><li>Evaluation: </li></ul><ul><ul><li>8 words with two-way ambiguity, e.g. plant , crane , etc. </li></ul></ul><ul><ul><li>98% of the two-word occurrences in the same discourse carry the same meaning </li></ul></ul><ul><li>The grain of salt: Performance depends on granularity </li></ul><ul><ul><li>(Krovetz 1998) experiments with words with more than two senses </li></ul></ul><ul><ul><li>Performance of “one sense per discourse” measured on SemCor is approx. 70% </li></ul></ul><ul><ul><li>E.g. The ambiguous word PLANT occurs 10 times in a discourse </li></ul></ul><ul><ul><li>all instances of “plant” carry the same meaning </li></ul></ul>
    69. 69. One Sense per Collocation <ul><li>A word tends to preserver its meaning when used in the same collocation (Yarowsky 1993) </li></ul><ul><ul><li>Strong for adjacent collocations </li></ul></ul><ul><ul><li>Weaker as the distance between words increases </li></ul></ul><ul><li>An example </li></ul><ul><li>Evaluation: </li></ul><ul><ul><li>97% precision on words with two-way ambiguity </li></ul></ul><ul><li>Finer granularity: </li></ul><ul><ul><li>(Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses </li></ul></ul><ul><ul><li>70% precision on SemCor words </li></ul></ul><ul><ul><li>The ambiguous word PLANT preserves its meaning in all its </li></ul></ul><ul><ul><li>occurrences within the collocation “industrial plant”, regardless </li></ul></ul><ul><ul><li>of the context where this collocation occurs </li></ul></ul>
    70. 70. References <ul><li>(Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance . RANLP 1995 .   </li></ul><ul><li>(Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences . CONLL 2001. </li></ul><ul><li>  (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet . CICLING 2002. </li></ul><ul><li>(Cowie, Guthrie and Guthrie 1992), Cowie, L. and Guthrie, J. A. and Guthrie, L.: Lexical disambiguation using simulated annealing . COLING 2002. </li></ul><ul><li>(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse . DARPA workshop 1992 . </li></ul><ul><li>(Halliday and Hasan 1976) Halliday, M. and Hasan, R., (1976). Cohesion in English. Longman. </li></ul><ul><li>(Galley and McKeown 2003) Galley, M. and McKeown, K. (2003) Improving word sense disambiguation in lexical chaining. IJCAI 2003 </li></ul><ul><li>(Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malaproprisms . WordNet: An electronic lexical database , MIT Press. </li></ul><ul><li>(Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy . COLING 1997. </li></ul><ul><li>(Krovetz, 1998) Krovetz, R. More than one sense per discourse . ACL-SIGLEX 1998. </li></ul><ul><li>(Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone . SIGDOC 1986. </li></ul><ul><li>(Lin 1998) Lin, D An information theoretic definition of similarity . ICML 1998. </li></ul>
    71. 71. References <ul><li>(Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations . EMNLP 2000. </li></ul><ul><li>(Miller et. al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification . ARPA Workshop 1994. </li></ul><ul><li>(Miller, 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995. </li></ul><ul><li>(Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text . ACL 1999. </li></ul><ul><li>(Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation . FLAIRS 2000. </li></ul><ul><li>(Mihalcea, Tarau, Figa 2004) R. Mihalcea, P. Tarau, E. Figa PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004. </li></ul><ul><li>(Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S. and Banerjee, S. and Pedersen, T. Using Measures of Semantic Relatedeness for Word Sense Disambiguation . CICLING 2003. </li></ul><ul><li>(Rada et al 1989) Rada, R. and Mili, H. and Bicknell, E. and Blettner, M. Development and application of a metric on semantic nets . IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989. </li></ul><ul><li>(Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships . University of Pennsylvania 1993.   </li></ul><ul><li>(Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity . IJCAI 1995. </li></ul><ul><li>(Vasilescu, Langlais, Lapalme 2004) F. Vasilescu, P. Langlais, G. Lapalme &quot;Evaluating variants of the Lesk approach for disambiguating words”, LREC 2004. </li></ul><ul><li>(Yarowsky, 1993) Yarowsky, D. One sense per collocation . ARPA Workshop 1993. </li></ul>
    72. 72. Part 4: Supervised Methods of Word Sense Disambiguation
    73. 73. Outline <ul><li>What is Supervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Single Classifiers </li></ul><ul><ul><li>Naïve Bayesian Classifiers </li></ul></ul><ul><ul><li>Decision Lists and Trees </li></ul></ul><ul><li>Ensembles of Classifiers </li></ul>
    74. 74. What is Supervised Learning? <ul><li>Collect a set of examples that illustrate the various possible classifications or outcomes of an event. </li></ul><ul><li>Identify patterns in the examples associated with each particular class of the event. </li></ul><ul><li>Generalize those patterns into rules. </li></ul><ul><li>Apply the rules to classify a new event. </li></ul>
    75. 75. Learn from these examples : “when do I go to the store?” YES NO NO NO 4 NO NO NO YES 3 YES NO YES NO 2 NO NO YES YES 1 F3 Ate Well? F2 Slept Well? F1 Hot Outside? CLASS Go to Store? Day
    76. 76. Learn from these examples : “when do I go to the store?” YES NO NO NO 4 NO NO NO YES 3 YES NO YES NO 2 NO NO YES YES 1 F3 Ate Well? F2 Slept Well? F1 Hot Outside? CLASS Go to Store? Day
    77. 77. Outline <ul><li>What is Supervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Single Classifiers </li></ul><ul><ul><li>Naïve Bayesian Classifiers </li></ul></ul><ul><ul><li>Decision Lists and Trees </li></ul></ul><ul><li>Ensembles of Classifiers </li></ul>
    78. 78. Task Definition <ul><li>Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques. </li></ul><ul><li>Resources </li></ul><ul><ul><li>Sense Tagged Text </li></ul></ul><ul><ul><li>Dictionary (implicit source of sense inventory) </li></ul></ul><ul><ul><li>Syntactic Analysis (POS tagger, Chunker, Parser, …) </li></ul></ul><ul><li>Scope </li></ul><ul><ul><li>Typically one target word per context </li></ul></ul><ul><ul><li>Part of speech of target word resolved </li></ul></ul><ul><ul><li>Lends itself to “targeted word” formulation </li></ul></ul><ul><li>Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs </li></ul>
    79. 79. Sense Tagged Text My bank/1 charges too much for an overdraft. The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River. My grandfather planted his pole in the bank/2 and got a great big catfish! The bank/2 is pretty muddy, I can’t walk there. I went to the bank/1 to deposit my check and get a new ATM card. Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
    80. 80. Two Bags of Words (Co-occurrences in the “window of context”) RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
    81. 81. Simple Supervised Approach <ul><li>Given a sentence S containing “bank”: </li></ul><ul><li>For each word W i in S </li></ul><ul><li>If W i is in FINANCIAL_BANK_BAG then </li></ul><ul><li>Sense_1 = Sense_1 + 1; </li></ul><ul><li>If W i is in RIVER_BANK_BAG then </li></ul><ul><li>Sense_2 = Sense_2 + 1; </li></ul><ul><li>If Sense_1 > Sense_2 then print “Financial” </li></ul><ul><li>else if Sense_2 > Sense_1 then print “River” </li></ul><ul><li>else print “Can’t Decide”; </li></ul>
    82. 82. Supervised Methodology <ul><li>Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities. </li></ul><ul><ul><li>One tagged word per instance/lexical sample disambiguation </li></ul></ul><ul><li>Select a set of features with which to represent context. </li></ul><ul><ul><li>co-occurrences, collocations, POS tags, verb-obj relations, etc... </li></ul></ul><ul><li>Convert sense-tagged training instances to feature vectors. </li></ul><ul><li>Apply a machine learning algorithm to induce a classifier. </li></ul><ul><ul><li>Form – structure or relation among features </li></ul></ul><ul><ul><li>Parameters – strength of feature interactions </li></ul></ul><ul><li>Convert a held out sample of test data into feature vectors. </li></ul><ul><ul><li>“ correct” sense tags are known but not used </li></ul></ul><ul><li>Apply classifier to test instances to assign a sense tag. </li></ul>
    83. 83. From Text to Feature Vectors <ul><li>My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1) </li></ul><ul><li>The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2) </li></ul>FINANCE Y N Y N det verb det S2 SHORE N Y N Y det prep det adv S1 SENSE TAG interest river check fish P+2 P+1 P-1 P-2
    84. 84. Supervised Learning Algorithms <ul><li>Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results: </li></ul><ul><ul><li>Support Vector Machines </li></ul></ul><ul><ul><li>Nearest Neighbor Classifiers </li></ul></ul><ul><ul><li>Decision Trees </li></ul></ul><ul><ul><li>Decision Lists </li></ul></ul><ul><ul><li>Naïve Bayesian Classifiers </li></ul></ul><ul><ul><li>Perceptrons </li></ul></ul><ul><ul><li>Neural Networks </li></ul></ul><ul><ul><li>Graphical Models </li></ul></ul><ul><ul><li>Log Linear Models </li></ul></ul>
    85. 85. Outline <ul><li>What is Supervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Naïve Bayesian Classifier </li></ul><ul><li>Decision Lists and Trees </li></ul><ul><li>Ensembles of Classifiers </li></ul>
    86. 86. Naïve Bayesian Classifier <ul><li>Naïve Bayesian Classifier well known in Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997) </li></ul><ul><li>… Word Sense Disambiguation is no exception </li></ul><ul><li>Assumes conditional independence among features, given the sense of a word. </li></ul><ul><ul><li>The form of the model is assumed, but parameters are estimated from training instances </li></ul></ul><ul><li>When applied to WSD, features are often “a bag of words” that come from the training data </li></ul><ul><ul><li>Usually thousands of binary features that indicate if a word is present in the context of the target word (or not) </li></ul></ul>
    87. 87. Bayesian Inference <ul><li>Given observed features, what is most likely sense? </li></ul><ul><li>Estimate probability of observed features given sense </li></ul><ul><li>Estimate unconditional probability of sense </li></ul><ul><li>Unconditional probability of features is a normalizing term, doesn’t affect sense classification </li></ul>
    88. 88. Naïve Bayesian Model
    89. 89. The Naïve Bayesian Classifier <ul><ul><li>Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) </li></ul></ul><ul><ul><ul><li>P(S=1) = 1,500/2000 = .75 </li></ul></ul></ul><ul><ul><ul><li>P(S=2) = 500/2,000 = .25 </li></ul></ul></ul><ul><ul><li>Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. </li></ul></ul><ul><ul><ul><li>P(F1=“credit”) = 204/2000 = .102 </li></ul></ul></ul><ul><ul><ul><li>P(F1=“credit”|S=1) = 200/1,500 = .133 </li></ul></ul></ul><ul><ul><ul><li>P(F1=“credit”|S=2) = 4/500 = .008 </li></ul></ul></ul><ul><ul><li>Given a test instance that has one feature “credit” </li></ul></ul><ul><ul><ul><li>P(S=1|F1=“credit”) = .133*.75/.102 = .978 </li></ul></ul></ul><ul><ul><ul><li>P(S=2|F1=“credit”) = .008*.25/.102 = .020 </li></ul></ul></ul>
    90. 90. Comparative Results <ul><li>(Leacock, et. al. 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line… </li></ul><ul><li>(Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line … </li></ul><ul><li>(Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words… </li></ul><ul><li>… All found that Naïve Bayesian Classifier performed as well as any of the other methods! </li></ul>
    91. 91. Outline <ul><li>What is Supervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Naïve Bayesian Classifiers </li></ul><ul><li>Decision Lists and Trees </li></ul><ul><li>Ensembles of Classifiers </li></ul>
    92. 92. Decision Lists and Trees <ul><li>Very widely used in Machine Learning. </li></ul><ul><li>Decision trees used very early for WSD research (e.g., Kelly and Stone, 1975; Black, 1988). </li></ul><ul><li>Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word. </li></ul><ul><ul><li>List decides between two senses after one positive answer </li></ul></ul><ul><ul><li>Tree allows for decision among multiple senses after a series of answers </li></ul></ul><ul><li>Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes. </li></ul><ul><ul><li>More descriptive and easier to interpret. </li></ul></ul>
    93. 93. Decision List for WSD (Yarowsky, 1994) <ul><li>Identify collocational features from sense tagged data. </li></ul><ul><li>Word immediately to the left or right of target : </li></ul><ul><ul><li>I have my bank/1 statement . </li></ul></ul><ul><ul><li>The river bank/2 is muddy. </li></ul></ul><ul><li>Pair of words to immediate left or right of target : </li></ul><ul><ul><li>The world’s richest bank/1 is here in New York. </li></ul></ul><ul><ul><li>The river bank/2 is muddy. </li></ul></ul><ul><li>Words found within k positions to left or right of target, where k is often 10-50 : </li></ul><ul><ul><li>My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low. </li></ul></ul>
    94. 94. Building the Decision List <ul><li>Sort order of collocation tests using log of conditional probabilities. </li></ul><ul><li>Words most indicative of one sense (and not the other) will be ranked highly. </li></ul>
    95. 95. Computing DL score <ul><ul><li>Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) </li></ul></ul><ul><ul><ul><li>P(S=1) = 1,500/2,000 = .75 </li></ul></ul></ul><ul><ul><ul><li>P(S=2) = 500/2,000 = .25 </li></ul></ul></ul><ul><ul><li>Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. </li></ul></ul><ul><ul><ul><li>P(F1=“credit”) = 204/2,000 = .102 </li></ul></ul></ul><ul><ul><ul><li>P(F1=“credit”|S=1) = 200/1,500 = .133 </li></ul></ul></ul><ul><ul><ul><li>P(F1=“credit”|S=2) = 4/500 = .008 </li></ul></ul></ul><ul><ul><li>From Bayes Rule… </li></ul></ul><ul><ul><ul><li>P(S=1|F1=“credit”) = .133*.75/.102 = .978 </li></ul></ul></ul><ul><ul><ul><li>P(S=2|F1=“credit”) = .008*.25/.102 = .020 </li></ul></ul></ul><ul><ul><li>DL Score = abs (log (.978/.020)) = 3.89 </li></ul></ul>
    96. 96. Using the Decision List <ul><li>Sort DL-score, go through test instance looking for matching feature. First match reveals sense… </li></ul>N/A of the bank 0.00 Bank/2 river pole within bank 1.09 Bank/2 river bank is muddy 2.20 Bank/1 financial credit within bank 3.89 Sense Feature DL-score
    97. 97. Using the Decision List
    98. 98. Learning a Decision Tree <ul><li>Identify the feature that most “cleanly” divides the training data into the known senses. </li></ul><ul><ul><li>“ Cleanly” measured by information gain or gain ratio. </li></ul></ul><ul><ul><li>Create subsets of training data according to feature values. </li></ul></ul><ul><li>Find another feature that most cleanly divides a subset of the training data. </li></ul><ul><li>Continue until each subset of training data is “pure” or as clean as possible. </li></ul><ul><li>Well known decision tree learning algorithms include ID3 and C4.5 (Quillian, 1986, 1993) </li></ul><ul><li>In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for English Lexical Sample task (Yarowsky, 2000) </li></ul>
    99. 99. Supervised WSD with Individual Classifiers <ul><li>Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, most work reasonably well. </li></ul><ul><ul><li>(Witten and Frank, 2000) is a great intro. to supervised learning. </li></ul></ul><ul><li>Features tend to differentiate among methods more than the learning algorithms. </li></ul><ul><li>Good sets of features tend to include: </li></ul><ul><ul><li>Co-occurrences or keywords (global) </li></ul></ul><ul><ul><li>Collocations (local) </li></ul></ul><ul><ul><li>Bigrams (local and global) </li></ul></ul><ul><ul><li>Part of speech (local) </li></ul></ul><ul><ul><li>Predicate-argument relations </li></ul></ul><ul><ul><ul><li>Verb-object, subject-verb, </li></ul></ul></ul><ul><ul><li>Heads of Noun and Verb Phrases </li></ul></ul>
    100. 100. Convergence of Results <ul><li>Accuracy of different systems applied to the same data tends to converge on a particular value, no one system shockingly better than another. </li></ul><ul><ul><li>Senseval-1, a number of systems in range of 74-78% accuracy for English Lexical Sample task. </li></ul></ul><ul><ul><li>Senseval-2, a number of systems in range of 61-64% accuracy for English Lexical Sample task. </li></ul></ul><ul><ul><li>Senseval-3, a number of systems in range of 70-73% accuracy for English Lexical Sample task… </li></ul></ul><ul><li>What to do next? </li></ul>
    101. 101. Outline <ul><li>What is Supervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Naïve Bayesian Classifiers </li></ul><ul><li>Decision Lists and Trees </li></ul><ul><li>Ensembles of Classifiers </li></ul>
    102. 102. Ensembles of Classifiers <ul><li>Classifier error has two components (Bias and Variance) </li></ul><ul><ul><li>Some algorithms (e.g., decision trees) try and build a representation of the training data – Low Bias/High Variance </li></ul></ul><ul><ul><li>Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias/Low Variance </li></ul></ul><ul><li>Combining classifiers with different bias variance characteristics can lead to improved overall accuracy </li></ul><ul><li>“ Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996) </li></ul><ul><ul><li>Sample with replacement from the training data to learn multiple decision trees. </li></ul></ul><ul><ul><li>Outliers in training data will tend to be obscured/eliminated. </li></ul></ul>
    103. 103. Ensemble Considerations <ul><li>Must choose different learning algorithms with significantly different bias/variance characteristics. </li></ul><ul><ul><li>Naïve Bayesian Classifier versus Decision Tree </li></ul></ul><ul><li>Must choose feature representations that yield significantly different (independent?) views of the training data. </li></ul><ul><ul><li>Lexical versus syntactic features </li></ul></ul><ul><li>Must choose how to combine classifiers. </li></ul><ul><ul><li>Simple Majority Voting </li></ul></ul><ul><ul><li>Averaging of probabilities across multiple classifier output </li></ul></ul><ul><ul><li>Maximum Entropy combination (e.g., Klein, et. al., 2002) </li></ul></ul>
    104. 104. Ensemble Results <ul><li>(Pedersen, 2000) achieved state of art for interest and line data using ensemble of Naïve Bayesian Classifiers. </li></ul><ul><ul><li>Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words. </li></ul></ul><ul><ul><li>Classifiers combined by a weighted vote </li></ul></ul><ul><li>(Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using combination of six classifiers. </li></ul><ul><ul><li>Rich set of collocational and syntactic features. </li></ul></ul><ul><ul><li>Combined via linear combination of top three classifiers. </li></ul></ul><ul><li>Many Senseval-2 and Senseval-3 systems employed ensemble methods. </li></ul>
    105. 105. References <ul><li>(Black, 1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pg. 185-194. </li></ul><ul><li>(Breiman, 1996) The heuristics of instability in model selection. Annals of Statistics (24) pg. 2350-2383. </li></ul><ul><li>(Domingos and Pazzani, 1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning (29) pg. 103-130. </li></ul><ul><li>(Domingos, 2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. Pg. 564-569. </li></ul><ul><li>(Florian an dYarowsky, 2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP, pp 25-32. </li></ul><ul><li>(Kelly and Stone, 1975). Computer Recognition of English Word Senses, North Holland Publishing Co., Amsterdam. </li></ul><ul><li>(Klein, et. al., 2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation, Proceedings of Senseval-2. pg. 87-89. </li></ul><ul><li>(Leacock, et. al. 1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pg. 260-265. </li></ul><ul><li>(Mooney, 1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP. pg. 82-91. </li></ul>
    106. 106. References <ul><li>(Pedersen, 1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University. </li></ul><ul><li>(Pedersen, 2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL. </li></ul><ul><li>(Quillian, 1986). Induction of Decision Trees. Machine Learning (1). pg. 81-106. </li></ul><ul><li>(Quillian, 1993). C4.5 Programs for Machine Learning. San Francisco, Morgan Kaufmann. </li></ul><ul><li>(Witten and Frank, 2000). Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan-Kaufmann. San Francisco. </li></ul><ul><li>(Yarowsky, 1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL. pp. 88-95. </li></ul><ul><li>(Yarowsky, 2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34. </li></ul>
    107. 107. Part 5: Minimally Supervised Methods for Word Sense Disambiguation
    108. 108. Outline <ul><li>Task definition </li></ul><ul><ul><li>What does “minimally” supervised mean? </li></ul></ul><ul><li>Bootstrapping algorithms </li></ul><ul><ul><li>Co-training </li></ul></ul><ul><ul><li>Self-training </li></ul></ul><ul><ul><li>Yarowsky algorithm </li></ul></ul><ul><li>Using the Web for Word Sense Disambiguation </li></ul><ul><ul><li>Web as a corpus </li></ul></ul><ul><ul><li>Web as collective mind </li></ul></ul>
    109. 109. Task Definition <ul><li>Supervised WSD = learning sense classifiers starting with annotated data </li></ul><ul><li>Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision </li></ul><ul><li>Examples </li></ul><ul><ul><li>Automatically bootstrap a corpus starting with a few human annotated examples </li></ul></ul><ul><ul><li>Use monosemous relatives / dictionary definitions to automatically construct sense tagged data </li></ul></ul><ul><ul><li>Rely on Web-users + active learning for corpus annotation </li></ul></ul>
    110. 110. Outline <ul><li>Task definition </li></ul><ul><ul><li>What does “minimally” supervised mean? </li></ul></ul><ul><li>Bootstrapping algorithms </li></ul><ul><ul><li>Co-training </li></ul></ul><ul><ul><li>Self-training </li></ul></ul><ul><ul><li>Yarowsky algorithm </li></ul></ul><ul><li>Using the Web for Word Sense Disambiguation </li></ul><ul><ul><li>Web as a corpus </li></ul></ul><ul><ul><li>Web as collective mind </li></ul></ul>
    111. 111. Bootstrapping WSD Classifiers <ul><li>Build sense classifiers with little training data </li></ul><ul><ul><li>Expand applicability of supervised WSD </li></ul></ul><ul><li>Bootstrapping approaches </li></ul><ul><ul><li>Co-training </li></ul></ul><ul><ul><li>Self-training </li></ul></ul><ul><ul><li>Yarowsky algorithm </li></ul></ul>
    112. 112. Bootstrapping Recipe <ul><li>Ingredients </li></ul><ul><ul><li>(Some) labeled data </li></ul></ul><ul><ul><li>(Large amounts of) unlabeled data </li></ul></ul><ul><ul><li>(One or more) basic classifiers </li></ul></ul><ul><li>Output </li></ul><ul><ul><li>Classifier that improves over the basic classifiers </li></ul></ul>
    113. 113. … plants#1 and animals … … industry plant#2 … … building the only atomic plant … … plant growth is retarded … … a herb or flowering plant … … a nuclear power plant … … building a new vehicle plant … … the animal and plant life … … the passion-fruit plant … Classifier 1 Classifier 2 … plant#1 growth is retarded … … a nuclear power plant#2 …
    114. 114. Co-training / Self-training <ul><li>1. Create a pool of examples U' </li></ul><ul><ul><li>choose P random examples from U </li></ul></ul><ul><li>2. Loop for I iterations </li></ul><ul><ul><li>Train C i on L and label U' </li></ul></ul><ul><ul><li>Select G most confident examples and add to L </li></ul></ul><ul><ul><ul><li>maintain distribution in L </li></ul></ul></ul><ul><ul><li>Refill U' with examples from U </li></ul></ul><ul><ul><ul><li>keep U' at constant size P </li></ul></ul></ul><ul><ul><li>A set L of labeled training examples </li></ul></ul><ul><ul><li>A set U of unlabeled examples </li></ul></ul><ul><ul><li>Classifiers C i </li></ul></ul>
    115. 115. <ul><li>(Blum and Mitchell 1998) </li></ul><ul><li>Two classifiers </li></ul><ul><ul><li>independent views </li></ul></ul><ul><ul><li>[independence condition can be relaxed] </li></ul></ul><ul><li>Co-training in Natural Language Learning </li></ul><ul><ul><li>Statistical parsing (Sarkar 2001) </li></ul></ul><ul><ul><li>Co-reference resolution (Ng and Cardie 2003) </li></ul></ul><ul><ul><li>Part of speech tagging (Clark, Curran and Osborne 2003) </li></ul></ul><ul><ul><li>... </li></ul></ul>Co-training
    116. 116. Self-training <ul><li>(Nigam and Ghani 2000) </li></ul><ul><li>One single classifier </li></ul><ul><li>Retrain on its own output </li></ul><ul><li>Self-training for Natural Language Learning </li></ul><ul><ul><li>Part of speech tagging (Clark, Curran and Osborne 2003) </li></ul></ul><ul><ul><li>Co-reference resolution (Ng and Cardie 2003) </li></ul></ul><ul><ul><ul><li>several classifiers through bagging </li></ul></ul></ul>
    117. 117. Parameter Setting for Co-training/Self-training <ul><li>1. Create a pool of examples U' </li></ul><ul><ul><li>choose P random examples from U </li></ul></ul><ul><li>2. Loop for I iterations </li></ul><ul><ul><li>Train C i on L and label U' </li></ul></ul><ul><ul><li>Select G most confident examples and add to L </li></ul></ul><ul><ul><ul><li>maintain distribution in L </li></ul></ul></ul><ul><ul><li>Refill U' with examples from U </li></ul></ul><ul><ul><ul><li>keep U' at constant size P </li></ul></ul></ul><ul><li>A major drawback of bootstrapping </li></ul><ul><ul><li>“ No principled method for selecting optimal values for these </li></ul></ul><ul><ul><li>parameters” (Ng and Cardie 2003) </li></ul></ul>
    118. 118. Experiments with Co-training / Self-training for WSD <ul><li>Training / Test data </li></ul><ul><ul><li>Senseval-2 nouns (29 ambiguous nouns) </li></ul></ul><ul><ul><li>Average corpus size: 95 training examples, 48 test examples </li></ul></ul><ul><li>Raw data </li></ul><ul><ul><li>British National Corpus </li></ul></ul><ul><ul><li>Average corpus size: 7,085 examples </li></ul></ul><ul><li>Co-training </li></ul><ul><ul><li>Two classifiers: local and topical classifiers </li></ul></ul><ul><li>Self-training </li></ul><ul><ul><li>One classifier: global classifier </li></ul></ul><ul><li>(Mihalcea 2004) </li></ul>
    119. 119. Parameter Settings <ul><li>Parameter ranges </li></ul><ul><ul><li>P = {1, 100, 500, 1000, 1500, 2000, 5000} </li></ul></ul><ul><ul><li>G = {1, 10, 20, 30, 40, 50, 100, 150, 200} </li></ul></ul><ul><ul><li>I = {1, ..., 40} </li></ul></ul><ul><li>29 nouns -> 120,000 runs </li></ul><ul><li>Upper bound in co-training/self-training performance </li></ul><ul><ul><li>Optimised on test set </li></ul></ul><ul><ul><li>Basic classifier: 53.84% </li></ul></ul><ul><ul><li>Optimal self-training: 65.61% </li></ul></ul><ul><ul><li>Optimal co-training: 65.75% </li></ul></ul><ul><ul><li>~ 25% error reduction </li></ul></ul><ul><li>Per-word parameter setting: </li></ul><ul><ul><li>Co-training = 51.73% </li></ul></ul><ul><ul><li>Self-training = 52.88% </li></ul></ul><ul><li>Global parameter setting </li></ul><ul><ul><li>Co-training = 55.67% </li></ul></ul><ul><ul><li>Self-training = 54.16% </li></ul></ul><ul><li>Example: lady </li></ul><ul><ul><li>basic = 61.53% </li></ul></ul><ul><ul><li>self-training = 84.61% [20/100/39] </li></ul></ul><ul><ul><li>co-training = 82.05% [1/1000/3] </li></ul></ul>
    120. 120. Yarowsky Algorithm <ul><li>(Yarowsky 1995) </li></ul><ul><li>Similar to co-training </li></ul><ul><li>Differs in the basic assumption (Abney 2002) </li></ul><ul><ul><li>“view independence” (co-training) vs. “precision independence” (Yarowsky algorithm) </li></ul></ul><ul><li>Relies on two heuristics and a decision list </li></ul><ul><ul><li>One sense per collocation : </li></ul></ul><ul><ul><ul><li>Nearby words provide strong and consistent clues as to the sense of a target word </li></ul></ul></ul><ul><ul><li>One sense per discourse : </li></ul></ul><ul><ul><ul><li>The sense of a target word is highly consistent within a single document </li></ul></ul></ul>
    121. 121. Learning Algorithm <ul><li>A decision list is used to classify instances of target word : </li></ul><ul><ul><li>“ the loss of animal and plant species through extinction …” </li></ul></ul><ul><li>Classification is based on the highest ranking rule that matches the target context </li></ul>… ... ... <ul><li>A (living) </li></ul>plant species 9.02 <ul><li>A (living) </li></ul>fruit (within +/- k words) 9.03 <ul><li>B (factory) </li></ul>job (within +/- k words) 9.24 <ul><li>A (living) </li></ul>flower (within +/- k words) 9.31 … … … Sense Collocation LogL
    122. 122. Bootstrapping Algorithm <ul><li>All occurrences of the target word are identified </li></ul><ul><li>A small training set of seed data is tagged with word sense </li></ul>Sense-B: factory Sense-A: life
    123. 123. Bootstrapping Algorithm <ul><li>Seed set grows and residual set shrinks …. </li></ul>
    124. 124. Bootstrapping Algorithm <ul><li>Convergence: Stop when residual set stabilizes </li></ul>
    125. 125. Bootstrapping Algorithm <ul><li>Iterative procedure: </li></ul><ul><ul><li>Train decision list algorithm on seed set </li></ul></ul><ul><ul><li>Classify residual data with decision list </li></ul></ul><ul><ul><li>Create new seed set by identifying samples that are tagged with a probability above a certain threshold </li></ul></ul><ul><ul><li>Retrain classifier on new seed set </li></ul></ul><ul><li>Selecting training seeds </li></ul><ul><ul><li>Initial training set should accurately distinguish among possible senses </li></ul></ul><ul><ul><li>Strategies: </li></ul></ul><ul><ul><ul><li>Select a single, defining seed collocation for each possible sense. </li></ul></ul></ul><ul><ul><ul><li>Ex: “ life ” and “ manufacturing ” for target plant </li></ul></ul></ul><ul><ul><ul><li>Use words from dictionary definitions </li></ul></ul></ul><ul><ul><ul><li>Hand-label most frequent collocates </li></ul></ul></ul>
    126. 126. Evaluation <ul><li>Test corpus: extracted from 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.) </li></ul><ul><li>Performance of multiple models compared with: </li></ul><ul><ul><li>supervised decision lists </li></ul></ul><ul><ul><li>unsupervised learning algorithm of Sch ü tze (1992), based on alignment of clusters with word senses </li></ul></ul>Unsupervised Bootstrapping 96.5 92.2 96.1 - Avg. … - … … … Supervised 97.9 92 98.0 legal/physical motion 96.5 95 97.1 vehicle/container tank 93.6 90 93.9 volume/outer space 98.6 92 97.7 living/factory plant Unsupervised Sch ü tze Senses Word
    127. 127. Outline <ul><li>Task definition </li></ul><ul><ul><li>What does “minimally” supervised mean? </li></ul></ul><ul><li>Bootstrapping algorithms </li></ul><ul><ul><li>Co-training </li></ul></ul><ul><ul><li>Self-training </li></ul></ul><ul><ul><li>Yarowsky algorithm </li></ul></ul><ul><li>Using the Web for Word Sense Disambiguation </li></ul><ul><ul><li>Web as a corpus </li></ul></ul><ul><ul><li>Web as collective mind </li></ul></ul>
    128. 128. The Web as a Corpus <ul><li>Use the Web as a large textual corpus </li></ul><ul><ul><li>Build annotated corpora using monosemous relatives </li></ul></ul><ul><ul><li>Bootstrap annotated corpora starting with few seeds </li></ul></ul><ul><ul><ul><li>Similar to (Yarowsky 1995) </li></ul></ul></ul><ul><li>Use the (semi)automatically tagged data to train WSD classifiers </li></ul>
    129. 129. Monosemous Relatives <ul><li>Idea : determine a phrase (SP) which uniquely identifies the sense of a word (W#i) </li></ul><ul><ul><li>Determine one or more Search Phrases from a machine readable dictionary using several heuristics </li></ul></ul><ul><ul><li>Search the Web using the Search Phrases from step 1. </li></ul></ul><ul><ul><li>Replace the Search Phrases in the examples gathered at 2 with W#i. </li></ul></ul><ul><li>Output: sense annotated corpus for the word sense W#i </li></ul>As a pastime , she enjoyed reading. Evaluate the interestingness of the website. As an interest , she enjoyed reading. Evaluate the interest of the website.
    130. 130. Heuristics to Identify Monosemous Relatives <ul><li>Synonyms </li></ul><ul><ul><li>Determine a monosemous synonym </li></ul></ul><ul><ul><li>remember#1 has recollect as monosemous synonym  SP=recollect </li></ul></ul><ul><li>Dictionary definitions (1) </li></ul><ul><ul><li>Parse the gloss and determine the set of single phrase definitions </li></ul></ul><ul><ul><li>produce#5 has the definition “bring onto the market or release”  2 definitions: “bring onto the market” and “release” </li></ul></ul><ul><ul><li>eliminate “release” as being ambiguous  SP=bring onto the market </li></ul></ul><ul><li>Dictionary defintions (2) </li></ul><ul><ul><li>Parse the gloss and determine the set of single phrase definitions </li></ul></ul><ul><ul><li>Replace the stop words with the NEAR operator </li></ul></ul><ul><ul><li>Strengthen the query: concatenate the words from the current synset using the AND operator </li></ul></ul><ul><ul><li>produce#6 has the synset {grow, raise, farm, produce} and the definition “cultivate by growing”  </li></ul></ul><ul><ul><li>SP=cultivate NEAR growing AND (grow OR raise OR farm OR produce) </li></ul></ul>
    131. 131. <ul><li>Dictionary definitions (3) </li></ul><ul><ul><li>Parse the gloss and determine the set of single phrase definitions </li></ul></ul><ul><ul><li>Keep only the head phrase </li></ul></ul><ul><ul><li>Strengthen the query: concatenate the words from the current synset using the AND operator </li></ul></ul><ul><ul><li>company#5 has the synset {party,company} and the definition “band of people associated in some activity”  </li></ul></ul><ul><ul><li>SP=band of people AND (company OR party) </li></ul></ul>Heuristics to Identify Monosemous Relatives
    132. 132. Example <ul><li>Building annotated corpora for the noun interest . </li></ul>
    133. 133. Example <ul><li>Gather 5,404 examples </li></ul><ul><li>Check the first 70 examples  67 correct; 95.7% accuracy. </li></ul><ul><li>1. I appreciate the genuine interest#1 which motivated you to write your message. </li></ul><ul><li>2. The webmaster of this site warrants neither accuracy, nor interest#2 . </li></ul><ul><li>3. He forgives us not only for our interest#3 , but for his own. </li></ul><ul><li>4. Interest#4 coverage, including rents, was 3.6x </li></ul><ul><li>5. As an interest#5 , she enjoyed gardening and taking part into church activities. </li></ul><ul><li>6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand. </li></ul><ul><li>7. The Adam Smith Society is a new interest#7 organized within the APA. </li></ul>
    134. 134. Experimental Evaluation <ul><li>Tests on 20 words </li></ul><ul><ul><li>7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings) </li></ul></ul><ul><ul><li>manually check the first 10 examples of each sense of a word </li></ul></ul><ul><ul><li>=> 91% accuracy (Mihalcea 1999) </li></ul></ul>
    135. 135. Web-based Bootstrapping <ul><li>Similar to Yarowsky algorithm </li></ul><ul><li>Relies on data gathered from the Web </li></ul><ul><li>1. Create a set of seeds (phrases) consisting of: </li></ul><ul><ul><li>Sense tagged examples in SemCor </li></ul></ul><ul><ul><li>Sense tagged examples from WordNet </li></ul></ul><ul><ul><li>Additional sense tagged examples, if available </li></ul></ul><ul><li>Phrase? </li></ul><ul><ul><li>At least two open class words; </li></ul></ul><ul><ul><li>Words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.) </li></ul></ul><ul><li>2. Search the Web using queries formed with the seed expressions found at Step 1 </li></ul><ul><ul><li>Add to the generated corpus of maximum of N text passages </li></ul></ul><ul><ul><li>Results competitive with manually tagged corpora (Mihalcea 2002) </li></ul></ul>
    136. 136. The Web as Collective Mind <ul><li>Two different views of the Web: </li></ul><ul><ul><li>collection of Web pages </li></ul></ul><ul><ul><li>very large group of Web users </li></ul></ul><ul><li>Millions of Web users can contribute their knowledge to a data repository </li></ul><ul><li>Open Mind Word Expert (Chklovski and Mihalcea, 2002) </li></ul><ul><li>Fast growing rate: </li></ul><ul><ul><li>Started in April 2002 </li></ul></ul><ul><ul><li>Currently more than 100,000 examples of noun senses in several languages </li></ul></ul>
    137. 137. OMWE online
    138. 138. Open Mind Word Expert: Quantity and Quality <ul><li>Data </li></ul><ul><ul><li>A mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus </li></ul></ul><ul><li>Word senses </li></ul><ul><ul><li>Based on WordNet definitions </li></ul></ul><ul><li>Active learning to select the most informative examples for learning </li></ul><ul><ul><li>Use two classifiers trained on existing annotated data </li></ul></ul><ul><ul><li>Select items where the two classifiers disagree for human annotation </li></ul></ul><ul><li>Quality: </li></ul><ul><ul><li>Two tags per item </li></ul></ul><ul><ul><li>One tag per item per contributor </li></ul></ul><ul><li>Evaluations: </li></ul><ul><ul><li>Agreement rates of about 65% - comparable to the agreements rates obtained when collecting data for Senseval-2 with trained lexicographers </li></ul></ul><ul><ul><li>Replicability: tests on 1,600 examples of “interest” led to 90%+ replicability </li></ul></ul>
    139. 139. References <ul><li>(Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002. </li></ul><ul><li>(Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training . Proceedings of COLT 1998. </li></ul><ul><li>(Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert . Proceedings of ACL 2002 workshop on WSD. </li></ul><ul><li>(Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003. </li></ul><ul><li>(Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora . Proceedings of AAAI 1999. </li></ul><ul><li>(Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora . Proceedings of LREC 2002. </li></ul><ul><li>(Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004. </li></ul><ul><li>(Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003. </li></ul><ul><li>(Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000. </li></ul><ul><li>(Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing . Proceedings of NAACL 2001. </li></ul><ul><li>(Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods . Proceedings of ACL 1995. </li></ul>
    140. 140. Part 6: Unsupervised Methods of Word Sense Discrimination
    141. 141. Outline <ul><li>What is Unsupervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Agglomerative Clustering </li></ul><ul><li>LSI/LSA </li></ul><ul><li>Sense Discrimination Using Parallel Texts </li></ul>
    142. 142. What is Unsupervised Learning? <ul><li>Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources </li></ul><ul><li>These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than any other </li></ul><ul><li>Note! If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning </li></ul><ul><ul><li>Supervised Classification identifies features that trigger a sense tag </li></ul></ul><ul><ul><li>Unsupervised Clustering finds similarity between contexts </li></ul></ul>
    143. 143. Cluster this Data! Facts about my day… YES NO NO 4 NO NO NO 3 YES NO YES 2 NO NO YES 1 F3 Ate Well? F2 Slept Well? F1 Hot Outside? Day
    144. 144. Cluster this Data! Facts about my day… YES NO NO 4 NO NO NO 3 YES NO YES 2 NO NO YES 1 F3 Ate Well? F2 Slept Well? F1 Hot Outside? Day
    145. 145. Cluster this Data! YES NO NO 4 NO NO NO 3 YES NO YES 2 NO NO YES 1 F3 Ate Well? F2 Slept Well? F1 Hot Outside? Day
    146. 146. Outline <ul><li>What is Unsupervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Agglomerative Clustering </li></ul><ul><li>LSI/LSA </li></ul><ul><li>Sense Discrimination Using Parallel Texts </li></ul>
    147. 147. Task Definition <ul><li>Unsupervised Word Sense Discrimination: A class of methods that cluster words based on similarity of context </li></ul><ul><li>Strong Contextual Hypothesis </li></ul><ul><ul><li>(Miller and Charles, 1991): Words with similar meanings tend to occur in similar contexts </li></ul></ul><ul><ul><li>(Firth, 1957): “You shall know a word by the company it keeps.” </li></ul></ul><ul><ul><ul><li>…words that keep the same company tend to have similar meanings </li></ul></ul></ul><ul><li>Only use the information available in raw text, do not use outside knowledge sources or manual annotations </li></ul><ul><li>No knowledge of existing sense inventories, so clusters are not labeled with senses </li></ul>
    148. 148. Task Definition <ul><li>Resources: </li></ul><ul><ul><li>Large Corpora </li></ul></ul><ul><li>Scope: </li></ul><ul><ul><li>Typically one targeted word per context to be discriminated </li></ul></ul><ul><ul><li>Equivalently, measure similarity among contexts </li></ul></ul><ul><ul><li>Features may be identified in separate “training” data, or in the data to be clustered </li></ul></ul><ul><ul><li>Does not assign senses or labels to clusters </li></ul></ul><ul><li>Word Sense Discrimination reduces to the problem of finding the targeted words that occur in the most similar contexts and placing them in a cluster </li></ul>
    149. 149. Outline <ul><li>What is Unsupervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Agglomerative Clustering </li></ul><ul><li>LSI/LSA </li></ul><ul><li>Sense Discrimination Using Parallel Texts </li></ul>
    150. 150. Agglomerative Clustering <ul><li>Create a similarity matrix of instances to be discriminated </li></ul><ul><ul><li>Results in a symmetric “instance by instance” matrix, where each cell contains the similarity score between a pair of instances </li></ul></ul><ul><ul><li>Typically a first order representation, where similarity is based on the features observed in the pair of instances </li></ul></ul><ul><li>Apply Agglomerative Clustering algorithm to matrix </li></ul><ul><ul><li>To start, each instance is its own cluster </li></ul></ul><ul><ul><li>Form a cluster from the most similar pair of instances </li></ul></ul><ul><ul><li>Repeat until the desired number of clusters is obtained </li></ul></ul><ul><li>Advantages : high quality clustering </li></ul><ul><li>Disadvantages – computationally expensive, must carry out exhaustive pair wise comparisons </li></ul>
    151. 151. Measuring Similarity <ul><li>Integer Values </li></ul><ul><ul><li>Matching Coefficient </li></ul></ul><ul><ul><li>Jaccard Coefficient </li></ul></ul><ul><ul><li>Dice Coefficient </li></ul></ul><ul><li>Real Values </li></ul><ul><ul><li>Cosine </li></ul></ul>
    152. 152. Instances to be Clustered N N N Y det verb adj det S3 Y N Y N det prep det S2 N N N N noun noun noun det S4 N Y N Y det prep det adv S1 interest river check fish P+2 P+1 P-1 P-2 1 2 4 S3 1 2 4 S3 0 2 S4 0 3 S2 2 3 S1 S4 S2 S1
    153. 153. Average Link Clustering aka McQuitty’s Similarity Analysis 1 2 4 S3 1 2 4 S3 0 2 S4 0 3 S2 2 3 S1 S4 S2 S1 0 S4 0 S2 S1S3 S4 S2 S1S3 S4 S1S3S2 S4 S1S3S2
    154. 154. Evaluation of Unsupervised Methods <ul><li>If Sense tagged text is available, can be used for evaluation </li></ul><ul><ul><li>But don’t use sense tags for clustering or feature selection! </li></ul></ul><ul><li>Assume that sense tags represent “true” clusters, and compare these to discovered clusters </li></ul><ul><ul><li>Find mapping of clusters to senses that attains maximum accuracy </li></ul></ul><ul><li>Pseudo words are especially useful, since it is hard to find data that is discriminated </li></ul><ul><ul><li>Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate. </li></ul></ul><ul><ul><li> </li></ul></ul><ul><li>Baseline Algorithm– group all instances into one cluster, this will reach “accuracy” equal to majority classifier </li></ul>
    155. 155. Baseline Performance <ul><li>(0+0+55)/170 = .32 (0+0+80)/170 = .47 if C3 is S3 if C3 is S1 </li></ul>170 55 35 80 Totals 170 55 35 80 C3 0 0 0 0 C2 0 0 0 0 C1 Totals S3 S2 S1 170 80 35 55 Totals 170 80 35 55 C3 0 0 0 0 C2 0 0 0 0 C1 Totals S1 S2 S3
    156. 156. Evaluation <ul><li>Suppose that C1 is labeled S1, C2 as S2, and C3 as S3 </li></ul><ul><li>Accuracy = (10 + 0 + 10) / 170 = 12% </li></ul><ul><li>Diagonal shows how many members of the cluster actually belong to the sense given on the column </li></ul><ul><li>Can the “columns” be rearranged to improve the overall accuracy? </li></ul><ul><ul><li>Optimally assign clusters to senses </li></ul></ul>170 55 35 80 Totals 65 10 5 50 C3 60 40 0 20 C2 45 5 30 10 C1 Totals S3 S2 S1
    157. 157. Evaluation <ul><li>The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71% </li></ul><ul><li>Find the ordering of the columns in the matrix that maximizes the sum of the diagonal. </li></ul><ul><li>This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory. </li></ul>170 80 55 35 Totals 65 50 10 5 C3 60 20 40 0 C2 45 10 5 30 C1 Totals S1 S3 S2
    158. 158. Agglomerative Approach <ul><li>(Pedersen and Bruce, 1997) explore discrimination with a small number (approx 30) of features near target word. </li></ul><ul><ul><li>Morphological form of target word (1) </li></ul></ul><ul><ul><li>Part of Speech two words to left and right of target word (4) </li></ul></ul><ul><ul><li>Co-occurrences (3) most frequent content words in context </li></ul></ul><ul><ul><li>Unrestricted collocations (19) most frequent words located one position to left or right of target, OR </li></ul></ul><ul><ul><li>Content collocations (19) most frequent content words located one position to left or right of target </li></ul></ul><ul><li>Features identified in the instances be clustered </li></ul><ul><li>Similarity measured by matching coefficient </li></ul><ul><li>Clustered with McQuitty’s Similarity Analysis, Ward’s Method, and the EM Algorithm </li></ul><ul><ul><li>Found that McQuitty’s method was the most accurate </li></ul></ul>
    159. 159. Experimental Evaluation <ul><li>Adjectives </li></ul><ul><ul><li>Chief, 86% majority (1048) </li></ul></ul><ul><ul><li>Common, 84% majority (1060) </li></ul></ul><ul><ul><li>Last, 94% majority (3004) </li></ul></ul><ul><ul><li>Public, 68% majority (715) </li></ul></ul><ul><li>Nouns </li></ul><ul><ul><li>Bill, 68% majority (1341) </li></ul></ul><ul><ul><li>Concern, 64% majority (1235) </li></ul></ul><ul><ul><li>Drug, 57% majority (1127) </li></ul></ul><ul><ul><li>Interest, 59% majority (2113) </li></ul></ul><ul><ul><li>Line, 37% majority (1149) </li></ul></ul><ul><li>Verbs </li></ul><ul><ul><li>Agree, 74% majority (1109) </li></ul></ul><ul><ul><li>Close, 77% majority (1354) </li></ul></ul><ul><ul><li>Help, 78% majority (1267) </li></ul></ul><ul><ul><li>Include, 91% majority (1526) </li></ul></ul><ul><li>Adjectives </li></ul><ul><ul><li>Chief, 86% </li></ul></ul><ul><ul><li>Common, 80% </li></ul></ul><ul><ul><li>Last, 79% </li></ul></ul><ul><ul><li>Public, 63% </li></ul></ul><ul><li>Nouns </li></ul><ul><ul><li>Bill, 75% </li></ul></ul><ul><ul><li>Concern, 68% </li></ul></ul><ul><ul><li>Drug, 65% </li></ul></ul><ul><ul><li>Interest, 65% </li></ul></ul><ul><ul><li>Line, 42% </li></ul></ul><ul><li>Verbs </li></ul><ul><ul><li>Agree, 69% </li></ul></ul><ul><ul><li>Close, 72% </li></ul></ul><ul><ul><li>Help, 70% </li></ul></ul><ul><ul><li>Include, 77% </li></ul></ul>
    160. 160. Analysis <ul><li>Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning </li></ul><ul><li>Evaluation based on assuming that sense tags represent the “true” cluster are likely a bit harsh. Alternatives? </li></ul><ul><ul><li>Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share </li></ul></ul><ul><ul><li>Use the contents of the cluster to generate a descriptive label that could be inspected by a human </li></ul></ul><ul><li>First order feature sets may be problematic with smaller amounts of data since these features must occur exactly in the test instances in order to be “matched” </li></ul>
    161. 161. Outline <ul><li>What is Unsupervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Agglomerative Clustering </li></ul><ul><li>LSI/LSA </li></ul><ul><li>Sense Discrimination Using Parallel Texts </li></ul>
    162. 162. Latent Semantic Indexing/Analysis <ul><li>Adapted by (Sch ü tze, 1998) to word sense discrimination </li></ul><ul><li>Represent training data as word co-occurrence matrix </li></ul><ul><li>Reduce the dimensionality of the co-occurrence matrix via Singular Value Decomposition (SVD) </li></ul><ul><ul><li>Significant dimensions are associated with concepts </li></ul></ul><ul><li>Represent the instances of a target word to be clustered by taking the average of all the vectors associated with all the words in that context </li></ul><ul><ul><li>Context represented by an averaged vector </li></ul></ul><ul><li>Measure the similarity amongst instances via cosine and record in similarity matrix, or cluster the vectors directly </li></ul>
    163. 163. Co-occurrence matrix 4 2 0 0 0 3 0 1 box 0 1 2 2 1 2 0 0 memory 0 0 0 1 0 0 2 0 organ 0 2 0 3 2 0 0 0 debt 0 1 0 3 1 0 0 2 linux 0 1 0 3 2 0 0 0 sales 3 0 2 2 0 3 0 0 lab 1 0 2 0 0 1 2 0 petri 0 1 0 0 2 0 0 1 disk 1 0 2 0 0 0 3 0 body 0 0 0 3 1 0 0 2 pc plasma graphics tissue data ibm cells blood apple
    164. 164. Singular Value Decomposition A=UDV’
    165. 165. U -.52 .39 -.48 .02 .09 .41 -.09 .40 -.30 .08 .31 .43 -.26 -.39 -.6 .20 .00 -.00 -.00 -.02 -.01 .00 -.02 -.00 -.07 -.3 .14 -.49 -.07 .30 .25 .56 -.01 .08 .05 -.01 .24 -.08 .11 .46 .08 .03 -.04 .72 .09 -.31 -.01 .37 -.07 .01 -.21 -.31 -.34 -.45 -.68 .29 .00 .05 .83 .17 -.02 .25 -.45 .08 .03 .20 -.22 .31 -.60 .39 .13 .35 -.01 -.04 -.44 .08 .44 .59 -.49 .05 -.02 .63 .02 -.09 .52 -.2 .09 .35
    166. 166. D 0.00 0.00 0.00 0.66 1.26 2.30 2.52 3.25 3.99 6.36 9.19
    167. 167. V -.20 .22 -.07 -.10 -.87 -.07 -.06 .17 .19 -.26 .04 .03 .17 -.32 .02 .13 -.26 -.17 .06 -.04 .86 .50 -.58 .12 .09 -.18 -.27 -.18 -.12 -.47 .11 -.03 .12 .31 -.32 -.04 .64 -.45 -.14 -.23 .28 .07 -.23 -.62 -.59 .05 .02 -.12 .15 .11 .25 -.71 -.31 -.04 .08 .29 -.05 .05 .20 -.51 .09 -.03 .12 .31 -.01 .02 -.45 -.32 .50 .27 .49 -.02 .08 .21 -.06 .08 -.09 .52 -.45 -.01 .63 .03 -.12 -.31 .71 -.13 .39 -.12 .12 .15 .37 .07 .58 -.41 .15 .17 -.30 -.32 -.27 -.39 .11 .44 .25 .03 -.02 .26 .23 .39 .57 -.37 .04 .03 -.12 -.31 -.05 -.05 .04 .28 -.04 .08 .21
    168. 168. Co-occurrence matrix after SVD 1.1 1.0 .98 1.7 .86 .72 .85 .77 memory .00 .00 .17 1.2 .77 .00 .84 .00 organ .00 1.5 .00 3.2 2.1 .00 .00 1.2 debt .13 1.1 .03 2.7 1.7 .16 .00 .96 linux .41 .85 .35 2.2 1.3 .39 .15 .73 sales 2.3 .18 2.5 1.7 .35 2.0 1.7 .21 lab 1.4 .00 1.5 .49 .00 1.2 1.1 .00 germ .00 .91 .00 2.1 1.3 .01 .00 .76 disk 1.5 .00 1.6 .33 .00 1.3 1.2 .00 body .09 .86 .01 2.0 1.3 .11 .00 .73 pc plasma graphics tissue data ibm cells blood apple
    169. 169. Effect of SVD <ul><li>SVD reduces a matrix to a given number of dimensions This may convert a word level space into a semantic or conceptual space </li></ul><ul><ul><li>If “dog” and “collie” and “wolf” are dimensions/columns in the word co-occurrence matrix, after SVD they may be a single dimension that represents “canines” </li></ul></ul><ul><li>The dimensions are the principle components that may (hopefully) represent the meaning of concepts </li></ul><ul><li>SVD has effect of smoothing a very sparse matrix, so that there are very few 0 valued cells </li></ul>
    170. 170. Context Representation <ul><li>Represent each instance of the target word to be clustered by averaging the word vectors associated with its context </li></ul><ul><ul><li>This creates a “second order” representation of the context </li></ul></ul><ul><li>The context is represented not only by the words that occur therein, but also the words that occur with the words in the context elsewhere in the training corpus </li></ul>
    171. 171. Second Order Context Representation <ul><li>These two contexts share no words in common, yet they are similar! disk and linux both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory” </li></ul><ul><li>The two contexts are similar because they share many second order co-occurrences </li></ul><ul><li>I got a new disk today! </li></ul><ul><li>What do you think of linux? </li></ul>1.0 .72 memory .00 .00 organ .13 1.1 .03 2.7 1.7 .16 .00 .96 linux .00 .91 .00 2.1 1.3 .01 .00 .76 disk Plasma graphics tissue data ibm cells blood apple
    172. 172. Second Order Context Representation <ul><li>The bank of the Mississippi River was washed away. </li></ul>
    173. 173. First vs. Second Order Representations <ul><li>Comparison made by (Purandare and Pedersen, 2004) </li></ul><ul><li>Build word co-occurrence matrix using log-likelihood ratio </li></ul><ul><ul><li>Reduce via SVD </li></ul></ul><ul><ul><li>Cluster in vector or similarity space </li></ul></ul><ul><ul><li>Evaluate relative to manually created sense tags </li></ul></ul><ul><li>Experiments conducted with Senseval-2 data </li></ul><ul><ul><li>24 words, 50-200 training and test examples </li></ul></ul><ul><ul><li>Second order representation resulted in significantly better performance than first order, probably due to modest size of data. </li></ul></ul><ul><li>Experiments conducted with line, hard, serve </li></ul><ul><ul><li>4000-5000 instances, divided into 60-40 training-test split </li></ul></ul><ul><ul><li>First order representation resulted in better performance than second order, probably due to larger amount of data </li></ul></ul>
    174. 174. Analysis <ul><li>Agglomerative methods based on direct (first order) features require large amounts of data in order to obtain a reliable set of features </li></ul><ul><li>Large amounts of data are problematic for agglomerative clustering (due to exhaustive comparisons) </li></ul><ul><li>Second order representations allow you to make due with smaller amounts of data, and still get a rich (non-sparse) representation of context </li></ul><ul><li> is a complete system for performing unsupervised discrimination using first or second order context vectors in similarity or vector space, and includes support for SVD, clustering and evaluation </li></ul>
    175. 175. Outline <ul><li>What is Unsupervised Learning? </li></ul><ul><li>Task Definition </li></ul><ul><li>Agglomerative Clustering </li></ul><ul><li>LSI/LSA </li></ul><ul><li>Sense Discrimination Using Parallel Texts </li></ul>
    176. 176. Sense Discrimination Using Parallel Texts <ul><li>There is controversy as to what exactly is a “word sense” (e.g., Kilgarriff, 1997) </li></ul><ul><li>It is sometimes unclear how fine grained sense distinctions need to be to be useful in practice. </li></ul><ul><li>Parallel text may present a solution to both problems! </li></ul><ul><ul><li>Text in one language and its translation into another </li></ul></ul><ul><li>Resnik and Yarowsky (1997) suggest that word sense disambiguation concern itself with sense distinctions that manifest themselves across languages. </li></ul><ul><ul><li>A “bill” in English may be a “pico” (bird jaw) in or a “cuenta” (invoice) in Spanish. </li></ul></ul>
    177. 177. Parallel Text <ul><li>Parallel Text can be found on the Web and there are several large corpora available (e.g., UN Parallel Text, Canadian Hansards) </li></ul><ul><li>Manual annotation of sense tags is not required! However, text must be word aligned (translations identified between the two languages). </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li>Workshop on Parallel Text, NAACL 2003 </li></ul></ul><ul><li>Given word aligned parallel text, sense distinctions can be discovered. (e.g., Li and and Li, 2002, Diab, 2002) </li></ul>
    178. 178. References <ul><li>(Diab, 2002) Diab, Mona and Philip Resnik, An Unsupervised Method for Word Sense Tagging using Parallel Corpora, Proceedings of ACL, 2002. </li></ul><ul><li>(Firth, 1957) A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis, Oxford University Press, Oxford. </li></ul><ul><li>(Kilgarriff, 1997) “I don’t believe in word senses”, Computers and the Humanities (31) pp. 91-113. </li></ul><ul><li>(Li and Li, 2002) Word Translation Disambiguation Using Bilingual Bootstrapping. Proceedings of ACL. Pp. 343-351. </li></ul><ul><li>(McQuitty, 1966) Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement (26) pp. 825-831. </li></ul><ul><li>(Miller and Charles, 1991) Contextual correlates of semantic similarity. Language and Cognitive Processes, 6 (1) pp. 1 - 28. </li></ul><ul><li>(Pedersen and Bruce, 1997) Distinguishing Word Sense in Untagged Text. In Proceedings of EMNLP2. pp 197-207. </li></ul><ul><li>(Purandare and Pedersen, 2004) Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. Proceedings of the Conference on Natural Language and Learning. pp. 41-48. </li></ul><ul><li>(Resnik and Yarowsky, 1997) A Perspective on Word Sense Disambiguation Methods and their Evaluation. The ACL-SIGLEX Workshop Tagging Text with Lexical Semantics. pp. 79-86. </li></ul><ul><li>(Schutze, 1998) Automatic Word Sense Discrimination. Computational Linguistics, 24 (1) pp. 97-123. </li></ul>
    179. 179. Part 7: How to Get Started in Word Sense Disambiguation Research
    180. 180. Outline <ul><li>Where to get the required ingredients? </li></ul><ul><ul><li>Machine Readable Dictionaries </li></ul></ul><ul><ul><li>Machine Learning Algorithms </li></ul></ul><ul><ul><li>Sense Annotated Data </li></ul></ul><ul><ul><li>Raw Data </li></ul></ul><ul><li>Where to get WSD software? </li></ul><ul><li>How to get your algorithms tested? </li></ul><ul><ul><li>Senseval </li></ul></ul>
    181. 181. Machine Readable Dictionaries <ul><li>Machine Readable format (MRD) </li></ul><ul><ul><li>Oxford English Dictionary </li></ul></ul><ul><ul><li>Collins </li></ul></ul><ul><ul><li>Longman Dictionary of Ordinary Contemporary English (LDOCE) </li></ul></ul><ul><li>Thesauri – add synonymy information </li></ul><ul><ul><li>Roget Thesaurus </li></ul></ul><ul><li>Semantic networks – add more semantic relations </li></ul><ul><ul><li>WordNet </li></ul></ul><ul><ul><ul><li>Dictionary files, source code </li></ul></ul></ul><ul><ul><li>EuroWordNet </li></ul></ul><ul><ul><ul><li>Seven European languages </li></ul></ul></ul>
    182. 182. Machine Learning Algorithms <ul><li>Many implementations av