Measuring Similarity Between Contexts and Concepts


Invited talk from 2005 given at Yahoo and West Publishing...



  1. 1. Measuring Similarity Between Concepts and Contexts Ted Pedersen Department of Computer Science University of Minnesota, Duluth
  2. 2. The problems… <ul><li>Recognize similar (or related) concepts </li></ul><ul><ul><li>frog : amphibian </li></ul></ul><ul><ul><li>Duluth : snow </li></ul></ul><ul><li>Recognize similar contexts </li></ul><ul><ul><li>I bought some food at the store : </li></ul></ul><ul><ul><li>I purchased something to eat at the market </li></ul></ul>
  3. 3. Similarity and Relatedness <ul><li>Two concepts are similar if they are connected by is-a relationships. </li></ul><ul><ul><li>A frog is-a-kind-of amphibian </li></ul></ul><ul><ul><li>An illness is-a health_condition </li></ul></ul><ul><li>Two concepts can be related in many ways… </li></ul><ul><ul><li>A human has-a-part liver </li></ul></ul><ul><ul><li>Duluth receives-a-lot-of snow </li></ul></ul><ul><li>… similarity is one way to be related </li></ul>
  4. 4. The approaches… <ul><li>Measure conceptual similarity using a structured repository of knowledge </li></ul><ul><ul><li>Lexical database WordNet </li></ul></ul><ul><li>Measure contextual similarity using knowledge lean methods that are based on co-occurrence information from large corpora </li></ul>
  5. 5. Why measure conceptual similarity? <ul><li>A word will take the sense that is most related to the surrounding context </li></ul><ul><ul><li>I love Java , especially the beaches and the weather. </li></ul></ul><ul><ul><li>I love Java , especially the support for concurrent programming. </li></ul></ul><ul><ul><li>I love java , especially first thing in the morning with a bagel. </li></ul></ul>
  6. 6. Word Sense Disambiguation <ul><li>… can be performed by finding the sense of a word most related to its neighbors </li></ul><ul><li>Here, we define similarity and relatedness with respect to WordNet </li></ul><ul><ul><li>WordNet::Similarity </li></ul></ul><ul><ul><li> </li></ul></ul><ul><li>WordNet::SenseRelate </li></ul><ul><ul><li>AllWords – assign a sense to every content word </li></ul></ul><ul><ul><li>TargetWord – assign a sense to a given word </li></ul></ul><ul><ul><li> </li></ul></ul>
  7. 7. SenseRelate <ul><li>For each sense of a target word in context </li></ul><ul><ul><li>For each content word in the context </li></ul></ul><ul><ul><ul><li>For each sense of that content word </li></ul></ul></ul><ul><ul><ul><ul><li>Measure similarity/relatedness between sense of target word and sense of content word with WordNet::Similarity </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Keep running sum for score of each sense of target </li></ul></ul></ul></ul><ul><li>Pick sense of target word with highest score with words in context </li></ul>
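The nested loops above can be sketched in a few lines of Python. Here `senses()` and `relatedness()` are hypothetical stand-ins with toy values; the real system queries WordNet for the sense inventory and WordNet::Similarity for the scores.

```python
# Sketch of the SenseRelate loop. senses() and relatedness() are
# hypothetical stand-ins with toy values for illustration only.

def senses(word):
    # Hypothetical sense inventory.
    toy = {"java": ["java#island", "java#language", "java#coffee"],
           "programming": ["programming#activity"]}
    return toy.get(word, [])

def relatedness(sense_a, sense_b):
    # Hypothetical relatedness scores (e.g. from the lesk measure).
    toy = {frozenset(["java#language", "programming#activity"]): 5.0}
    return toy.get(frozenset([sense_a, sense_b]), 0.0)

def sense_relate(target, context):
    best_sense, best_score = None, float("-inf")
    for target_sense in senses(target):          # each sense of the target
        score = 0.0
        for word in context:                     # each content word in context
            for word_sense in senses(word):      # each sense of that word
                score += relatedness(target_sense, word_sense)
        if score > best_score:                   # keep the running best
            best_sense, best_score = target_sense, score
    return best_sense

print(sense_relate("java", ["programming"]))     # java#language
```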
  8. 8. WordNet::Similarity <ul><li>Path based measures </li></ul><ul><ul><li>Shortest path (path) </li></ul></ul><ul><ul><li>Wu & Palmer (wup) </li></ul></ul><ul><ul><li>Leacock & Chodorow (lch) </li></ul></ul><ul><ul><li>Hirst & St-Onge (hso) </li></ul></ul><ul><li>Information content measures </li></ul><ul><ul><li>Resnik (res) </li></ul></ul><ul><ul><li>Jiang & Conrath (jcn) </li></ul></ul><ul><ul><li>Lin (lin) </li></ul></ul><ul><li>Gloss based measures </li></ul><ul><ul><li>Banerjee and Pedersen (lesk) </li></ul></ul><ul><ul><li>Patwardhan and Pedersen (vector, vector_pairs) </li></ul></ul>
  9. 9. [Figure: fragment of the WordNet is-a hierarchy, with nodes object, artifact, instrumentality, conveyance, vehicle, motor-vehicle, car, watercraft, boat, ark, article, ware, table-ware, cutlery, fork — from Jiang and Conrath (1997)]
  10. 10. Path Finding <ul><li>Find the shortest is-a path between two concepts </li></ul><ul><ul><li>Rada et al. (1989) </li></ul></ul><ul><ul><li>Scaled by depth of hierarchy </li></ul></ul><ul><ul><ul><li>Leacock & Chodorow (1998) </li></ul></ul></ul><ul><ul><li>Depth of subsuming concept scaled by sum of the depths of individual concepts </li></ul></ul><ul><ul><ul><li>Wu and Palmer (1994) </li></ul></ul></ul>
  11. 11. [Figure: the same fragment of the WordNet is-a hierarchy (object through fork) repeated]
  12. 12. Information Content <ul><li>Measure of specificity in is-a hierarchy (Resnik, 1995) </li></ul><ul><ul><li>-log (probability of concept) </li></ul></ul><ul><ul><li>High information content values mean very specific concepts (like pitch-fork and basketball shoe) </li></ul></ul><ul><li>Count how often a concept occurs in a corpus </li></ul><ul><ul><li>Increment the count associated with that concept, and propagate the count up! </li></ul></ul><ul><ul><li>If based on word forms, increment all concepts associated with that form </li></ul></ul>
  13. 13. [Figure: observed “car” — car (73+1), propagating up to motor vehicle (327+1) and *root* (32783+1); minicab (6), cab (23), bus (17), stock car (12) unchanged]
  14. 14. [Figure: observed “stock car” — stock car (12+1), propagating up to car (74+1), motor vehicle (328+1), and *root* (32784+1); minicab (6), cab (23), bus (17) unchanged]
  15. 15. [Figure: after counting concepts — *root* (32785), motor vehicle (329, IC = 1.998), car (75), cab (23), minicab (6), bus (17), stock car (13, IC = 3.042)]
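The counting scheme above can be written as a short sketch: each observation increments the observed concept and every ancestor up to the root, and IC(c) = -log10(count(c) / count(root)). The tiny hierarchy below mirrors the slides' example, with made-up (much smaller) counts.

```python
# Toy version of concept counting with upward propagation.

import math

parent = {"stock car": "car", "cab": "car", "bus": "car",
          "minicab": "cab", "car": "motor vehicle",
          "motor vehicle": "*root*"}
count = dict.fromkeys(list(parent) + ["*root*"], 0)

def observe(concept):
    while concept is not None:
        count[concept] += 1               # propagate the count up
        concept = parent.get(concept)

for _ in range(3):
    observe("car")
observe("stock car")                      # also increments car and above

def ic(concept):
    return -math.log10(count[concept] / count["*root*"])

print(count["car"], count["stock car"], count["*root*"])   # 4 1 4
print(round(ic("stock car"), 3))                           # 0.602
```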
  16. 16. Similarity and Information Content <ul><li>Resnik (1995) uses the information content of the least common subsumer to express the similarity of two concepts </li></ul><ul><li>Lin (1998) scales the information content of the least common subsumer by the sum of the information content of the two concepts </li></ul><ul><li>Jiang & Conrath (1997) take the difference between the sum of the two concepts’ information content and twice that of the least common subsumer (a distance) </li></ul>
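Written out, the three IC-based measures are small formulas; `ic_lcs` below is the information content of the least common subsumer, `ic_c1` and `ic_c2` those of the two concepts:

```python
# The three information-content measures, under their usual formulations.

def resnik(ic_c1, ic_c2, ic_lcs):
    return ic_lcs                           # shared information only

def lin(ic_c1, ic_c2, ic_lcs):
    return 2 * ic_lcs / (ic_c1 + ic_c2)     # scaled into [0, 1]

def jcn_distance(ic_c1, ic_c2, ic_lcs):
    return ic_c1 + ic_c2 - 2 * ic_lcs       # a distance, not a similarity

print(lin(3.0, 5.0, 2.0))                   # 0.5
print(jcn_distance(3.0, 5.0, 2.0))          # 4.0
```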
  17. 17. Why doesn’t this solve the problem? <ul><li>Concepts must be organized in a hierarchy, and connected in that hierarchy </li></ul><ul><ul><li>Limited to comparing nouns with nouns, or maybe verbs with verbs </li></ul></ul><ul><ul><li>Limited to similarity measures (is-a) </li></ul></ul><ul><li>What about mixed parts of speech? </li></ul><ul><ul><li>Murder (noun) and horrible (adjective) </li></ul></ul><ul><ul><li>Tobacco (noun) and drinking (verb) </li></ul></ul>
  18. 18. Using Dictionary Glosses to Measure Relatedness <ul><li>Lesk (1986) Algorithm – measure the relatedness of two concepts by counting the number of shared words in their definitions </li></ul><ul><ul><li>Cold - a mild viral infection involving the nose and respiratory passages (but not the lungs) </li></ul></ul><ul><ul><li>Flu - an acute febrile highly contagious viral disease </li></ul></ul><ul><li>Adapted Lesk (Banerjee & Pedersen, 2003) – expand glosses to include those of directly related concepts </li></ul><ul><ul><li>Cold - a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system </li></ul></ul><ul><ul><li>Flu - an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another </li></ul></ul>
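A minimal form of the Lesk overlap count above fits in a few lines; the stop list here is a toy choice, and overlap is counted on word types rather than the multi-word phrase matching the adapted measure actually uses.

```python
# Minimal Lesk: relatedness as the number of shared gloss word types.

STOP = {"a", "an", "and", "the", "but", "not", "of", "in"}

def lesk_overlap(gloss_a, gloss_b):
    words_a = set(gloss_a.lower().split()) - STOP
    words_b = set(gloss_b.lower().split()) - STOP
    return len(words_a & words_b)

cold = "a mild viral infection involving the nose and respiratory passages"
flu = "an acute febrile highly contagious viral disease"
print(lesk_overlap(cold, flu))   # 1 -- only "viral" is shared
```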
  19. 19. Context/Gloss Vectors <ul><li>Leskian approaches require exact matches in glosses </li></ul><ul><ul><li>Glosses are short, may use related but not identical words </li></ul></ul><ul><li>Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora </li></ul><ul><ul><li>Rows are words found in glosses, columns represent their co-occurring words in a corpus, cell values are their log-likelihood </li></ul></ul><ul><li>Average the word vectors to create a single vector that represents the gloss/sense </li></ul><ul><ul><li>Patwardhan & Pedersen, 2003 </li></ul></ul><ul><li>Measure relatedness using cosine rather than exact match! </li></ul>
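The gloss-vector idea above can be sketched directly: replace each content word in a gloss with its co-occurrence row, average the rows, and compare senses with the cosine. The tiny co-occurrence matrix below is invented for illustration.

```python
# Sketch of gloss vectors: average word co-occurrence rows, compare by cosine.

import math

cooc = {"viral":      [2.0, 0.0, 1.0],   # made-up co-occurrence rows
        "infection":  [3.0, 0.0, 0.5],
        "contagious": [2.5, 0.2, 0.0]}

def gloss_vector(words):
    vecs = [cooc[w] for w in words if w in cooc]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

cold = gloss_vector(["viral", "infection"])
flu = gloss_vector(["viral", "contagious"])
print(round(cosine(cold, flu), 3))   # high, despite little exact overlap
```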
  20. 20. Gloss/Context Vectors
  21. 21. Experiment <ul><li>Senseval-2 data consists of 73 nouns, verbs, and adjectives, approximately 8,600 “training” examples and 4,300 “test” examples. </li></ul><ul><ul><li>Best supervised system 64% </li></ul></ul><ul><ul><li>SenseRelate 53% (lesk, vector) </li></ul></ul><ul><ul><li>Most frequent sense 48% </li></ul></ul>
  22. 22. Results <ul><li>SenseRelate achieves disambiguation accuracy better than most frequent sense!! </li></ul><ul><ul><li>This is more unusual than you would think. </li></ul></ul><ul><li>Window of context is defined by position, includes 2 content words to both the left and right which are measured against the word being disambiguated. </li></ul><ul><ul><li>Positional proximity is not always associated with semantic similarity. </li></ul></ul>
  23. 23. Why this doesn’t solve the problem.. <ul><li>WordNet </li></ul><ul><ul><li>Nouns – 80,000 concepts </li></ul></ul><ul><ul><li>Verbs – 13,000 concepts </li></ul></ul><ul><ul><li>Adjectives – 18,000 concepts </li></ul></ul><ul><ul><li>Adverbs – 4,000 concepts </li></ul></ul><ul><li>Words not found in WordNet can’t be disambiguated by SenseRelate </li></ul>
  24. 24. Knowledge Lean Methods <ul><li>Can measure similarity between two words by comparing co-occurrence vectors created for each. </li></ul><ul><li>Can measure similarity of two contexts by representing them as 2nd-order co-occurrence vectors and comparing. </li></ul>
  25. 25. Word Sense Discrimination <ul><li>Cluster different senses of words like line or interest based on contextual similarity. </li></ul><ul><ul><li>Pedersen & Bruce, 1997 </li></ul></ul><ul><ul><li>Schütze, 1998 </li></ul></ul><ul><ul><li>Purandare & Pedersen, 2004 </li></ul></ul><ul><li>Hard to evaluate: word senses are somewhat ill defined, and the distinctions made by clustering methods may or may not correspond to human intuitions </li></ul><ul><li>http:// </li></ul>
  26. 26. Name Discrimination <ul><li>Names that occur in similar contexts may refer to the same person. </li></ul><ul><ul><li>George Miller is an eminent psychologist. </li></ul></ul><ul><ul><li>George Miller is one of the founders of modern cognitive science. </li></ul></ul><ul><ul><li>George Miller is a member of the US House of Representatives. </li></ul></ul>
  27. 31. Objective <ul><li>Given some number of contexts containing “John Smith”, identify those that are similar to each other </li></ul><ul><li>Group similar contexts together, assume they are associated with single individual </li></ul><ul><li>Generate an identifying label from the content of the different clusters </li></ul>
  28. 32. Similarity of Context? <ul><li>Second order Co-occurrences </li></ul><ul><ul><li>He drives his car fast / Jim speeds in his auto </li></ul></ul><ul><ul><li>Car -> motor, garage, gasoline, insurance </li></ul></ul><ul><ul><li>Auto -> motor, insurance, gasoline, accident </li></ul></ul><ul><ul><li>Car and Auto occur with many of the same words. They are therefore similar! </li></ul></ul><ul><ul><li>Less direct relationship, more resistant to sparsity! </li></ul></ul>
  29. 33. Feature Selection <ul><li>Bigrams – two word sequences that may have one intervening word between them </li></ul><ul><ul><li>Frequency > 1 </li></ul></ul><ul><ul><li>Log-likelihood ratio > 3.841 </li></ul></ul><ul><ul><li>OR stop list </li></ul></ul><ul><li>Must occur within Ft positions of target, Ft typically set to 5 or 20 </li></ul>
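The log-likelihood cutoff above can be computed for a bigram from its 2×2 contingency table; 3.841 is the chi-squared critical value at p = 0.05 with one degree of freedom. The counts in the example are invented.

```python
# G^2 log-likelihood ratio for a bigram's 2x2 contingency table.

import math

def log_likelihood(n11, n12, n21, n22):
    """G^2 = 2 * sum over cells of observed * ln(observed / expected)."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    cells = [(n11, row1 * col1 / n), (n12, row1 * col2 / n),
             (n21, row2 * col1 / n), (n22, row2 * col2 / n)]
    return 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)

# bigram seen together 10 times; words seen apart 20 and 30 times; 940 others
score = log_likelihood(10, 20, 30, 940)
print(score > 3.841)   # True -- this bigram passes the cutoff
```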
  30. 34. Second Order Context Representation <ul><li>Bigrams used to create matrix </li></ul><ul><ul><li>Cell values = log-likelihood of word pair </li></ul></ul><ul><li>Rows are co-occurrence vector for a word </li></ul><ul><li>Represent context by averaging vectors of words in that context </li></ul><ul><ul><li>Context includes the Cxt positions around the target, where Cxt is typically 5 or 20. </li></ul></ul>
  31. 35. 2nd-Order Context Vectors <ul><li>He won an Oscar, but Tom Hanks is still a nice guy. </li></ul>
                   needle     family      war      movie     actor    football   baseball
      won               0          0   8.7399    51.7812    30.520    3324.98    18.5533
      Oscar             0          0        0   136.0441    29.576          0          0
      guy               0   18818.55        0          0         0   205.5469   134.5102
      O2 context        0    6272.85   2.9133    62.6084    20.032    1176.84     51.021
  32. 36. Limitations of 2nd Order
      Weapon   Missile   Shoot   Fire   Destroy   Murder   Kill
           0     52.27       0   0.92         0     4.21      0
       28.72         0    3.24      0      1.28        0   2.53

      Execute   Command   Bomb   Pipe   Fire     CD   Burn
        17.77         0   14.6   46.2   22.1      0   34.2
        19.23      2.36      0   72.7      0   1.28   2.56
  33. 37. Singular Value Decomposition <ul><li>What it does (for sure): </li></ul><ul><ul><li>Smoothes out zeroes </li></ul></ul><ul><ul><li>Finds Principal Components </li></ul></ul><ul><li>What it might do: </li></ul><ul><ul><li>Capture Polysemy </li></ul></ul><ul><ul><li>Word Space to Semantic Space </li></ul></ul>
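A small demonstration of the smoothing effect (this sketch assumes numpy is available): reconstructing a matrix from only its top singular components keeps the principal structure while redistributing mass, so some exact zeros become small nonzero values.

```python
# Rank-k SVD reconstruction of a small made-up co-occurrence matrix.

import numpy as np

M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.8, 0.1, 0.9, 0.0],
              [0.0, 3.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                     # keep the top-2 principal components
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(M_k.shape)                          # same shape, smoothed values
```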
  34. 38. After context representation… <ul><li>Second order vector is an average of word vectors that make up context, captures indirect relationships </li></ul><ul><ul><li>Reduced by SVD to principal components </li></ul></ul><ul><li>Now, cluster the vectors! </li></ul><ul><ul><li>We use the method of repeated bisections </li></ul></ul><ul><ul><li>CLUTO </li></ul></ul>
  35. 39. Evaluation (before mapping)
             c1   c2   c4   c3
      C1      2    3    0   10
      C2      1    7    1    1
      C3      6    1    1    2
      C4      2    1   15    2
  36. 40. Evaluation (after mapping) Agreement = 38/55 = 0.69
              C1   C2   C3   C4   total
      C1      10    3    2    0      15
      C2       1    7    1    1      10
      C3       2    1    6    1      10
      C4       2    1    2   15      20
      total   15   12   11   17      55
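One simple way to perform the mapping step in this evaluation: try every assignment of discovered clusters to true senses and keep the one that maximizes the diagonal of the confusion matrix (exhaustive search is fine at this size; larger problems would use the Hungarian algorithm).

```python
# Map clusters to senses by maximizing the confusion-matrix diagonal.

from itertools import permutations

# rows = true senses C1..C4, columns = clusters c1..c4 (from the slides)
confusion = [[2, 3, 10, 0],
             [1, 7, 1, 1],
             [6, 1, 2, 1],
             [2, 1, 2, 15]]

def best_mapping(m):
    n = len(m)
    return max(permutations(range(n)),
               key=lambda p: sum(m[i][p[i]] for i in range(n)))

mapping = best_mapping(confusion)
agreement = sum(confusion[i][mapping[i]] for i in range(len(confusion)))
total = sum(map(sum, confusion))
print(agreement, total)   # 38 55 -- agreement = 38/55 = 0.69
```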
  37. 41. Majority Sense Classifier: Maj. = 17/55 = 0.31
  38. 42. Experimental Data <ul><li>Created from AFE GigaWord corpus </li></ul><ul><li>170,969,000 words </li></ul><ul><li>May 1994-May 1997 </li></ul><ul><li>December 2001-June 2002 </li></ul><ul><li>Created name-conflated pseudo-words </li></ul><ul><ul><li>25 words to left and right of target </li></ul></ul>
  39. 43. Name Conflated Data
      Name            Count     Name                  Count     Conflated     Total     Maj.
      Ronaldo         1,652     David Beckham           740     RoBeck        2,452    69.3%
      Tajik           3,002     Rolf Ekeus            1,071     JikRol        4,073    73.7%
      Microsoft       3,401     IBM                   2,406     MSIIBM        5,807    58.6%
      Shimon Peres    7,846     Slobodan Milosovic    6,176     MonSlo       13,734    56.0%
      Jordan         25,539     Egyptian             21,762     JorGypt      46,431    53.9%
      Japan         118,712     France              112,357     JapAnce     231,069    51.4%
  40. 44. [Table: discrimination accuracy (%) by context window (Cxt) and feature window (Ft)]
                                 Cxt 5           Cxt 20
      Name       # Cxts   Maj.   Ft 5   Ft 20    Ft 5   Ft 20
      RoBeck      2,452   69.3   57.3    72.7    85.9    54.7
      JikRol      4,073   73.7   94.7    96.2    91.0    90.4
      MSIIBM      5,807   58.6   47.7    51.3    68.0    60.0
      MonSlo     13,734   56.0   62.8    96.6    54.6    91.4
      JorGypt    46,431   53.9   56.6    59.1    57.0    53.0
      JapAnce   231,069   51.4   51.1    51.1    50.3    50.3
  41. 45. Conclusions <ul><li>Tradeoff between size of context and feature selection space </li></ul><ul><ul><li>Context small – Feature large : narrow window around target word where many possible features are represented </li></ul></ul><ul><ul><li>Context large – Feature small : large window around target word where a selective set of features is represented </li></ul></ul><ul><li>SVD neither helped nor hurt </li></ul><ul><ul><li>Results shown are without SVD </li></ul></ul>
  42. 46. Ongoing work <ul><li>Creating Path Finding Measures of Relatedness </li></ul><ul><li>Stopping Clustering Automatically </li></ul><ul><li>Cluster labeling </li></ul><ul><li>… Bring together finding conceptual similarity and contextual similarity </li></ul>
  43. 47. Thanks to… <ul><li>WordNet::Similarity and SenseRelate </li></ul><ul><li> </li></ul><ul><li>http:// </li></ul><ul><ul><li>Siddharth Patwardhan </li></ul></ul><ul><ul><li>Satanjeev Banerjee </li></ul></ul><ul><ul><li>Jason Michelizzi </li></ul></ul><ul><li>SenseClusters </li></ul><ul><li>http:// </li></ul><ul><ul><li>Anagha Kulkarni </li></ul></ul><ul><ul><li>Amruta Purandare </li></ul></ul>