Information Content Measures of Semantic Similarity Perform Better Without Sense-Tagged Text
Ted Pedersen, University of Minnesota, Duluth

Abstract

This poster presents an empirical comparison of similarity measures for pairs of concepts based on Information Content. It shows that using modest amounts of untagged text to derive Information Content results in higher correlation with human similarity judgments than using the largest available corpus of manually annotated sense-tagged data.

Semantic Similarity

Is a dog similar to a human? Is a cow similar to a barn? Is a time machine similar to a telescope? Is a computer similar to a surf board?

Similar things share certain traits: X and Y both (have | are made of | like | think about | live in | studied at | play | hate | are a kind of | eat) Z.

Aristotle, 384 BC - 322 BC
● Great Chain of Being - nature with ranks
● Law of Similarity

Carl Linnaeus, 1707-1778
● Nature as nested hierarchy
● Binomial Nomenclature: Genus + Species

This leads to TAXONOMY - concepts organized in a hierarchy joined by "is-a" relations.

NLP needs to incorporate similarity for WSD, lexical selection in text generation, semantic search, recognizing textual entailment...

Information Content

The hypothesis underlying this poster is that Information Content values can be reasonably calculated without sense-tagged text, and that sense-tagged text is not available in sufficient quantities to be considered reliable for this task.

(Figure: a small portion of WordNet, showing Information Content for concepts based on sense-tagged text (SemCor), newspaper text (AFE), and on the assumption that each sense occurs one time (ADD-1).) If the text is not sense tagged, then each possible sense of a word (and its ancestors) is incremented with each occurrence.

Experiments

A number of experiments were conducted in which different amounts of raw text were used to estimate Information Content values, which were then used to compute semantic similarity. The automatically computed similarity values were measured for rank correlation with three manually created gold standard datasets. This poster shows the results for the Miller and Charles set of 30 word pairs.

Overall there was no advantage to using sense-tagged text to compute Information Content values, and the Information Content based measures of semantic similarity improved upon the path based measures.

Of particular interest was the fact that the ADD-1 smoothing method (without any text!) provided results almost as good as raw newspaper text. This shows that the real power of Information Content lies not in the frequency counts but in the structure of the taxonomy. This would have come as no surprise to Aristotle or Linnaeus.

Measures of Semantic Similarity

A measure of semantic similarity quantifies the degree to which two concepts are "alike".
● How much does c1 remind you of c2?

Concepts can be arranged in a hierarchy, where children "are a kind of" their parents; path lengths give a plausible approximation of similarity ... but path lengths are inconsistent as you move to more general or more specific concepts.

Resnik (1995) assigned a value of specificity to each concept, which corrects for path length:
● Information Content: IC(c1) = -log prob(c1)
● The similarity of c1 and c2 is the IC of their least common subsumer (their most specific shared ancestor).

Resnik (1995):
res(c1, c2) = IC(LCS(c1, c2))

Jiang & Conrath (1997):
jcn(c1, c2) = 1 / (IC(c1) + IC(c2) - 2 * res(c1, c2))

Lin (1998):
lin(c1, c2) = 2 * res(c1, c2) / (IC(c1) + IC(c2))
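To make the definitions concrete, here is a minimal Python sketch (not the poster's code) that builds ADD-1 style counts over a hypothetical toy fragment of an is-a hierarchy, where every concept is assumed to occur once and each occurrence is also credited to all of its ancestors, and then computes res, jcn, and lin from the resulting Information Content values.

import math

# Toy "is-a" fragment (hypothetical links; real values come from WordNet).
PARENT = {
    "telescope": "magnifier",
    "microscope": "magnifier",
    "magnifier": "instrument",
    "instrument": "device",
    "machine": "device",
    "device": "entity",
    "entity": None,
}

def ancestors(concept):
    # Return the concept followed by its ancestors, most specific first.
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

# ADD-1 counting: every concept is assumed to occur once, and each
# occurrence is also credited to every ancestor, so counts accumulate
# toward the root of the taxonomy.
counts = {c: 0 for c in PARENT}
for c in PARENT:
    for a in ancestors(c):
        counts[a] += 1

TOTAL = counts["entity"]  # the root subsumes every occurrence

def ic(c):
    # Information Content: IC(c) = -log prob(c)
    return -math.log(counts[c] / TOTAL)

def lcs(c1, c2):
    # Least common subsumer: the most specific shared ancestor.
    shared = set(ancestors(c1))
    for a in ancestors(c2):
        if a in shared:
            return a

def res(c1, c2):
    return ic(lcs(c1, c2))

def jcn(c1, c2):
    return 1.0 / (ic(c1) + ic(c2) - 2.0 * res(c1, c2))

def lin(c1, c2):
    return 2.0 * res(c1, c2) / (ic(c1) + ic(c2))

for pair in [("telescope", "microscope"), ("instrument", "machine")]:
    print(pair, round(res(*pair), 2), round(lin(*pair), 2), round(jcn(*pair), 2))

With real data the counts come from corpus frequencies instead: when the text is not sense-tagged, each occurrence of a word is credited to every one of its possible senses and their ancestors, exactly as described above, but the three formulas are unchanged.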
Similarity Measures with IC from Different Sources

Concept Pairs                     | PATH | RES: SemCor, AFE, ADD-1 | LIN: SemCor, AFE, ADD-1 | JCN: SemCor, AFE, ADD-1
telescope#n#1 / microscope#n#1    | 0.33 | 9.28, 10.81, 0.93       | 0.93, 0.89, 0.91        | 0.66, 0.38, 0.66
instrument#n#1 / machine#n#1      | 0.33 | 4.37, 4.66, 0.70        | 0.70, 0.69, 0.65        | 0.26, 0.24, 0.26
telescope#n#1 / time_machine#n#1  | 0.17 | 4.37, 4.66, 0.41        | 0.41, 0.34, 0.36        | 0.08, 0.06, 0.08
catapult#n#3 / time_machine#n#1   | 0.14 | 4.37, 4.66, 0           | 0, 0.29, 0.31           | 0, 0.04, 0.06

Correlation with Miller & Charles Data

Corpus for IC           | Size (words) | Coverage | JCN | LIN | RES
SemCor (sense-tagged)   | 226,000      | .24      | .72 | .73 | .74
SemCor (raw)            | 670,000      | .37      | .82 | .79 | .76
xie199501 (Jan 1995)    | 1.2 million  | .35      | .78 | .75 | .73
xie199501-02            | 2.3 million  | .39      | .79 | .75 | .73
xie199501-12            | 16 million   | .51      | .87 | .81 | .75
xie1995-1998            | 73 million   | .60      | .89 | .81 | .75
XIE 1995-2001           | 133 million  | .64      | .88 | .81 | .76
AFE                     | 174 million  | .66      | .88 | .80 | .77
APW                     | 560 million  | .75      | .84 | .79 | .76
NYT                     | 963 million  | .83      | .84 | .79 | .77
ADD-1                   | 0            | 1.00     | .85 | .77 | .76

Experimental Data

http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

References

J. Jiang and D. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.
D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, WI.
P. Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal.
L. Finkelstein, et al. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116-131.
G. A. Miller and W. G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28.
H. Rubenstein and J. B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8:627-633.

Contact

Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse

Software

T. Pedersen, et al. 2004. WordNet::Similarity - Measuring the relatedness of concepts. In Proceedings of HLT/NAACL 2004, pages 38-41, Boston, MA.
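WordNet::Similarity is a Perl package. For readers working in Python, NLTK's WordNet interface implements the same path, res, lin, and jcn measures and ships pre-computed Information Content files; the sketch below is an analogue for illustration, not the poster's software, and assumes the NLTK 'wordnet' and 'wordnet_ic' data have been downloaded.

from nltk.corpus import wordnet as wn, wordnet_ic

# Pre-computed Information Content counts derived from SemCor (sense-tagged).
ic = wordnet_ic.ic('ic-semcor.dat')

telescope = wn.synset('telescope.n.01')
microscope = wn.synset('microscope.n.01')

print(telescope.path_similarity(microscope))      # path-based baseline
print(telescope.res_similarity(microscope, ic))   # Resnik (1995)
print(telescope.lin_similarity(microscope, ic))   # Lin (1998)
print(telescope.jcn_similarity(microscope, ic))   # Jiang & Conrath (1997)

The exact numbers will not match the tables above, since the IC files and smoothing details differ, but the relative behaviour of the measures can be explored in the same way.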
