2. First paper
Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, July 2006.
3. The problem
Judging how similar two text segments are, e.g.:
- "I own a dog." / "I have an animal."
- "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him."
- "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
4. Some applications…
- Information retrieval using the vector space model [Salton & Lesk 1971]
- Relevance feedback and text classification [Rocchio 1971]
- Word sense disambiguation [Lesk 1986; Schütze 1998]
- Extractive summarization [Salton et al. 1997]
- Automatic evaluation of machine translation [Papineni et al. 2002]
- Text summarization [Lin & Hovy 2003]
- Evaluation of text coherence [Lapata & Barzilay 2005]
5. Solution 1: lexical similarity
- Simple lexical matching
- Using the vector space model
6. Vector Space Model
SaS = Sense and Sensibility (Austen), PaP = Pride and Prejudice (Austen), WH = Wuthering Heights (Brontë)
Sim(SaS, PaP) = 0.999; Sim(SaS, WH) = 0.888
Problems: synonymy, polysemy
[source: Manning et al., IR book]
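To make the computation concrete, here is a minimal cosine-similarity sketch over raw term counts; the counts below are illustrative stand-ins, not the actual frequencies behind the 0.999 and 0.888 figures:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy term counts; real vectors would span the whole vocabulary.
sas = Counter({"affection": 115, "jealous": 10, "gossip": 2})
pap = Counter({"affection": 58, "jealous": 7, "gossip": 0})
wh = Counter({"affection": 20, "jealous": 11, "gossip": 6})

print(cosine(sas, pap))  # high: similar word usage
print(cosine(sas, wh))   # lower, yet still inflated by shared common terms
```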
7. Solution 2 (this paper)
Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based, and reduce text-to-text similarity to word-to-word similarity:
- maxSim = highest word-to-word similarity, based on one of 8 similarity measures (next slides)
- idf = inverse document frequency, capturing specificity: [collie, sheepdog] > [get, become]
- Only open-class words that share the same POS are matched.
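For reference, the combined measure of Mihalcea et al. (2006) has the following form, with maxSim and idf as defined above:

```latex
\mathrm{sim}(T_1,T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \mathrm{maxSim}(w,T_2)\,\mathrm{idf}(w)}
       {\sum_{w \in T_1} \mathrm{idf}(w)}
  + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w,T_1)\,\mathrm{idf}(w)}
         {\sum_{w \in T_2} \mathrm{idf}(w)}
\right)
```

Each direction scores how well the words of one segment are covered by the other, weighted by specificity, and the two directions are averaged.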
9. Pointwise Mutual Information (PMI)
- Unsupervised
- Based on co-occurrence in a very large corpus
- NEAR query: co-occurrence within a ten-word window
- 72.5% accuracy at identifying the correct synonym among 4 TOEFL synonym choices
Source: [Turney 2001]
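For reference, PMI for a word pair, together with the hit-count estimate that PMI-IR uses (N is the number of indexed documents):

```latex
\mathrm{PMI}(w_1,w_2) = \log_2 \frac{p(w_1,w_2)}{p(w_1)\,p(w_2)}
\approx \log_2 \frac{N \cdot \mathrm{hits}(w_1\ \mathrm{NEAR}\ w_2)}
                    {\mathrm{hits}(w_1) \cdot \mathrm{hits}(w_2)}
```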
10. PMI - Example
"When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him."
Using PMI-IR and the NEAR operator of AltaVista.
Result: 0.80 vs. cosine (0.46)
11. Latent Semantic Analysis [Landauer 1998]
Term co-occurrence is captured by means of dimensionality reduction, through singular value decomposition (SVD) of the term-by-document matrix: T = UΣV^T, truncated to the k largest singular values as T_k = U_k Σ_k V_k^T.
- T = term-by-document matrix
- Σ_k = diagonal k × k matrix of singular values
- U and V are column-orthogonal matrices
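A minimal numpy sketch of the truncated SVD step (the matrix here is a toy example):

```python
import numpy as np

# Toy term-by-document matrix T (rows = terms, columns = documents).
T = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

# Full SVD: T = U @ diag(s) @ Vt, with U and Vt.T column-orthogonal.
U, s, Vt = np.linalg.svd(T, full_matrices=False)

# Keep only the k largest singular values (dimensionality reduction).
k = 2
T_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can now be compared in the reduced k-dimensional space, where
# related terms that never co-occur directly may still end up close.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]  # one column per document
```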
17. Resnik '95 || Lin '98 || Jiang & Conrath '97
Based on information content (IC): IC('carving fork') > IC('entity')
IC(concept) = -log p(concept)
Sim_Resnik(c1, c2) = IC(lcs(c1, c2))
Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle
Sim_Lin(c1, c2) = 2 · IC(lcs(c1, c2)) / (IC(c1) + IC(c2))
Sim_J&C(c1, c2) = 1 / (IC(c1) + IC(c2) − 2 · IC(lcs(c1, c2)))
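These measures are straightforward to try with NLTK's WordNet interface (assuming the wordnet and wordnet_ic data packages are installed); the dog/cat pair is just an illustrative choice:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus.
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1 = wn.synset('dog.n.01')
c2 = wn.synset('cat.n.01')

print(c1.res_similarity(c2, brown_ic))  # Resnik: IC of the LCS
print(c1.lin_similarity(c2, brown_ic))  # Lin: 2*IC(lcs) / (IC(c1) + IC(c2))
print(c1.jcn_similarity(c2, brown_ic))  # J&C: 1 / (IC(c1) + IC(c2) - 2*IC(lcs))
```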
18. Experiment
Task: automatically identify whether two text segments are paraphrases of each other.
Corpus: Microsoft paraphrase corpus [Dolan et al. 2004]
- 4,076 training and 1,725 test pairs
- Collected from news sources over 18 months
- Human-labelled with 83% agreement
The system labels a pair as 'paraphrase' if its score > 0.5.
Baselines: a random baseline, and vector-based cosine similarity.
21. Discussion
Only Resnik, PMI and LSA passed; on the NOT PARAPHRASE example, only cosine and Resnik got it!
- Corpus-based merits: no hand-made resources are needed.
- Knowledge-based merits: encodes fine-grained information.
Examples:
- "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
22. Improvements
The bag-of-words approach ignores important relationships between words. Hence, consider more sophisticated representations of sentence structure:
- First-order predicate logic
- Semantic parse trees
24. Second paper
Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI'07, January 2007, pp. 1606–1611.
25. Text relatedness
From words to concepts:
- 'cat' – 'mouse'
- 'preparing a manuscript' – 'writing an article'
Background knowledge is necessary; traditional approaches relied on statistical measures alone.
26. Paper Contribution
- Explicit Semantic Analysis (ESA): a new approach to representing the semantics of natural language texts using natural concepts.
- A uniform way of computing relatedness for both individual words and arbitrarily long text fragments.
- The results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art.
27. Explicit Semantic Analysis (ESA)
A fine-grained semantic representation of unrestricted natural language texts: meaning is represented in a high-dimensional space of natural concepts derived from Wikipedia.
31. The semantic interpreter iterates over the text words, retrieves the corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text.
- Centroid-based classifier [Han and Karypis, 2000]
- Cosine similarity
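A minimal sketch of this pipeline under simplifying assumptions (plain TF-IDF weights in the inverted index, no concept pruning; all function names here are hypothetical):

```python
import math
from collections import defaultdict

def build_inverted_index(concepts):
    """concepts: dict mapping article title -> list of tokens.
    Returns word -> {concept title: tf-idf weight}."""
    df = defaultdict(int)  # document frequency per word
    tf = {}                # term counts per concept
    for title, words in concepts.items():
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        tf[title] = counts
        for w in counts:
            df[w] += 1
    n = len(concepts)
    index = defaultdict(dict)
    for title, counts in tf.items():
        for w, c in counts.items():
            index[w][title] = c * math.log(n / df[w])
    return index

def interpret(text_words, index):
    """Merge per-word concept vectors into one weighted concept vector."""
    vec = defaultdict(float)
    for w in text_words:
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return vec

def relatedness(v1, v2):
    """Cosine similarity between two concept vectors."""
    dot = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Relatedness of two texts (or two single words) is then the cosine between their interpreted concept vectors, which is what makes the approach uniform across input lengths.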
34. Experiment Setup - Wikipedia
- Wikipedia XML dump of 26 March 2006: 1,187,839 articles
- Removed small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links: 241,393 articles remain
- Removed stop-words and rare words, and applied stemming: 389,202 distinct words remain
35. Experiment Setup - ODP
- Open Directory Project (ODP, http://www.dmoz.org), April 2004
- Hierarchy of over 400,000 concepts and 2,800,000 URLs
- 20,700,000 distinct terms used to represent ODP nodes as attribute vectors
36. Dataset
Word relatedness: the WordSimilarity-353 collection [Finkelstein et al., 2002]. Each pair has 13–16 human judgements, which were averaged to produce a single relatedness score.
Document relatedness: a collection of 50 documents from the Australian Broadcasting Corporation's news mail service [Lee et al., 2005]. Each of the 1,225 pairs has 8–12 human judgements.
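Evaluation on both datasets correlates system scores with the averaged human judgements; a sketch with scipy, where the scores are placeholders (rank correlation is the usual choice for the word pairs, linear correlation for the documents):

```python
from scipy.stats import pearsonr, spearmanr

human = [7.35, 8.08, 1.31, 5.77]    # averaged human judgements (placeholders)
system = [0.81, 0.90, 0.05, 0.62]   # model relatedness scores (placeholders)

rho, _ = spearmanr(human, system)   # rank correlation
r, _ = pearsonr(human, system)      # linear correlation
print(rho, r)
```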