2. First paper
Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, July 2006.
3. The problem
Judging how similar two text segments are, e.g.:
- "I own a dog." / "I have an animal."
- "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him."
- "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
4. Some applications…
- Information retrieval using the vector space model [Salton & Lesk 1971]
- Relevance feedback and text classification [Rocchio 1971]
- Word sense disambiguation [Lesk 1986; Schütze 1998]
- Extractive summarization [Salton et al. 1997]
- Automatic evaluation of machine translation [Papineni et al. 2002]
- Text summarization [Lin & Hovy 2003]
- Evaluation of text coherence [Lapata & Barzilay 2005]
5. Solution 1: lexical similarity
- Simple lexical matching
- Using the vector space model
6. Vector Space Model
SaS = Sense and Sensibility (Austen), PaP = Pride and Prejudice (Austen), WH = Wuthering Heights (Brontë)
Sim(SaS, PaP) = 0.999; Sim(SaS, WH) = 0.888
Problems: synonymy, polysemy
[source: Manning et al., IR book]
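To make the computation concrete, here is a minimal cosine-similarity sketch over raw term counts; the counts below are illustrative stand-ins, not the actual frequencies behind the 0.999 and 0.888 figures:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy term counts; real vectors would span the whole vocabulary.
sas = Counter({"affection": 115, "jealous": 10, "gossip": 2})
pap = Counter({"affection": 58, "jealous": 7, "gossip": 0})
wh = Counter({"affection": 20, "jealous": 11, "gossip": 6})

print(cosine(sas, pap))  # high: similar word usage
print(cosine(sas, wh))   # lower, yet still inflated by shared common terms
```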
7. Solution 2 (this paper)
Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based, and reduce text-to-text similarity to word-to-word similarity:
- maxSim = highest word-to-word similarity, based on one of 8 similarity measures (next slides)
- idf = inverse document frequency, capturing specificity: [collie, sheepdog] > [get, become]
- Only open-class words that share the same POS are matched.
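For reference, the combined measure of Mihalcea et al. (2006) has the following form, with maxSim and idf as defined above:

```latex
\mathrm{sim}(T_1,T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \mathrm{maxSim}(w,T_2)\,\mathrm{idf}(w)}
       {\sum_{w \in T_1} \mathrm{idf}(w)}
  + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w,T_1)\,\mathrm{idf}(w)}
         {\sum_{w \in T_2} \mathrm{idf}(w)}
\right)
```

Each direction scores how well the words of one segment are covered by the other, weighted by specificity, and the two directions are averaged.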
9. Pointwise Mutual Information (PMI)
- Unsupervised
- Based on co-occurrence in a very large corpus
- NEAR query: co-occurrence within a ten-word window
- 72.5% accuracy at identifying the correct synonym among 4 TOEFL synonym choices
Source: [Turney 2001]
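For reference, PMI for a word pair, together with the hit-count estimate that PMI-IR uses (N is the number of indexed documents):

```latex
\mathrm{PMI}(w_1,w_2) = \log_2 \frac{p(w_1,w_2)}{p(w_1)\,p(w_2)}
\approx \log_2 \frac{N \cdot \mathrm{hits}(w_1\ \mathrm{NEAR}\ w_2)}
                    {\mathrm{hits}(w_1) \cdot \mathrm{hits}(w_2)}
```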
10. PMI - Example
"When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him."
Using PMI-IR and the NEAR operator of AltaVista.
Result: 0.80 vs. cosine (0.46)
11. Latent Semantic Analysis [Landauer 1998]
Term co-occurrence is captured by means of dimensionality reduction, through singular value decomposition (SVD) of the term-by-document matrix: T = UΣV^T, truncated to the k largest singular values as T_k = U_k Σ_k V_k^T.
- T = term-by-document matrix
- Σ_k = diagonal k × k matrix of singular values
- U and V are column-orthogonal matrices
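A minimal numpy sketch of the truncated SVD step (the matrix here is a toy example):

```python
import numpy as np

# Toy term-by-document matrix T (rows = terms, columns = documents).
T = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

# Full SVD: T = U @ diag(s) @ Vt, with U and Vt.T column-orthogonal.
U, s, Vt = np.linalg.svd(T, full_matrices=False)

# Keep only the k largest singular values (dimensionality reduction).
k = 2
T_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can now be compared in the reduced k-dimensional space, where
# related terms that never co-occur directly may still end up close.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]  # one column per document
```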
17. Resnik '95 || Lin '98 || Jiang & Conrath '97
Based on information content (IC): IC('carving fork') > IC('entity')
IC(concept) = -log p(concept)
Sim_Resnik(c1, c2) = IC(lcs(c1, c2))
Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle
Sim_Lin(c1, c2) = 2 · IC(lcs(c1, c2)) / (IC(c1) + IC(c2))
Sim_J&C(c1, c2) = 1 / (IC(c1) + IC(c2) − 2 · IC(lcs(c1, c2)))
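These measures are straightforward to try with NLTK's WordNet interface (assuming the wordnet and wordnet_ic data packages are installed); the dog/cat pair is just an illustrative choice:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus.
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1 = wn.synset('dog.n.01')
c2 = wn.synset('cat.n.01')

print(c1.res_similarity(c2, brown_ic))  # Resnik: IC of the LCS
print(c1.lin_similarity(c2, brown_ic))  # Lin: 2*IC(lcs) / (IC(c1) + IC(c2))
print(c1.jcn_similarity(c2, brown_ic))  # J&C: 1 / (IC(c1) + IC(c2) - 2*IC(lcs))
```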
18. Experiment
Task: automatically identify whether two text segments are paraphrases of each other.
Corpus: Microsoft paraphrase corpus [Dolan et al. 2004]
- 4,076 training and 1,725 test pairs
- Collected from news sources over 18 months
- Human-labelled with 83% agreement
The system labels a pair as 'paraphrase' if its score > 0.5.
Baselines: a random baseline, and vector-based cosine similarity.
21. Discussion
Only Resnik, PMI and LSA passed; on the NOT PARAPHRASE example, only cosine and Resnik got it!
- Corpus-based merits: no hand-made resources are needed.
- Knowledge-based merits: encodes fine-grained information.
Examples:
- "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
22. Improvements
The bag-of-words approach ignores important relationships between words. Hence, consider more sophisticated representations of sentence structure:
- First-order predicate logic
- Semantic parse trees
24. Second paper
Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI'07, January 2007, pp. 1606–1611.
25. Text relatedness
From words to concepts:
- 'cat' – 'mouse'
- 'preparing a manuscript' – 'writing an article'
Background knowledge is necessary; traditional approaches relied on statistical measures alone.
26. Paper Contribution
- Explicit Semantic Analysis (ESA): a new approach to representing the semantics of natural language texts using natural concepts.
- A uniform way of computing relatedness for both individual words and arbitrarily long text fragments.
- The results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art.
27. Explicit Semantic Analysis (ESA)
A fine-grained semantic representation of unrestricted natural language texts: meaning is represented in a high-dimensional space of natural concepts derived from Wikipedia.
31. The semantic interpreter iterates over the text words, retrieves the corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text.
- Centroid-based classifier [Han and Karypis, 2000]
- Cosine similarity
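A minimal sketch of this pipeline under simplifying assumptions (plain TF-IDF weights in the inverted index, no concept pruning; all function names here are hypothetical):

```python
import math
from collections import defaultdict

def build_inverted_index(concepts):
    """concepts: dict mapping article title -> list of tokens.
    Returns word -> {concept title: tf-idf weight}."""
    df = defaultdict(int)  # document frequency per word
    tf = {}                # term counts per concept
    for title, words in concepts.items():
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        tf[title] = counts
        for w in counts:
            df[w] += 1
    n = len(concepts)
    index = defaultdict(dict)
    for title, counts in tf.items():
        for w, c in counts.items():
            index[w][title] = c * math.log(n / df[w])
    return index

def interpret(text_words, index):
    """Merge per-word concept vectors into one weighted concept vector."""
    vec = defaultdict(float)
    for w in text_words:
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return vec

def relatedness(v1, v2):
    """Cosine similarity between two concept vectors."""
    dot = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Relatedness of two texts (or two single words) is then the cosine between their interpreted concept vectors, which is what makes the approach uniform across input lengths.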
34. Experiment Setup - Wikipedia
- Wikipedia XML dump of 26 March 2006: 1,187,839 articles
- Removed small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links: 241,393 articles remain
- Removed stop-words and rare words, and applied stemming: 389,202 distinct words remain
35. Experiment Setup - ODP
- Open Directory Project (ODP, http://www.dmoz.org), April 2004
- Hierarchy of over 400,000 concepts and 2,800,000 URLs
- 20,700,000 distinct terms used to represent ODP nodes as attribute vectors
36. Dataset
Word relatedness: the WordSimilarity-353 collection [Finkelstein et al., 2002]. Each pair has 13–16 human judgements, which were averaged to produce a single relatedness score.
Document relatedness: a collection of 50 documents from the Australian Broadcasting Corporation's news mail service [Lee et al., 2005]. Each of the 1,225 pairs has 8–12 human judgements.
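Evaluation on both datasets correlates system scores with the averaged human judgements; a sketch with scipy, where the scores are placeholders (rank correlation is the usual choice for the word pairs, linear correlation for the documents):

```python
from scipy.stats import pearsonr, spearmanr

human = [7.35, 8.08, 1.31, 5.77]    # averaged human judgements (placeholders)
system = [0.81, 0.90, 0.05, 0.62]   # model relatedness scores (placeholders)

rho, _ = spearmanr(human, system)   # rank correlation
r, _ = pearsonr(human, system)      # linear correlation
print(rho, r)
```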