Text Similarity
Abdul-Baquee Sharaf
11 Feb 2010
First paper
Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, July 2006.
The problem
- "I own a dog" / "I have an animal"
- "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him" / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him"
- "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
Some applications…
- Information retrieval using the vector space model [Salton & Lesk 1971]
- Relevance feedback and text classification [Rocchio 1971]
- Word sense disambiguation [Lesk 1986; Schütze 1998]
- Extractive summarization [Salton et al. 1997]
- Automatic evaluation of machine translation [Papineni et al. 2002]
- Text summarization [Lin & Hovy 2003]
- Evaluation of text coherence [Lapata & Barzilay 2005]
Solution 1: Lexical similarity
- Simple lexical matching
- Using the vector space model
Vector Space Model
- SaS = Sense and Sensibility (Austen); PaP = Pride and Prejudice (Austen); WH = Wuthering Heights (Brontë)
- Sim(SaS, PaP) = 0.999; Sim(SaS, WH) = 0.888
- Problems: synonymy, polysemy
[source: Manning et al., IR book]
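A minimal sketch of this lexical baseline, assuming plain term-count vectors and whitespace tokenization (the Austen/Brontë scores above come from Manning et al.'s worked example, not from this code):

```python
from collections import Counter
import math

def cosine_sim(text1: str, text2: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors."""
    v1, v2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_sim("I own a dog", "I have an animal"))  # low: only "a" is shared
```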
Solution 2 (this paper)
- Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based
- Reduce text-to-text similarity to word-to-word similarity:

sim(T_1, T_2) = \frac{1}{2}\left( \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)} \right)

- maxSim(w, T) = highest word-to-word similarity between w and any word of T, based on one of eight similarity measures (next slides)
- idf = inverse document frequency, a measure of specificity: [collie, sheepdog] > [get, become]
- Only open-class words that share the same POS are compared
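A sketch of that combination formula, assuming a caller-supplied word_sim function (any of the eight measures that follow) and an idf lookup dict; the paper's POS filtering is omitted for brevity:

```python
def mihalcea_sim(t1_words, t2_words, word_sim, idf):
    """Text-to-text similarity of Mihalcea et al. (2006): for each word,
    take its best match in the other text, weight by idf, and average
    the two directions. Assumes both word lists are non-empty."""
    def directional(src, tgt):
        num = sum(max(word_sim(w, v) for v in tgt) * idf.get(w, 0.0) for w in src)
        den = sum(idf.get(w, 0.0) for w in src)
        return num / den if den else 0.0

    return 0.5 * (directional(t1_words, t2_words) + directional(t2_words, t1_words))
```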
Approaches
Pointwise Mutual Information (PMI)
- Unsupervised
- Based on co-occurrence in a very large corpus
- NEAR query: co-occurrence within a ten-word window
- 72.5% accuracy at identifying the correct synonym out of 4 TOEFL synonym choices
Source: [Turney 2001]
PMI: Example
- "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him"
- "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him"
Using PMI-IR and the NEAR operator of AltaVista.
Result: 0.80, vs. 0.46 for cosine.
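PMI-IR estimates PMI(w1, w2) = log2 [ p(w1 NEAR w2) / (p(w1) p(w2)) ] from search-engine hit counts; here is a minimal sketch over an in-memory token list instead of AltaVista, assuming both words occur in the corpus:

```python
import math
from collections import Counter

def pmi(w1: str, w2: str, tokens: list, window: int = 10) -> float:
    """PMI(w1, w2) = log2 p(w1, w2) / (p(w1) p(w2)), with co-occurrence
    counted inside a ten-word window (the NEAR query of the slide)."""
    n = len(tokens)
    counts = Counter(tokens)
    near = sum(1 for i, t in enumerate(tokens)
               if t == w1 and w2 in tokens[max(0, i - window): i + window + 1])
    if near == 0:
        return float("-inf")
    return math.log2((near / n) / ((counts[w1] / n) * (counts[w2] / n)))
```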
Latent Semantic Analysis [Landauer 1998]
- Term co-occurrence is captured by dimensionality reduction, via singular value decomposition of the term-by-document matrix T
- T = U Σ V^T; keeping only the top k singular values gives the rank-k approximation T_k = U_k Σ_k V_k^T, where Σ_k is a diagonal k × k matrix and U and V are column-orthogonal matrices
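A minimal sketch of that truncation with NumPy, on a toy term-by-document matrix:

```python
import numpy as np

def lsa_doc_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Truncated SVD: T ~ U_k S_k V_k^T. Each column of S_k V_k^T gives a
    document's coordinates in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]  # shape (k, n_docs)

# toy 5-term x 4-document count matrix
T = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
docs_2d = lsa_doc_vectors(T, k=2)
```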
Example [source: Manning et al., IR book]
Lesk [1986]
The similarity of two concepts is defined as a function of the overlap between their corresponding dictionary definitions (glosses).
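A minimal gloss-overlap sketch in the spirit of Lesk, assuming hypothetical dictionary glosses and counting shared word types:

```python
def lesk_overlap(gloss1: str, gloss2: str) -> int:
    """Count the words shared by two definitions (a crude overlap measure)."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

# hypothetical glosses, for illustration only
print(lesk_overlap("a domesticated carnivorous mammal kept as a pet",
                   "a small domesticated carnivorous feline kept as a pet"))
```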
WordNet Hierarchy
What is lcs(bike, truck)? (lcs = least common subsumer, the most specific shared ancestor in the hierarchy)
Source: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx
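A quick way to answer that with NLTK's WordNet interface (assumes the nltk package and its wordnet corpus are installed; synset names are illustrative):

```python
from nltk.corpus import wordnet as wn

bike = wn.synset("bicycle.n.01")
truck = wn.synset("truck.n.01")
print(bike.lowest_common_hypernyms(truck))  # e.g. [Synset('wheeled_vehicle.n.01')]
```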
Leacock & Chodorow [1998]

\mathrm{Sim}_{L\&C}(c_1, c_2) = -\log \frac{\mathrm{length}(c_1, c_2)}{2D}

where length = length of the shortest path between the two concepts (node counting) and D = maximum depth of the taxonomy.
Wu & Palmer [1994]

\mathrm{Sim}_{W\&P}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
Resnik '95 / Lin '98 / Jiang & Conrath '97
- Based on information content (IC): IC('carving fork') > IC('entity')
- IC(concept) = -log p(concept)

\mathrm{Sim}_{Resnik}(c_1, c_2) = \mathrm{IC}(\mathrm{lcs}(c_1, c_2))

- Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle, so Resnik scores all such pairs identically

\mathrm{Sim}_{Lin}(c_1, c_2) = \frac{2 \cdot \mathrm{IC}(\mathrm{lcs}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}

\mathrm{Sim}_{J\&C}(c_1, c_2) = \frac{1}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \cdot \mathrm{IC}(\mathrm{lcs}(c_1, c_2))}
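All five WordNet-based measures from the last three slides are available in NLTK; a sketch, assuming the wordnet and wordnet_ic corpora are installed:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

ic = wordnet_ic.ic("ic-brown.dat")  # information content estimated from Brown
c1, c2 = wn.synset("car.n.01"), wn.synset("truck.n.01")

print(c1.lch_similarity(c2))      # Leacock & Chodorow
print(c1.wup_similarity(c2))      # Wu & Palmer
print(c1.res_similarity(c2, ic))  # Resnik
print(c1.lin_similarity(c2, ic))  # Lin
print(c1.jcn_similarity(c2, ic))  # Jiang & Conrath
```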
Experiment
- Task: automatically identify whether two text segments are paraphrases of each other
- Corpus: Microsoft paraphrase corpus [Dolan et al. 2004]
  - 4,076 training and 1,725 test pairs
  - News sources over 18 months
  - Human-labelled, with 83% inter-annotator agreement
- The system labels a pair as 'paraphrase' if its score > 0.5
- Baselines: a random baseline, and a vector-based baseline using cosine similarity
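The decision rule is a fixed threshold; a minimal sketch, assuming any of the text-similarity functions above as sim:

```python
def label_pairs(pairs, sim, threshold=0.5):
    """Label each (text1, text2) pair as a paraphrase iff sim > threshold."""
    return [sim(a, b) > threshold for a, b in pairs]

def accuracy(predicted, gold):
    """Fraction of pairs whose predicted label matches the human label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```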
Results
- Identified similarities: 18,000
- Lexical matches: 14,500
- Semantic similarity: 3,500
Results [cont'd]
Pearson correlation (per-measure correlation table not reproduced here)
Discussion
- On the PARAPHRASE pair below, only Resnik, PMI, and LSA passed:
  "Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
- On the NOT PARAPHRASE pair below, only cosine and Resnik got it right:
  "The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
- Corpus-based merits: no hand-made resources are needed
- Knowledge-based merits: encodes fine-grained information
Improvements
- The bag-of-words approach ignores important relationships between words
- Hence, consider more sophisticated representations of sentence structure:
  - First-order predicate logic
  - Semantic parse trees
Any questions before moving to the next paper…
Second paper
Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), January, pp. 1606-1611.
Text relatedness
- From words to concepts:
  - 'cat' – 'mouse'
  - 'preparing a manuscript' – 'writing an article'
- Background knowledge is necessary
- Traditional approaches have relied on statistical measures
Paper contribution
- Explicit Semantic Analysis (ESA): a new approach to representing the semantics of natural language texts using natural concepts
- A uniform way of computing relatedness for both individual words and arbitrarily long text fragments
- Results for semantic relatedness of texts that are superior to the existing state of the art
Explicit Semantic Analysis (ESA)
- A fine-grained semantic representation of unrestricted natural language texts
- Represents meaning in a high-dimensional space of natural concepts derived from Wikipedia
Architecture
- Given a text fragment, first represent it as a vector using the TF-IDF scheme
- The semantic interpreter iterates over the text words, retrieves the corresponding entries from the inverted index (word → Wikipedia concepts), and merges them into a weighted vector of concepts that represents the given text
- Concept vectors are built with a centroid-based classifier [Han and Karypis, 2000]
- Relatedness of two texts is the cosine similarity of their concept vectors
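A minimal sketch of that interpreter, assuming inverted_index[word] maps each word to a dict {concept: TF-IDF weight of the word in that concept's article}; the names and shapes here are illustrative, not the paper's code:

```python
import math
from collections import defaultdict

def esa_vector(text: str, inverted_index: dict, idf: dict) -> dict:
    """Merge the concept entries of each text word into one weighted
    concept vector (the ESA semantic interpretation of the text)."""
    concept_vec = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in inverted_index.get(word, {}).items():
            concept_vec[concept] += idf.get(word, 0.0) * weight
    return concept_vec

def esa_relatedness(v1: dict, v2: dict) -> float:
    """Cosine similarity between two concept vectors."""
    dot = sum(w * v2.get(c, 0.0) for c, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```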
Works well on text segments…
…as well as on ambiguous words
Experiment setup: Wikipedia
- Wikipedia XML dump of 26 March 2006: 1,187,839 articles
- Remove small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links: 241,393 articles remain
- Remove stop words and rare words, and apply stemming: 389,202 distinct words remain
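The article filter as stated on the slide, as a sketch; the slide is ambiguous about whether the link threshold applies to incoming and outgoing links separately or combined, so this assumes a combined count:

```python
def keep_article(word_count: int, in_links: int, out_links: int) -> bool:
    """Keep a Wikipedia article as an ESA concept only if it is not
    'small': at least 100 words and at least 5 links (assumption:
    incoming and outgoing links counted together)."""
    return word_count >= 100 and (in_links + out_links) >= 5
```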
Experiment setup: ODP
- Open Directory Project (ODP, http://www.dmoz.org), April 2004 snapshot
- A hierarchy of over 400,000 concepts and 2,800,000 URLs
- 20,700,000 distinct terms used to represent ODP nodes as attribute vectors
