Text Similarity
  Abdul-Baquee Sharaf
  11-Feb-2010
First paper
  Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI’06, July 2006.
The problem
  (1) I own a dog
      I have an animal
  (2) When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him
      When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him
  (3) Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.
      Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
  (4) The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
      The man was trapped about 250 feet from the shore, right at the edge of the falls.
Some applications…
  Information retrieval using the vector space model [Salton & Lesk 1971]
  Relevance feedback and text classification [Rocchio 1971]
  Word sense disambiguation [Lesk 1986; Schütze 1998]
  Extractive summarization [Salton et al 1997]
  Automatic evaluation of machine translation [Papineni et al 2002]
  Text summarization [Lin & Hovy 2003]
  Evaluation of text coherence [Lapata & Barzilay 2005]
Solution 1: Lexical similarity
  Simple lexical matching
  Using the vector space model
Vector Space Model
  SaS = Sense and Sensibility (Austen)
  PaP = Pride and Prejudice (Austen)
  WH = Wuthering Heights (Brontë)
  Sim(SaS, PaP) = 0.999
  Sim(SaS, WH) = 0.888
  Problems: synonymy, polysemy
  [source: Manning et al, IR book]
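To make the synonymy problem concrete, here is a minimal sketch of cosine similarity over raw term-frequency vectors. The whitespace tokenizer and the unweighted counts are simplifying assumptions, not the weighting used for the figures quoted from the IR book.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two sentences share only the token "i", so the score stays low
# even though "own"/"have" and "dog"/"animal" are closely related.
print(cosine_similarity("I own a dog", "I have an animal"))
```

Because only identical tokens match, related words such as "dog" and "animal" contribute nothing, which is the gap the next approach tries to close.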
Solution 2 (this paper)
  Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based
  Reduce text similarity to word-to-word similarity
  maxSim = highest word-to-word similarity, based on one of 8 similarity measures (next slides)
  idf = inverse document frequency = specificity: a match on [collie, sheepdog] counts for more than a match on [get, become]
  Only open-class words that share the same part of speech (POS) are compared
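The combination of maxSim and idf can be sketched as follows: for each word in one text, take its best match in the other text, weight it by idf, normalise by the sum of idf weights, and average the two directions. This is a minimal sketch of that scoring scheme as described in the cited paper. The toy similarity table, the idf values, and the default idf of 1.0 are assumptions made for illustration, and the same-POS restriction is left out for brevity.

```python
from typing import Callable, Dict, List

WordSim = Callable[[str, str], float]


def directional_score(source: List[str], target: List[str],
                      word_sim: WordSim, idf: Dict[str, float]) -> float:
    """idf-weighted average of each source word's best match in target."""
    weighted_sum = 0.0
    idf_sum = 0.0
    for word in source:
        max_sim = max((word_sim(word, other) for other in target), default=0.0)
        weight = idf.get(word, 1.0)  # assumption: unseen words get idf 1.0
        weighted_sum += max_sim * weight
        idf_sum += weight
    return weighted_sum / idf_sum if idf_sum else 0.0


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Average of the two directional scores, so the measure is symmetric."""
    return 0.5 * (directional_score(t1, t2, word_sim, idf)
                  + directional_score(t2, t1, word_sim, idf))


# Hypothetical word-to-word scores and idf values, for illustration only.
TOY_SIM = {frozenset(("own", "have")): 0.9, frozenset(("dog", "animal")): 0.8}
TOY_IDF = {"own": 1.0, "have": 1.0, "dog": 2.0, "animal": 2.0}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


score = text_similarity(["own", "dog"], ["have", "animal"], toy_word_sim, TOY_IDF)
print(round(score, 3))
```

Any of the eight word-to-word measures could be plugged in as `word_sim`; the text-level function does not need to know which one is used.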
Approaches
Pointwise Mutual Information (PMI)
  Unsupervised
  Based on co-occurrence in very large corpora
  NEAR query: co-occurrence within a ten-word window
  72.5% accuracy at identifying the correct synonym out of 4 TOEFL synonym choices
  Source: [Turney 2001]
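As a rough illustration, PMI can be estimated from hit counts: log2 of the observed co-occurrence probability divided by the probability expected if the two words were independent. The counts below are made up, and the function name and parameters are assumptions for this sketch rather than any search engine's API.

```python
import math


def pmi(hits_both: int, hits_w1: int, hits_w2: int, total_docs: int) -> float:
    """log2( p(w1 & w2) / (p(w1) * p(w2)) ), with probabilities as hits / total."""
    if min(hits_both, hits_w1, hits_w2, total_docs) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    # Algebraically the same ratio, arranged to stay in integer arithmetic
    # until the final division.
    return math.log2(hits_both * total_docs / (hits_w1 * hits_w2))


# Hypothetical counts: the words co-occur 250 times more often than chance.
print(pmi(hits_both=500, hits_w1=1000, hits_w2=2000, total_docs=1_000_000))
```

A score of zero means the two words co-occur exactly as often as independence predicts; positive scores indicate association.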
PMI - Example
  When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him
  When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him
  Using PMI-IR and the NEAR operator of AltaVista
  Result: 0.80 vs. cosine (0.46)
Latent Semantic Analysis [Landauer 1998]
  Term co-occurrences are captured by means of dimensionality reduction, through singular value decomposition (SVD) of the term-by-document matrix T
  T = U Σk V^T
  T = term-by-document matrix
  Σk = diagonal k x k matrix
  U and V are column-orthogonal matrices
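The decomposition can be tried on a small matrix with NumPy. This is a sketch under stated assumptions: the term-by-document matrix is a made-up toy example, and k = 2 is chosen arbitrarily. Documents are compared in the reduced space by cosine similarity.

```python
import numpy as np

# Hypothetical toy term-by-document matrix T (rows: terms, columns: documents).
T = np.array([
    [1, 1, 0, 0],  # ship
    [1, 0, 0, 0],  # boat
    [0, 1, 0, 0],  # ocean
    [0, 0, 1, 1],  # tree
    [0, 0, 1, 0],  # wood
], dtype=float)

# Thin SVD: T = U @ diag(s) @ Vt, singular values in s sorted descending.
U, s, Vt = np.linalg.svd(T, full_matrices=False)

k = 2  # number of latent dimensions kept (an arbitrary choice here)
T_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of T
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


print("doc0 vs doc1:", cosine(doc_vectors[0], doc_vectors[1]))
print("doc0 vs doc2:", cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy case the first two documents share only one of their terms, yet they end up close together in the reduced space because their terms co-occur.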
Example [source: Manning et al, IR book]
Lesk [1986]
  The similarity of two concepts is defined as a function of the overlap between their definitions
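A minimal sketch of the overlap idea, counting the content words two definitions share. The glosses are paraphrased for illustration and are not quoted from any particular dictionary; the stop-word list and whitespace tokenizer are likewise assumptions.

```python
# Small illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "in", "to", "with", "which", "that"}


def gloss_overlap(gloss_a: str, gloss_b: str) -> int:
    """Number of distinct non-stop-words shared by two definitions."""
    words_a = {w for w in gloss_a.lower().split() if w not in STOPWORDS}
    words_b = {w for w in gloss_b.lower().split() if w not in STOPWORDS}
    return len(words_a & words_b)


# Illustrative glosses for "pine" and for two senses of "cone".
pine = "kind of evergreen tree with needle-shaped leaves"
cone_fruit = "fruit of certain evergreen tree"
cone_shape = "solid body which narrows to a point"

print(gloss_overlap(pine, cone_fruit))  # shares "evergreen" and "tree"
print(gloss_overlap(pine, cone_shape))  # shares nothing
```

The sense whose definition overlaps most with the context is preferred, which is how the measure doubles as a word sense disambiguation method.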
WordNet Hierarchy
  What is lcs(bike, truck)? (lcs = least common subsumer)
  Source: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx
Leacock & Chodorow [1998]
  SimL&C(c1,c2) = -log(length / (2 * D))
  length = length of the shortest path between the two concepts, using node counting
  D = maximum depth of the taxonomy
Wu & Palmer ’94
  SimW&P(c1,c2) = 2 * depth(lcs(c1,c2)) / (depth(c1) + depth(c2))
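The two path-based measures can be sketched on a tiny hand-made taxonomy. Everything here is an assumption made for illustration: the hierarchy is not the real WordNet graph, depth is counted in nodes with the root at depth 1, and the formulas used are the standard ones, -log(length / (2 * D)) for Leacock & Chodorow and 2 * depth(lcs) / (depth(c1) + depth(c2)) for Wu & Palmer.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT the real WordNet hierarchy.
PARENT = {
    "vehicle": "entity",
    "wheeled_vehicle": "vehicle",
    "bike": "wheeled_vehicle",
    "motor_vehicle": "wheeled_vehicle",
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
}


def path_to_root(concept: str) -> list:
    """The concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept: str) -> int:
    """Depth by node counting: the root has depth 1."""
    return len(path_to_root(concept))


MAX_DEPTH = max(depth(c) for c in list(PARENT) + ["entity"])


def lcs(c1: str, c2: str) -> str:
    """Least common subsumer: the deepest ancestor shared by both concepts."""
    ancestors_of_c2 = set(path_to_root(c2))
    for node in path_to_root(c1):
        if node in ancestors_of_c2:
            return node
    raise ValueError("concepts share no ancestor")


def path_length(c1: str, c2: str) -> int:
    """Number of nodes on the shortest path between c1 and c2."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


def leacock_chodorow(c1: str, c2: str) -> float:
    return -math.log(path_length(c1, c2) / (2 * MAX_DEPTH))


def wu_palmer(c1: str, c2: str) -> float:
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


print(lcs("bike", "truck"))
print(wu_palmer("bike", "truck"), wu_palmer("car", "truck"))
print(leacock_chodorow("bike", "truck"), leacock_chodorow("car", "truck"))
```

In this toy hierarchy both measures rank car and truck as closer than bike and truck, because their least common subsumer sits deeper and the connecting path is shorter.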
Resnik ’95 || Lin ’98 || Jiang & Conrath ’97
  Based on information content (IC)
  IC(‘carving fork’) > IC(‘entity’)
  IC(concept) = -log(p(concept))
  SimResnik(c1,c2) = IC(lcs(c1,c2))
  Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle, so every pair of these concepts gets the same Resnik score
  SimLin(c1,c2) = 2 * IC(lcs(c1,c2)) / (IC(c1) + IC(c2))
  SimJ&C(c1,c2) = 1 / (IC(c1) + IC(c2) - 2 * IC(lcs(c1,c2)))
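The three information-content measures can be compared side by side in a short sketch. The concept probabilities are invented for illustration, and the least common subsumer is passed in by hand rather than looked up in a taxonomy. The Lin and Jiang & Conrath formulas used are the standard ones: 2 * IC(lcs) / (IC(c1) + IC(c2)) and 1 / (IC(c1) + IC(c2) - 2 * IC(lcs)).

```python
import math

# Hypothetical concept probabilities (NOT estimated from any real corpus).
P = {
    "entity": 1.0,
    "vehicle": 0.05,
    "jumbo_jet": 0.001,
    "tank": 0.002,
    "house_trailer": 0.004,
}


def ic(concept: str) -> float:
    """Information content: rarer concepts carry more information."""
    return -math.log(P[concept])


def sim_resnik(c1: str, c2: str, subsumer: str) -> float:
    # Depends only on the subsumer, which is exactly the problem noted above.
    return ic(subsumer)


def sim_lin(c1: str, c2: str, subsumer: str) -> float:
    return 2 * ic(subsumer) / (ic(c1) + ic(c2))


def sim_jc(c1: str, c2: str, subsumer: str) -> float:
    distance = ic(c1) + ic(c2) - 2 * ic(subsumer)
    return float("inf") if distance == 0 else 1 / distance


for a, b in [("jumbo_jet", "tank"), ("tank", "house_trailer")]:
    print(a, b,
          round(sim_resnik(a, b, "vehicle"), 3),
          round(sim_lin(a, b, "vehicle"), 3),
          round(sim_jc(a, b, "vehicle"), 3))
```

Both pairs receive the same Resnik score because they share the subsumer "vehicle", while Lin and Jiang & Conrath separate them by also using the information content of the concepts themselves.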
Experiment
  Automatically identify whether two text segments are paraphrases of each other
  Corpus: Microsoft paraphrase corpus [Dolan et al 2004]
  4,076 training pairs and 1,725 test pairs
  News sources collected over 18 months
  Human-labelled, with 83% agreement
  The system labels a pair as ‘paraphrase’ if score > 0.5
  Baselines:
    Random baseline
    Vector-based, using cosine similarity
Results
  Identified similarities: 18,000
  Lexical matched: 14,500
  Semantic similarity: 3,500
Results [cont’d]
  Pearson correlation
Discussion
  Only Resnik, PMI and LSA passed:
  NOT PARAPHRASE: only cosine and Resnik got it!
  Corpus-based merits: no hand-made resources are needed
  Knowledge-based merits: encode fine-grained information
  Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.
  Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
  The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
  The man was trapped about 250 feet from the shore, right at the edge of the falls.
Improvements
  The bag-of-words approach ignores important relationships
  Hence, consider more sophisticated representations of sentence structure:
    First-order predicate logic
    Semantic parse tree
Any questions before moving to the next paper?
Second paper
  Gabrilovich, E.; Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence, January, pp. 1606-1611.
Text relatedness
  From words to concepts
  ‘cat’ – ‘mouse’
  ‘preparing a manuscript’ – ‘writing an article’
  Background knowledge is necessary
  Traditional approaches used to rely on statistical measures
Paper Contribution
  Explicit Semantic Analysis (ESA): a new approach to representing the semantics of natural language texts using natural concepts
  A uniform way of computing the relatedness of both individual words and arbitrarily long text fragments
  The results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art
Explicit Semantic Analysis (ESA)
  Fine-grained semantic representation of unrestricted natural language texts
  Represents meaning in a high-dimensional space of natural concepts derived from Wikipedia
Architecture
  Given a text fragment:
    The semantic interpreter iterates over the text words,
    retrieves the corresponding entries from the inverted index,
    and merges them into a weighted vector of concepts that represents the given text.
  Centroid-based classifier [Han and Karypis, 2000]
  Cosine similarity
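The interpreter loop described on this slide can be sketched in a few lines. The inverted index below is a hand-made toy with invented concept names and weights, not data derived from Wikipedia, and words are weighted by raw term frequency as a simplification.

```python
import math
from collections import Counter, defaultdict
from typing import Dict

# Hypothetical inverted index: word -> {concept: weight}. Toy values only.
INVERTED_INDEX: Dict[str, Dict[str, float]] = {
    "cat": {"Cat": 0.9, "Pet": 0.5, "Tom and Jerry": 0.4},
    "mouse": {"Mouse": 0.9, "Computer mouse": 0.7, "Tom and Jerry": 0.5},
    "keyboard": {"Computer keyboard": 0.9, "Computer mouse": 0.3},
}


def interpret(text: str) -> Dict[str, float]:
    """Merge the index entries of every word into one weighted concept vector."""
    concept_vector: Dict[str, float] = defaultdict(float)
    for word, tf in Counter(text.lower().split()).items():
        for concept, weight in INVERTED_INDEX.get(word, {}).items():
            concept_vector[concept] += tf * weight
    return dict(concept_vector)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relatedness(text_a: str, text_b: str) -> float:
    return cosine(interpret(text_a), interpret(text_b))


print(relatedness("cat", "mouse"))        # related through a shared concept
print(relatedness("mouse", "keyboard"))   # related through another concept
print(relatedness("cat", "keyboard"))     # no shared concept in this toy index
```

Because the same function handles a single word or a whole sentence, words and longer fragments are compared in one uniform concept space.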
Works well on text segments...
...as well as on ambiguous words
Experiment Setup - Wikipedia
  Wikipedia XML dump of 26 March 2006
  1,187,839 articles
  Remove small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links
  Remaining: 241,393 articles
  Remove stop words and rare words, and use stemming
  Remaining: 389,202 words
Experiment Setup - ODP
  Open Directory Project (ODP, http://www.dmoz.org), April 2004
  Hierarchy of over 400,000 concepts and 2,800,000 URLs
  20,700,000 distinct terms used to represent ODP nodes as attribute vectors

Text Similarity

  • 2. First paper: Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI’06, July 2006.
  • 3. The problem. Example pairs: (1) “I own a dog” / “I have an animal”. (2) “When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him” / “When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him”. (3) “Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.” / “Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.” (4) “The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.” / “The man was trapped about 250 feet from the shore, right at the edge of the falls.”
  • 4. Some applications: information retrieval using the vector space model [Salton & Lesk 1971]; relevance feedback and text classification [Rocchio 1971]; word sense disambiguation [Lesk 1986; Schütze 1998]; extractive summarization [Salton et al 1997]; automatic evaluation of machine translation [Papineni et al 2002]; text summarization [Lin & Hovy 2003]; evaluation of text coherence [Lapata & Barzilay 2005].
  • 5. Solution 1: lexical similarity. Simple lexical matching; using the vector space model.
  • 6. Vector Space Model. SaS = Sense and Sensibility (Austen); PaP = Pride and Prejudice (Austen); WH = Wuthering Heights (Brontë). Sim(SaS, PaP) = 0.999; Sim(SaS, WH) = 0.888. Problems: synonymy, polysemy. [Source: Manning et al., IR book]
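The Sim(·,·) values on the vector space slide are cosine similarities between term vectors. A minimal sketch, assuming whitespace tokenization and raw term counts (the IR book's figures use its own weighting, which is not reproduced here):

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between raw term-frequency vectors of two texts."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Dot product only needs the terms that both texts share.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("I own a dog", "I have an animal"))
```

The two sentences share only the word "I", so the score stays low even though they mean nearly the same thing, which is the synonymy problem the slide points to.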
  • 7. Solution 2 (this paper). Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based. Reduce text similarity to word-to-word similarity. maxSim = highest word-to-word similarity based on one of 8 similarity measures (next slides). idf = inverse document frequency, used as a measure of word specificity: [collie, sheepdog] > [get, become]. Only open-class words that share the same part of speech (POS) are compared.
  • 9. Pointwise Mutual Information (PMI). Unsupervised. Based on co-occurrence in a very large corpus. NEAR query: co-occurrence within a ten-word window. 72.5% accuracy on identifying the correct synonym out of 4 TOEFL synonym choices. Source: [Turney 2001]
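PMI compares how often two words co-occur with how often they would co-occur if they were independent. A minimal sketch computed from counts; in PMI-IR the counts would be search-engine hit counts (with the joint count coming from a NEAR query), whereas the numbers used here are made up for illustration.

```python
import math


def pmi(joint_count: int, count_w1: int, count_w2: int, total: int) -> float:
    """log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities as count / total."""
    if joint_count == 0 or count_w1 == 0 or count_w2 == 0:
        return float("-inf")  # never seen together (or not seen at all)
    p_joint = joint_count / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return math.log2(p_joint / (p_w1 * p_w2))


if __name__ == "__main__":
    # Hypothetical counts: the pair co-occurs ten times more often than chance.
    print(pmi(joint_count=10, count_w1=100, count_w2=100, total=10_000))
```

A value near 0 means the words co-occur about as often as chance predicts; larger positive values indicate stronger association.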
  • 10. PMI example. “When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him” / “When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him”. Using PMI-IR and the NEAR operator of AltaVista. Result: 0.80, vs. 0.46 for cosine.
  • 11. Latent Semantic Analysis [Landauer 1998]. Term co-occurrences are captured by means of dimensionality reduction through singular value decomposition (SVD) of the term-by-document matrix T. Σk = diagonal k × k matrix; U and V are column-orthogonal matrices.
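The symbols on the LSA slide fit the usual way of writing the decomposition. As a sketch in standard notation (the reduced rank k' is a name introduced here, not taken from the slide):

```latex
T = U \, \Sigma_k \, V^{\top},
\qquad \Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k

% Dimensionality reduction: keep only the k' << k largest singular values
T \approx U \, \Sigma_{k'} \, V^{\top}
```

Similarity is then computed between vectors in the reduced k'-dimensional space rather than in the original term space.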
  • 13. Lesk [1986]. Similarity of two concepts is defined as a function of the overlap between the corresponding definitions.
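One simple reading of "overlap between definitions" is the number of content words two glosses share. The sketch below assumes that reading; the stopword list and the two glosses are hypothetical, not dictionary entries.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "or", "and", "to", "in", "for", "with"})


def lesk_overlap(gloss_a: str, gloss_b: str) -> int:
    """Number of distinct non-stopword tokens that appear in both glosses."""
    words_a = set(gloss_a.lower().split()) - STOPWORDS
    words_b = set(gloss_b.lower().split()) - STOPWORDS
    return len(words_a & words_b)


if __name__ == "__main__":
    # Hypothetical glosses, for illustration only.
    print(lesk_overlap("a motor vehicle with four wheels",
                       "a wheeled vehicle with two wheels"))
```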
  • 14. WordNet hierarchy. What is lcs(bike, truck), the least common subsumer of the two concepts? Source: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx
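Finding the least common subsumer amounts to walking up from both concepts until the paths meet. A minimal sketch on a hypothetical toy taxonomy: the hierarchy below is made up for illustration and is not the one shown on the slide, and real WordNet concepts can have several parents, which this tree-only version ignores.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent links (a tree, unlike real WordNet).
PARENT: Dict[str, str] = {
    "bike": "wheeled vehicle",
    "truck": "motor vehicle",
    "car": "motor vehicle",
    "motor vehicle": "wheeled vehicle",
    "wheeled vehicle": "vehicle",
    "vehicle": "entity",
}


def ancestors(concept: str) -> List[str]:
    """Path from the concept up to the root, including the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1: str, c2: str) -> Optional[str]:
    """Deepest node that lies on both paths to the root."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    return None


if __name__ == "__main__":
    print(lcs("bike", "truck"))
```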
  • 15. Leacock & Chodorow [1998]. Length = length of the shortest path between two concepts using node-counting; D = max depth of the taxonomy.
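The two quantities defined on the Leacock & Chodorow slide combine as sim = -log(length / (2 · D)), which is how the measure is usually stated. A minimal sketch, with made-up input values:

```python
import math


def leacock_chodorow(length: int, max_depth: int) -> float:
    """-log(length / (2 * D)); shorter paths give higher similarity."""
    if length <= 0 or max_depth <= 0:
        raise ValueError("length and max_depth must be positive")
    return -math.log(length / (2.0 * max_depth))


if __name__ == "__main__":
    # Hypothetical values: a path of 4 nodes in a taxonomy of depth 16.
    print(leacock_chodorow(length=4, max_depth=16))
```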
  • 16. Wu & Palmer ’94. SimW&P(c1,c2) is computed from the depths of the two concepts and of their lcs in the taxonomy.
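The Wu & Palmer measure is usually stated as 2 · depth(lcs) / (depth(c1) + depth(c2)). A minimal sketch under that definition; the depths used in the example are made up.

```python
def wu_palmer(depth_lcs: int, depth_c1: int, depth_c2: int) -> float:
    """2 * depth(lcs) / (depth(c1) + depth(c2)), in (0, 1] for valid depths."""
    if depth_c1 + depth_c2 == 0:
        raise ValueError("concept depths must not both be zero")
    return 2.0 * depth_lcs / (depth_c1 + depth_c2)


if __name__ == "__main__":
    # Hypothetical depths: the lcs sits at depth 3, the concepts at 5 and 4.
    print(wu_palmer(depth_lcs=3, depth_c1=5, depth_c2=4))
```

The score reaches 1.0 only when both concepts coincide with their lcs, and it shrinks as the concepts sit further below their common ancestor.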
  • 17. Resnik ’95 || Lin ’98 || Jiang & Conrath ’97. Based on information content (IC): IC(‘carving fork’) > IC(‘entity’); IC(concept) = -log(p(concept)). SimResnik(c1,c2) = IC(lcs(c1,c2)). Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle, so all of these pairs receive the same score. SimLin and SimJ&C additionally take IC(c1) and IC(c2) into account.
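The three IC-based measures can be sketched side by side. The Lin and Jiang & Conrath forms below are the commonly stated definitions (Lin: 2 · IC(lcs) / (IC(c1) + IC(c2)); Jiang & Conrath: 1 / (IC(c1) + IC(c2) − 2 · IC(lcs))); the IC values in the example are made up, and returning infinity for a zero distance is a choice made for this sketch.

```python
import math


def information_content(probability: float) -> float:
    """IC(concept) = -log p(concept); rarer concepts carry more information."""
    return -math.log(probability)


def resnik(ic_lcs: float) -> float:
    """Resnik: the IC of the least common subsumer alone."""
    return ic_lcs


def lin(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Lin: Resnik's score scaled by the IC of the two concepts."""
    return 2.0 * ic_lcs / (ic_c1 + ic_c2)


def jiang_conrath(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Jiang & Conrath: inverse of the IC-based distance."""
    distance = ic_c1 + ic_c2 - 2.0 * ic_lcs
    # Assumption for this sketch: identical concepts give an infinite score.
    return float("inf") if distance == 0 else 1.0 / distance


if __name__ == "__main__":
    # Hypothetical IC values for two concepts and their lcs.
    print(resnik(1.5), lin(2.0, 4.0, 1.5), jiang_conrath(2.0, 4.0, 1.5))
```

Unlike Resnik, the other two scores change when the concepts themselves change, even if the lcs stays the same.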
  • 18. Experiment. Automatically identify whether two text segments are paraphrases of each other. Corpus: Microsoft paraphrase corpus [Dolan et al 2004]; 4,076 training and 1,725 test pairs; news sources over 18 months; human-labelled with 83% agreement. The system labels a pair as ‘paraphrase’ if score > 0.5. Baselines: random baseline; vector-based using cosine similarity.
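The decision rule and its evaluation can be sketched in a few lines. The scores and gold labels below are made up; only the strict `> 0.5` threshold comes from the slide.

```python
from typing import List


def label_pairs(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """True means 'paraphrase'; the comparison is strict, as on the slide."""
    return [score > threshold for score in scores]


def accuracy(predicted: List[bool], gold: List[bool]) -> float:
    """Fraction of pairs whose predicted label matches the human label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical similarity scores and human labels for four pairs.
    scores = [0.80, 0.46, 0.50, 0.91]
    gold = [True, False, True, True]
    print(accuracy(label_pairs(scores), gold))
```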
  • 19. Results. Identified similarities: 18,000. Lexical matched: 14,500. Semantic similarity: 3,500.
  • 20. Results [cont’d]. Pearson correlation.
  • 21. Discussion. Paraphrase pair that only Resnik, PMI and LSA passed: “Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.” / “Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.” NOT PARAPHRASE pair that only cosine and Resnik got right: “The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.” / “The man was trapped about 250 feet from the shore, right at the edge of the falls.” Corpus-based merits: no hand-made resources are needed. Knowledge-based merits: encode fine-grained information.
  • 22. Improvements. The bag-of-words approach ignores important relationships. Hence, consider more sophisticated representations of sentence structure: first-order predicate logic; semantic parse tree.
  • 23. Any questions before moving to the next paper?
  • 24. Second paper: Gabrilovich, E.; Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence, January, pp. 1606–1611.
  • 25. Text relatedness. From words → concepts: ‘cat’ – ‘mouse’; ‘preparing a manuscript’ – ‘writing an article’. Background knowledge is necessary. Traditional approaches used to rely on statistical measures.
  • 26. Paper contribution. Explicit Semantic Analysis (ESA): a new approach to representing semantics of natural language texts using natural concepts. Proposes a uniform way of computing relatedness of both individual words and arbitrarily long text fragments. The results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art.
  • 27. Explicit Semantic Analysis (ESA): a fine-grained semantic representation of unrestricted natural language texts. Meaning is represented in a high-dimensional space of natural concepts derived from Wikipedia.
  • 29–31. The semantic interpreter iterates over the text words, retrieves corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text. Centroid-based classifier [Han and Karypis, 2000]; cosine similarity.
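The three steps of the semantic interpreter (iterate over words, look up the inverted index, merge into a concept vector) followed by a cosine comparison can be sketched as below. The tiny inverted index, its concept names and weights, and the use of raw term counts as word weights are all assumptions made for illustration; the real index is built from Wikipedia articles.

```python
import math
from collections import Counter, defaultdict
from typing import Dict

# Hypothetical inverted index: word -> {concept: association weight}.
INVERTED_INDEX: Dict[str, Dict[str, float]] = {
    "cat": {"Cat": 0.9, "Pet": 0.5},
    "mouse": {"Mouse (rodent)": 0.8, "Computer mouse": 0.6, "Pet": 0.2},
    "dog": {"Dog": 0.9, "Pet": 0.6},
}


def interpret(text: str) -> Dict[str, float]:
    """Merge the index entries of all words into one weighted concept vector."""
    concept_vector: Dict[str, float] = defaultdict(float)
    # Assumption: raw term counts stand in for the word weights.
    for word, count in Counter(text.lower().split()).items():
        for concept, weight in INVERTED_INDEX.get(word, {}).items():
            concept_vector[concept] += count * weight
    return dict(concept_vector)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(text_a: str, text_b: str) -> float:
    return cosine(interpret(text_a), interpret(text_b))


if __name__ == "__main__":
    print(relatedness("cat", "dog"), relatedness("cat", "mouse"))
```

Two texts with no words in common can still score above zero here, because they may activate the same concepts.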
  • 32. Works well on text segments…
  • 33. …as well as on ambiguous words.
  • 34. Experiment setup: Wikipedia. Wikipedia XML dump of 26 March 2006: 1,187,839 articles. Remove small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links; remaining: 241,393 articles. Remove stop words and rare words, and apply stemming; remaining: 389,202 words.
  • 35. Experiment setup: ODP. Open Directory Project (ODP, http://www.dmoz.org), April 2004. Hierarchy of over 400,000 concepts and 2,800,000 URLs. 20,700,000 distinct terms used to represent ODP nodes as attribute vectors.
  • 36. Dataset. Word relatedness: WordSimilarity-353 collection [Finkelstein et al., 2002]; each pair has 13–16 human judgements, which were averaged to produce a single relatedness score. Document relatedness: collection of 50 documents from the Australian Broadcasting Corporation’s news mail service [Lee et al., 2005]; each of the 1,225 pairs has 8–12 human judgements.