Text Similarity
Abdul-Baquee Sharaf, 11-Feb-2010
First paper
Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI’06, July 2006.
The problem
  • I own a dog.
    I have an animal.
  • When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him.
    When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.
  • Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.
    Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
  • The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
    The man was trapped about 250 feet from the shore, right at the edge of the falls.
Some applications
  • Information retrieval using the vector space model [Salton & Lesk 1971]
  • Relevance feedback and text classification [Rocchio 1971]
  • Word sense disambiguation [Lesk 1986; Schütze 1998]
  • Extractive summarization [Salton et al 1997]
  • Automatic evaluation of machine translation [Papineni et al 2002]
  • Text summarization [Lin & Hovy 2003]
  • Evaluation of text coherence [Lapata & Barzilay 2005]
Solution 1: lexical similarity
  • Simple lexical matching
  • Using the vector space model
Vector Space Model
  • SaS = Sense and Sensibility (Austen)
  • PaP = Pride and Prejudice (Austen)
  • WH = Wuthering Heights (Brontë)
  • Sim(SaS, PaP) = 0.999
  • Sim(SaS, WH) = 0.888
  • Problems: synonymy, polysemy
[source: Manning et al, IR book]
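The similarity scores on this slide are cosine similarities between document vectors. A minimal sketch, assuming raw term-frequency vectors and whitespace tokenization (both are simplifications chosen for illustration, not the weighting used for the figures above):

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two sentences share only the token "i", so the score stays low
# even though their meanings are close: the synonymy problem.
print(cosine_similarity("I own a dog", "I have an animal"))
```

Because the measure sees only surface tokens, "dog" and "animal" contribute nothing to the score, which is the limitation the next approach targets.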
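Solution 2 (this paper)
  • Leverage existing word-to-word similarity measures, either corpus-based or knowledge-based
  • Reduce text similarity to word-to-word similarity
  • maxSim = highest word-to-word similarity under one of 8 similarity measures (next slides)
  • idf = inverse document frequency; in its standard form, $\mathrm{idf}(w) = \log(N / \mathrm{df}(w))$, with $N$ documents in the collection and $\mathrm{df}(w)$ of them containing $w$
  • Specificity: a match between specific words weighs more than one between generic words, e.g. [collie, sheepdog] > [get, become]
  • Only open-class words that share the same POS are compared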
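The combination of maxSim and idf can be sketched as follows. This is a simplified illustration: each word's best match in the other text is weighted by idf, normalized, and the two directions are averaged. The word-similarity table and idf values are hypothetical toy numbers, and the same-POS restriction is left out for brevity.

```python
from typing import Callable, Dict, List

WordSim = Callable[[str, str], float]


def directional_score(source: List[str], target: List[str],
                      word_sim: WordSim, idf: Dict[str, float]) -> float:
    """idf-weighted average of each source word's best match in target."""
    weighted_sum = 0.0
    idf_total = 0.0
    for word in source:
        max_sim = max((word_sim(word, other) for other in target), default=0.0)
        weight = idf.get(word, 1.0)  # assumption: unseen words get weight 1.0
        weighted_sum += max_sim * weight
        idf_total += weight
    return weighted_sum / idf_total if idf_total else 0.0


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Average of the two directional scores, so the measure is symmetric."""
    return 0.5 * (directional_score(t1, t2, word_sim, idf)
                  + directional_score(t2, t1, word_sim, idf))


# Hypothetical toy resources, for illustration only.
TOY_SIM = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.7}
TOY_IDF = {"own": 1.0, "have": 1.0, "dog": 2.0, "animal": 1.5}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


print(text_similarity(["own", "dog"], ["have", "animal"], toy_word_sim, TOY_IDF))
```

Any of the word-to-word measures on the following slides could be plugged in as `word_sim`.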
Approaches
Pointwise Mutual Information (PMI)
  • Unsupervised
  • Based on co-occurrence counts in a very large corpus
  • NEAR query: co-occurrence within a ten-word window
  • 72.5% accuracy at identifying the correct synonym out of 4 TOEFL synonym choices
Source: [Turney 2001]
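A minimal sketch of the PMI score computed from hit counts, assuming probabilities are estimated as hits divided by corpus size. The counts in the example are invented for illustration; no search engine is queried.

```python
import math


def pmi_ir(hits_near: int, hits_w1: int, hits_w2: int, corpus_size: int) -> float:
    """log2( p(w1 NEAR w2) / (p(w1) * p(w2)) ), with p(x) = hits(x) / corpus_size."""
    if min(hits_near, hits_w1, hits_w2) <= 0:
        # Convention chosen for this sketch: no evidence means minus infinity.
        return float("-inf")
    return math.log2(hits_near * corpus_size / (hits_w1 * hits_w2))


# Hypothetical counts: the words co-occur 25 times more often than chance.
print(pmi_ir(hits_near=50, hits_w1=1000, hits_w2=2000, corpus_size=1_000_000))
```

A score of zero means the two words co-occur exactly as often as independence would predict; positive scores indicate association.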
PMI - Example
  • When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him.
    When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.
  • Using PMI-IR and the NEAR operator of AltaVista
  • Result: 0.80 vs. cosine (0.46)
Latent Semantic Analysis [Landauer 1998]
  • Term co-occurrences are captured by dimensionality reduction, through singular value decomposition (SVD) of the term-by-document matrix T
  • $T = U \Sigma_k V^{\top}$
  • T = term-by-document matrix
  • $\Sigma_k$ = diagonal k x k matrix
  • U and V are column-orthogonal matrices
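The decomposition can be tried out with NumPy. The sketch below assumes a small hand-made term-by-document matrix (toy data) and uses each term's row of U_k scaled by the singular values as its reduced vector, which is one common choice rather than the only one.

```python
import numpy as np


def truncated_svd(term_doc: np.ndarray, k: int):
    """Return U_k, the k largest singular values, and V_k^T."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def term_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Rows are k-dimensional term representations (U_k scaled by Sigma_k)."""
    u_k, s_k, _ = truncated_svd(term_doc, k)
    return u_k * s_k


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy term-by-document counts (rows = terms, columns = documents).
TERMS = ["ship", "boat", "ocean", "voyage", "trip"]
T = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

vectors = term_vectors(T, k=2)
# "ship" and "boat" never share a document, yet can be compared in the reduced space.
print(cosine(vectors[0], vectors[1]))
```

Keeping only k dimensions is what lets terms that never co-occur directly end up close to each other.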
Example [source: Manning et al, IR book]
Lesk [1986]
  • Similarity of two concepts is defined as a function of the overlap between the corresponding definitions
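A minimal sketch of the overlap idea, assuming the simplest possible overlap function: the number of distinct non-stopword tokens shared by two definitions. The glosses and the stopword list are hand-written for illustration and are not taken from a dictionary.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "or", "and", "to", "in",
                       "that", "is", "for", "with", "by", "on"})


def gloss_overlap(gloss_a: str, gloss_b: str) -> int:
    """Count distinct content words that appear in both definitions."""
    words_a = {w for w in gloss_a.lower().split() if w not in STOPWORDS}
    words_b = {w for w in gloss_b.lower().split() if w not in STOPWORDS}
    return len(words_a & words_b)


# Hypothetical glosses for two concepts.
bike_gloss = "a wheeled vehicle that has two wheels and is moved by foot pedals"
truck_gloss = "a wheeled vehicle for hauling loads"
print(gloss_overlap(bike_gloss, truck_gloss))  # shared: "wheeled", "vehicle"
```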
WordNet Hierarchy
  • What is lcs(bike, truck)? (lcs = least common subsumer)
Source: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx
Leacock & Chodorow [1998]
  • $\mathrm{Sim}_{lch}(c_1, c_2) = -\log \frac{\mathit{length}}{2 \cdot D}$
  • length = length of the shortest path between the two concepts, using node counting
  • D = maximum depth of the taxonomy
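Wu & Palmer ’94
  • $\mathrm{Sim}_{W\&P}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$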
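The two path-based measures can be sketched on a tiny hand-made taxonomy. The hierarchy below is hypothetical (it is not WordNet), depth is counted in nodes with the root at depth 1, and the natural logarithm is used; all three are assumptions of this sketch.

```python
import math

# Hypothetical toy taxonomy: child -> parent.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "wheeled_vehicle": "vehicle",
    "aircraft": "vehicle",
    "bike": "wheeled_vehicle",
    "truck": "wheeled_vehicle",
    "jet": "aircraft",
}


def path_to_root(concept):
    """Concept followed by all of its ancestors, ending at the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    return len(path_to_root(concept))  # root has depth 1


def lcs(c1, c2):
    """Least common subsumer: the deepest ancestor shared by both concepts."""
    ancestors_c2 = set(path_to_root(c2))
    for node in path_to_root(c1):
        if node in ancestors_c2:
            return node
    raise ValueError("concepts share no ancestor")


MAX_DEPTH = max(depth(c) for c in PARENT)


def path_length(c1, c2):
    """Number of nodes on the shortest path between c1 and c2."""
    d = depth(lcs(c1, c2))
    return (depth(c1) - d) + (depth(c2) - d) + 1


def sim_lch(c1, c2):
    return -math.log(path_length(c1, c2) / (2 * MAX_DEPTH))


def sim_wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


print(lcs("bike", "truck"), sim_lch("bike", "truck"), sim_wup("bike", "truck"))
```

In this toy hierarchy, bike and truck meet at a deeper node than bike and jet, so both measures rank the first pair higher.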
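Resnik ’95, Lin ’98, Jiang & Conrath ’97
  • Based on information content (IC)
  • IC(‘carving fork’) > IC(‘entity’)
  • $\mathrm{IC}(c) = -\log p(c)$
  • $\mathrm{Sim}_{Resnik}(c_1, c_2) = \mathrm{IC}(\mathrm{lcs}(c_1, c_2))$
  • Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle, so all of these pairs receive the same Resnik score
  • $\mathrm{Sim}_{Lin}(c_1, c_2) = \frac{2 \cdot \mathrm{IC}(\mathrm{lcs}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}$
  • $\mathrm{Sim}_{J\&C}(c_1, c_2) = \frac{1}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \cdot \mathrm{IC}(\mathrm{lcs}(c_1, c_2))}$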
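A minimal sketch of the three IC-based measures, assuming a hypothetical toy taxonomy and invented concept probabilities (each probability includes the concept's descendants, so the root has probability 1). Real systems estimate these probabilities from corpus counts.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and concept probabilities.
PARENT = {
    "entity": None, "vehicle": "entity", "wheeled_vehicle": "vehicle",
    "aircraft": "vehicle", "bike": "wheeled_vehicle",
    "truck": "wheeled_vehicle", "jet": "aircraft",
}
PROB = {
    "entity": 1.0, "vehicle": 0.2, "wheeled_vehicle": 0.1,
    "aircraft": 0.05, "bike": 0.02, "truck": 0.03, "jet": 0.01,
}


def path_to_root(concept):
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(c1, c2):
    """Least common subsumer: first ancestor of c1 that also subsumes c2."""
    ancestors_c2 = set(path_to_root(c2))
    return next(node for node in path_to_root(c1) if node in ancestors_c2)


def ic(concept):
    """Information content: rarer (more specific) concepts score higher."""
    return -math.log(PROB[concept])


def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))


def sim_lin(c1, c2):
    total = ic(c1) + ic(c2)
    return 2 * ic(lcs(c1, c2)) / total if total else 0.0


def sim_jcn(c1, c2):
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    # Identical concepts have zero distance; report infinity in this sketch.
    return 1 / distance if distance > 0 else float("inf")


for pair in [("bike", "truck"), ("bike", "jet")]:
    print(pair, sim_resnik(*pair), sim_lin(*pair), sim_jcn(*pair))
```

Lin and Jiang & Conrath also use the IC of the two concepts themselves, so pairs that share the same lcs no longer collapse to a single score.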
Experiment
  • Task: automatically identify whether two text segments are paraphrases of each other
  • Corpus: Microsoft paraphrase corpus [Dolan et al 2004]
  • 4,076 training pairs and 1,725 test pairs
  • News sources collected over 18 months
  • Human-labelled, with 83% agreement
  • The system labels a pair as ‘paraphrase’ if score > 0.5
  • Baselines: random baseline; vector-based baseline using cosine similarity
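The decision rule is a fixed threshold on the similarity score. A minimal sketch, assuming the usual accuracy, precision, recall and F-measure as evaluation metrics; the scores and gold labels are invented for illustration and are not results from the paper.

```python
from typing import Dict, List


def label_pairs(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """A pair is labelled 'paraphrase' when its score exceeds the threshold."""
    return [score > threshold for score in scores]


def evaluate(predicted: List[bool], gold: List[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the 'paraphrase' class."""
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)
    fp = sum(p and not g for p, g in pairs)
    fn = sum((not p) and g for p, g in pairs)
    correct = sum(p == g for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical similarity scores and human labels.
scores = [0.81, 0.42, 0.63, 0.27]
gold = [True, True, False, False]
print(evaluate(label_pairs(scores), gold))
```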
Results
  Identified similarities    18,000
  Lexical matched            14,500
  Semantic similarity         3,500
Results [cont’d]
Pearson correlation
Discussion
  • Only Resnik, PMI and LSA passed
  • NOT PARAPHRASE: only cosine and Resnik got it!
  • Corpus-based merits: no hand-made resources are needed
  • Knowledge-based merits: encode fine-grained information
Example pairs:
  • Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor.
    Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
  • The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
    The man was trapped about 250 feet from the shore, right at the edge of the falls.
Improvements
  • The bag-of-words approach ignores important relationships
  • Hence, consider more sophisticated representations of sentence structure:
    first-order predicate logic, semantic parse trees
Any questions before moving to the next paper?

Second paper
Gabrilovich, E.; Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence, January, pp. 1606-1611.
Text relatedness
  • From words → concepts
  • ‘cat’ vs. ‘mouse’
  • ‘preparing a manuscript’ vs. ‘writing an article’
  • Background knowledge is necessary
  • Traditional approaches relied on statistical measures
Paper contribution
  • Explicit Semantic Analysis (ESA): a new approach to representing the semantics of natural language texts using natural concepts.
  • A uniform way of computing relatedness for both individual words and arbitrarily long text fragments.
  • The results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art.
Explicit Semantic Analysis (ESA)
  • A fine-grained semantic representation of unrestricted natural language texts.
  • Meaning is represented in a high-dimensional space of natural concepts derived from Wikipedia.
Architecture
Given a text fragment:
  • The semantic interpreter iterates over the text words,
  • retrieves the corresponding entries from the inverted index,
  • and merges them into a weighted vector of concepts that represents the given text.
  • Centroid-based classifier [Han and Karypis, 2000]
  • Cosine similarity
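The steps above can be sketched with a tiny inverted index that maps each word to weighted concepts. The index contents and weights are invented for illustration (a real index would be built from Wikipedia articles with TF-IDF weights), and word weights in the input text are reduced here to plain occurrence counts.

```python
import math
from collections import defaultdict
from typing import Dict

ConceptVector = Dict[str, float]

# Hypothetical inverted index: word -> {concept: weight}.
INVERTED_INDEX: Dict[str, ConceptVector] = {
    "cat": {"Cat": 0.9, "Pet": 0.5, "Tom and Jerry": 0.3},
    "mouse": {"Mouse": 0.9, "Computer mouse": 0.6, "Tom and Jerry": 0.4},
    "keyboard": {"Computer keyboard": 0.9, "Computer mouse": 0.3},
}


def interpret(text: str, index: Dict[str, ConceptVector]) -> ConceptVector:
    """Merge the index entries of all words into one weighted concept vector."""
    vector: Dict[str, float] = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return dict(vector)


def cosine(u: ConceptVector, v: ConceptVector) -> float:
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(text_a: str, text_b: str) -> float:
    return cosine(interpret(text_a, INVERTED_INDEX),
                  interpret(text_b, INVERTED_INDEX))


print(relatedness("cat", "mouse"), relatedness("cat", "keyboard"))
```

Because both words and longer fragments are mapped into the same concept space, the same `relatedness` call covers either case.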
Works well on text segments...
...as well as on ambiguous words
Experiment Setup - Wikipedia
  • Wikipedia XML dump of 26 March 2006
  • 1,187,839 articles
  • Remove small concepts with fewer than 100 words and fewer than 5 incoming or outgoing links
  • Remaining: 241,393 articles
  • Remove stop words and rare words, and apply stemming
  • Remaining: 389,202 words
Experiment Setup - ODP
  • Open Directory Project (ODP, http://www.dmoz.org), April 2004
  • Hierarchy of over 400,000 concepts and 2,800,000 URLs
  • 20,700,000 distinct terms used to represent ODP nodes as attribute vectors

More Related Content

What's hot

Text classification presentation
Text classification presentationText classification presentation
Text classification presentationMarijn van Zelst
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Hady Elsahar
 
Language Detection Library for Java
Language Detection Library for Java Language Detection Library for Java
Language Detection Library for Java Shuyo Nakatani
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)VenkateshMurugadas
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionKent State University
 
Recommendation System Explained
Recommendation System ExplainedRecommendation System Explained
Recommendation System ExplainedCrossing Minds
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs BootcampFiza987241
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapAnant Corporation
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.netwww.myassignmenthelp.net
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector spaceAbdullah Khan Zehady
 

What's hot (20)

Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe
 
Language Detection Library for Java
Language Detection Library for Java Language Detection Library for Java
Language Detection Library for Java
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
NLP
NLPNLP
NLP
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
 
Recommendation System Explained
Recommendation System ExplainedRecommendation System Explained
Recommendation System Explained
 
Word embedding
Word embedding Word embedding
Word embedding
 
Text summarization
Text summarization Text summarization
Text summarization
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 

Viewers also liked

similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space modeldalal404
 
Machine Learning and Quran - The Meccan and Medinan Verses
Machine Learning and Quran - The Meccan and Medinan VersesMachine Learning and Quran - The Meccan and Medinan Verses
Machine Learning and Quran - The Meccan and Medinan VersesAbdul Baquee Muhammad Sharaf
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrievalNanthini Dominique
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...University of Minnesota, Duluth
 
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)Information Consolidation and Concentration (WP4 ForgetIT 1st year review)
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)ForgetIT Project
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionAlexander Panchenko
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Longest Common Subsequence (LCS) Algorithm
Longest Common Subsequence (LCS) AlgorithmLongest Common Subsequence (LCS) Algorithm
Longest Common Subsequence (LCS) AlgorithmDarshit Metaliya
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare pptMandy Suzanne
 

Viewers also liked (13)

similarity measure
similarity measure similarity measure
similarity measure
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
 
Machine Learning and Quran - The Meccan and Medinan Verses
Machine Learning and Quran - The Meccan and Medinan VersesMachine Learning and Quran - The Meccan and Medinan Verses
Machine Learning and Quran - The Meccan and Medinan Verses
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
 
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)Information Consolidation and Concentration (WP4 ForgetIT 1st year review)
Information Consolidation and Concentration (WP4 ForgetIT 1st year review)
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Longest Common Subsequence (LCS) Algorithm
Longest Common Subsequence (LCS) AlgorithmLongest Common Subsequence (LCS) Algorithm
Longest Common Subsequence (LCS) Algorithm
 
NLP
NLPNLP
NLP
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Similar to Text Similarity

Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)Uma Se
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining TechniquesHouw Liong The
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfJemalNesre1
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
Machine learning-and-data-mining-19-mining-text-and-web-data
Machine learning-and-data-mining-19-mining-text-and-web-dataMachine learning-and-data-mining-19-mining-text-and-web-data
Machine learning-and-data-mining-19-mining-text-and-web-dataitstuff
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Stuart Chalk
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of InformationAdrian Paschke
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Innovation Quotient Pvt Ltd
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModelingSardhendu Mishra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
download
downloaddownload
downloadbutest
 
download
downloaddownload
downloadbutest
 
CMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics ICMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics Ibutest
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisMathieu d'Aquin
 

Similar to Text Similarity (20)

Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
 
Web and text
Web and textWeb and text
Web and text
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdf
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
07 04-06
07 04-0607 04-06
07 04-06
 
Machine learning-and-data-mining-19-mining-text-and-web-data
Machine learning-and-data-mining-19-mining-text-and-web-dataMachine learning-and-data-mining-19-mining-text-and-web-data
Machine learning-and-data-mining-19-mining-text-and-web-data
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
download
downloaddownload
download
 
download
downloaddownload
download
 
CMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics ICMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics I
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
 

More from Abdul Baquee Muhammad Sharaf (7)

A world of three zeros
A world of three zerosA world of three zeros
A world of three zeros
 
The opening
The openingThe opening
The opening
 
Arabic Grammar Relations
Arabic Grammar RelationsArabic Grammar Relations
Arabic Grammar Relations
 
The Quran and Computational Linguistics
The Quran and Computational LinguisticsThe Quran and Computational Linguistics
The Quran and Computational Linguistics
 
ASAP Methodology in Implementing ERP
ASAP Methodology in Implementing ERPASAP Methodology in Implementing ERP
ASAP Methodology in Implementing ERP
 
Signs of Allah in Nature
Signs of Allah in NatureSigns of Allah in Nature
Signs of Allah in Nature
 
Pairs
PairsPairs
Pairs
 

Recently uploaded

Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 

Recently uploaded (20)

Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Text Similarity

  • 2. First paper: Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI’06, July 2006.
  • 3. The problem: how similar are the two texts in each pair?
      Pair 1: "I own a dog." / "I have an animal."
      Pair 2: "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him."
      Pair 3: "Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday."
      Pair 4: "The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls."
  • 4. Some applications
      Information retrieval using the vector space model [Salton & Lesk 1971]
      Relevance feedback and text classification [Rocchio 1971]
      Word sense disambiguation [Lesk 1986; Schütze 1998]
      Extractive summarization [Salton et al 1997]
      Automatic evaluation of machine translation [Papineni et al 2002]
      Text summarization [Lin & Hovy 2003]
      Evaluation of text coherence [Lapata & Barzilay 2005]
  • 5. Solution 1: lexical similarity. Simple lexical matching using the vector space model.
  • 6. Vector Space Model. SaS = Sense and Sensibility (Austen); PaP = Pride and Prejudice (Austen); WH = Wuthering Heights (Brontë). Sim(SaS, PaP) = 0.999; Sim(SaS, WH) = 0.888. Problems: synonymy, polysemy. [source: Manning et al, IR book]
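The synonymy problem named on this slide can be seen in a few lines. The sketch below is an illustration, not the setup behind the numbers above: it assumes raw term-frequency vectors and whitespace tokenisation, with no stemming or idf weighting.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between raw term-frequency vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# "dog"/"animal" and "own"/"have" share no surface form, so only "i" matches.
score = cosine_similarity("I own a dog", "I have an animal")
print(score)
```

Because the vectors only see surface forms, the related words in the two sentences contribute nothing to the score.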
  • 7. Solution 2 (this paper): leverage existing word-to-word similarity measures, either corpus-based or knowledge-based, so that text similarity is reduced to word-to-word similarity. maxSim = the highest word-to-word similarity between a word and the words of the other text, based on one of 8 similarity measures (next slides). idf = inverse document frequency, used as a measure of word specificity: a match between [collie, sheepdog] counts for more than one between [get, become]. Only open-class words that share the same part of speech (POS) are compared.
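The combination of maxSim and idf can be sketched as below. This follows the scoring function as I recall it from the Mihalcea et al. paper: each text is scored against the other by an idf-weighted average of maxSim, and the two directions are averaged. The word-similarity table, the idf values, and the function names are hypothetical, and the POS restriction is left out for brevity.

```python
from typing import Callable, Dict, List

WordSim = Callable[[str, str], float]


def directional_score(source: List[str], target: List[str],
                      word_sim: WordSim, idf: Dict[str, float]) -> float:
    """idf-weighted average of maxSim(word, target) over the words of source."""
    weighted_sum = 0.0
    idf_sum = 0.0
    for word in source:
        max_sim = max((word_sim(word, other) for other in target), default=0.0)
        weight = idf.get(word, 1.0)  # assumption: unknown words get idf 1.0
        weighted_sum += max_sim * weight
        idf_sum += weight
    return weighted_sum / idf_sum if idf_sum > 0 else 0.0


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Average of the two directional scores, so the measure is symmetric."""
    return 0.5 * (directional_score(t1, t2, word_sim, idf)
                  + directional_score(t2, t1, word_sim, idf))


# Hypothetical word-to-word similarities standing in for one of the 8 measures.
TOY_SIM = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.7}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


# Hypothetical idf values: the more specific nouns weigh more than the verbs.
TOY_IDF = {"dog": 2.0, "animal": 2.0, "own": 1.0, "have": 1.0}

score = text_similarity(["own", "dog"], ["have", "animal"], toy_word_sim, TOY_IDF)
print(round(score, 3))
```

With these toy values the dog/animal match dominates the score because of its higher idf, which is the specificity effect the slide describes.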
  • 9. Pointwise Mutual Information (PMI). Unsupervised; based on co-occurrence counts in a very large corpus. NEAR query: co-occurrence within a ten-word window. 72.5% accuracy on identifying the correct synonym out of 4 TOEFL synonym choices. Source: [Turney 2001]
  • 10. PMI example. "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him." / "When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him." Using PMI-IR with the NEAR operator of AltaVista. Result: 0.80, versus 0.46 for cosine.
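A minimal sketch of the idea, assuming the usual definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))). Search-engine hit counts are replaced here by document counts over a tiny hypothetical collection, and the NEAR operator is imitated by a token-distance check; none of the names below come from Turney's implementation.

```python
import math
from typing import List


def near(tokens: List[str], a: str, b: str, window: int = 10) -> bool:
    """True if a and b occur within `window` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def pmi_ir(docs: List[List[str]], a: str, b: str, window: int = 10) -> float:
    """log2( hits(a NEAR b) * N / (hits(a) * hits(b)) ) over N documents."""
    hits_a = sum(a in d for d in docs)
    hits_b = sum(b in d for d in docs)
    hits_near = sum(near(d, a, b, window) for d in docs)
    if hits_near == 0 or hits_a == 0 or hits_b == 0:
        return float("-inf")  # the words never co-occur in this collection
    return math.log2(hits_near * len(docs) / (hits_a * hits_b))


# Hypothetical four-document collection.
docs = [
    "the lawyer also called an attorney spoke".split(),
    "her attorney is a trial lawyer".split(),
    "the dog ran home".split(),
    "a cat sat".split(),
]

print(pmi_ir(docs, "lawyer", "attorney"))
print(pmi_ir(docs, "lawyer", "dog"))
```

A positive value means the two words co-occur more often than chance would predict from their individual frequencies.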
  • 11. Latent Semantic Analysis [Landauer 1998]. Term co-occurrence is captured by means of dimensionality reduction through singular value decomposition (SVD) of the term-by-document matrix T: $T = U \Sigma_k V^T$, where T = term-by-document matrix, $\Sigma_k$ = diagonal k x k matrix, and U and V are column-orthogonal matrices.
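The dimensionality reduction can be illustrated with NumPy on a toy matrix. The counts below are hypothetical, and taking the rows of U_k Σ_k as term vectors is one common convention rather than something stated on the slide.

```python
import numpy as np

TERMS = ["lawyer", "attorney", "court", "dog", "animal"]

# Hypothetical term-by-document counts: rows are terms, columns are documents.
T = np.array([
    [1, 0, 0, 0],   # lawyer   appears in doc 1
    [0, 1, 0, 0],   # attorney appears in doc 2
    [1, 1, 0, 0],   # court    appears in docs 1 and 2
    [0, 0, 1, 1],   # dog      appears in docs 3 and 4
    [0, 0, 1, 0],   # animal   appears in doc 3
], dtype=float)


def lsa_term_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest singular values; return one k-dim vector per term."""
    u, s, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


vectors = lsa_term_vectors(T, k=2)
lawyer, attorney, dog = (vectors[TERMS.index(t)]
                         for t in ("lawyer", "attorney", "dog"))

print(cosine(T[0], T[1]))        # raw rows: lawyer and attorney never co-occur
print(cosine(lawyer, attorney))  # reduced space: both co-occur with "court"
print(cosine(lawyer, dog))
```

In the raw matrix "lawyer" and "attorney" share no document, yet in the reduced space they end up close because both co-occur with "court".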
  • 13. Lesk [1986]: the similarity of two concepts is defined as a function of the overlap between their corresponding definitions.
  • 14. WordNet hierarchy. What is lcs(bike, truck), the least common subsumer of the two concepts? Source: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx
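As a rough sketch of definition overlap, the snippet below counts the content words two glosses share. The glosses are written for illustration and are not taken from any dictionary; the stop-word list and the plain count are simplifying assumptions.

```python
STOPWORDS = {"a", "an", "the", "of", "or", "for", "with", "and", "to", "in", "by"}

# Hypothetical glosses, written for this example only.
GLOSSES = {
    "bike": "a wheeled vehicle with two wheels moved by foot pedals",
    "truck": "a motor vehicle with wheels for carrying heavy loads",
    "fork": "a utensil with prongs for eating food",
}


def gloss_overlap(gloss_a: str, gloss_b: str) -> int:
    """Number of distinct non-stop-words shared by the two definitions."""
    words_a = set(gloss_a.lower().split()) - STOPWORDS
    words_b = set(gloss_b.lower().split()) - STOPWORDS
    return len(words_a & words_b)


print(gloss_overlap(GLOSSES["bike"], GLOSSES["truck"]))  # shares "vehicle", "wheels"
print(gloss_overlap(GLOSSES["bike"], GLOSSES["fork"]))
```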
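  • 15. Leacock & Chodorow [1998]: $Sim_{lch}(c_1, c_2) = -\log \frac{length}{2D}$, where length = length of the shortest path between the two concepts using node-counting, and D = maximum depth of the taxonomy.
  • 16. Wu & Palmer ’94: $Sim_{W\&P}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)}$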
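Both path-based measures can be tried on a small hand-made taxonomy. The hierarchy below is hypothetical and much shallower than WordNet; depth is counted in nodes with the root at depth 1, and the formulas used are the standard Leacock & Chodorow and Wu & Palmer definitions.

```python
import math
from typing import Dict, List

# Hypothetical is-a hierarchy: child -> parent. "entity" is the root.
PARENT: Dict[str, str] = {
    "vehicle": "entity",
    "animal": "entity",
    "wheeled_vehicle": "vehicle",
    "bike": "wheeled_vehicle",
    "truck": "wheeled_vehicle",
    "dog": "animal",
}


def path_to_root(concept: str) -> List[str]:
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept: str) -> int:
    """Node-counting depth: the root has depth 1."""
    return len(path_to_root(concept))


def lcs(c1: str, c2: str) -> str:
    """Least common subsumer: the deepest ancestor shared by c1 and c2."""
    ancestors_2 = set(path_to_root(c2))
    return next(c for c in path_to_root(c1) if c in ancestors_2)


def wu_palmer(c1: str, c2: str) -> float:
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def leacock_chodorow(c1: str, c2: str) -> float:
    max_depth = max(depth(c) for c in set(PARENT) | set(PARENT.values()))
    # Nodes on the shortest path from c1 up to the lcs and down to c2.
    length = depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1
    return -math.log(length / (2 * max_depth))


print(lcs("bike", "truck"))
print(wu_palmer("bike", "truck"), wu_palmer("bike", "dog"))
print(leacock_chodorow("bike", "truck"), leacock_chodorow("bike", "dog"))
```

In this toy hierarchy the bike/truck pair scores higher than bike/dog under both measures, since the two vehicles meet at a deep, specific node while bike and dog only meet at the root.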
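  • 17. Resnik ’95 || Lin ’98 || Jiang & Conrath ’97: based on information content (IC). IC(‘carving fork’) > IC(‘entity’). $IC(c) = -\log p(c)$. $Sim_{Resnik}(c_1, c_2) = IC(lcs(c_1, c_2))$. Problem: lcs(jumbo jet, tank, house trailer, ballistic missile) = vehicle, so every pair among these concepts gets the same Resnik score. $Sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(lcs(c_1, c_2))}{IC(c_1) + IC(c_2)}$. $Sim_{J\&C}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(lcs(c_1, c_2))}$.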
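The three IC-based measures differ only in how they combine IC(c1), IC(c2) and IC(lcs). The sketch below uses a hypothetical taxonomy and made-up concept probabilities, and takes logs in base 2 so the numbers stay readable; the formulas are the standard Resnik, Lin, and Jiang & Conrath definitions.

```python
import math
from typing import Dict, List

# Hypothetical is-a hierarchy (child -> parent) and concept probabilities.
PARENT: Dict[str, str] = {
    "vehicle": "entity", "animal": "entity",
    "bike": "vehicle", "truck": "vehicle", "dog": "animal",
}
PROB: Dict[str, float] = {
    "entity": 1.0, "vehicle": 0.25, "animal": 0.5,
    "bike": 0.0625, "truck": 0.0625, "dog": 0.125,
}


def ic(concept: str) -> float:
    """Information content: rarer (more specific) concepts score higher."""
    return math.log2(1.0 / PROB[concept])


def ancestors(concept: str) -> List[str]:
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1: str, c2: str) -> str:
    shared = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in shared)


def resnik(c1: str, c2: str) -> float:
    return ic(lcs(c1, c2))


def lin(c1: str, c2: str) -> float:
    denom = ic(c1) + ic(c2)
    return 2 * ic(lcs(c1, c2)) / denom if denom > 0 else 0.0


def jiang_conrath(c1: str, c2: str) -> float:
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    return 1.0 / distance if distance > 0 else float("inf")


# Resnik cannot tell these two pairs apart: both have lcs = vehicle.
print(resnik("bike", "truck"), resnik("bike", "vehicle"))
print(lin("bike", "truck"), lin("bike", "vehicle"))
print(jiang_conrath("bike", "truck"), jiang_conrath("bike", "dog"))
```

Lin and Jiang & Conrath also use the IC of the two concepts themselves, so they separate pairs that Resnik scores identically.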
  • 18. Experiment: automatically identify whether two text segments are paraphrases of each other. Corpus: Microsoft paraphrase corpus [Dolan et al 2004], with 4,076 training pairs and 1,725 test pairs, drawn from news sources over 18 months and labelled by humans with 83% agreement. The system labels a pair as ‘paraphrase’ if its similarity score > 0.5. Baselines: a random baseline, and a vector-based baseline using cosine similarity.
  • 19. Results: identified similarities 18,000; lexical matched 14,500; semantic similarity 3,500.
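The decision rule on this slide amounts to thresholding the similarity score and comparing against the human labels. A minimal sketch, with made-up scores and gold labels (not data from the corpus):

```python
from typing import List


def label_pairs(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """A pair is labelled 'paraphrase' when its score exceeds the threshold."""
    return [s > threshold for s in scores]


def accuracy(predicted: List[bool], gold: List[bool]) -> float:
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical similarity scores and human labels for four sentence pairs.
scores = [0.9, 0.4, 0.6, 0.2]
gold = [True, True, False, False]

predicted = label_pairs(scores)
print(predicted, accuracy(predicted, gold))
```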
  • 21. Discussion
      Example 1: "Gateway’s all-in-one PC, the Profile 4, also now features the new Intel processor." / "Gateway will release new Profile 4 systems with the new Intel technology on Wednesday." Only Resnik, PMI and LSA passed.
      Example 2 (NOT PARAPHRASE): "The man wasn’t on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore." / "The man was trapped about 250 feet from the shore, right at the edge of the falls." Only cosine and Resnik got it!
      Corpus-based merits: no hand-made resources are needed.
      Knowledge-based merits: encode fine-grained information.
  • 22. Improvements: the bag-of-words approach ignores important relationships between words. Hence, consider a more sophisticated representation of sentence structure: first-order predicate logic, or a semantic parse tree.
  • 23. Any questions before moving to the next paper?

  • 24. Second paper: Gabrilovich, E.; Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence, January, pp. 1606-1611.
  • 25. Text relatedness: from words to concepts. ‘cat’ – ‘mouse’; ‘preparing a manuscript’ – ‘writing an article’. Background knowledge is necessary. Traditional approaches relied on statistical measures.
  • 26. Paper contribution: Explicit Semantic Analysis (ESA), a new approach to representing the semantics of natural language texts using natural concepts. It proposes a uniform way of computing relatedness for both individual words and arbitrarily long text fragments. The paper reports that ESA's results for computing semantic relatedness of texts are superior to the existing state of the art.
  • 27. Explicit Semantic Analysis (ESA): a fine-grained semantic representation of unrestricted natural language texts. Meaning is represented in a high-dimensional space of natural concepts derived from Wikipedia.
  • 28.
  • 29. The semantic interpreter iterates over the text words,
  • 30. retrieves corresponding entries from the inverted index
  • 31. and merges them into a weighted vector of concepts that represents the given text. The interpreter is a centroid-based classifier [Han and Karypis, 2000]; two texts are compared by the cosine similarity of their concept vectors.
  • 32. Works well on text segments...
  • 33. ...as well as on ambiguous words
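The three steps described on these slides (iterate over words, look up the inverted index, merge into a concept vector) can be sketched as follows. The index entries and weights are hypothetical stand-ins for TF-IDF weights computed from Wikipedia articles, and each word occurrence is simply counted once.

```python
import math
from collections import defaultdict
from typing import Dict

# Hypothetical inverted index: word -> {Wikipedia concept: weight}.
INVERTED_INDEX: Dict[str, Dict[str, float]] = {
    "cat": {"Cat": 0.9, "Pet": 0.5, "Tom and Jerry": 0.4},
    "mouse": {"Mouse": 0.9, "Tom and Jerry": 0.5, "Computer mouse": 0.6},
    "keyboard": {"Computer keyboard": 0.9, "Computer mouse": 0.3},
}


def interpret(text: str) -> Dict[str, float]:
    """Merge the index entries of all words into one weighted concept vector."""
    concept_vector: Dict[str, float] = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in INVERTED_INDEX.get(word, {}).items():
            concept_vector[concept] += weight
    return dict(concept_vector)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relatedness(text_a: str, text_b: str) -> float:
    return cosine(interpret(text_a), interpret(text_b))


print(relatedness("cat", "mouse"))      # related through a shared concept
print(relatedness("cat", "keyboard"))   # no shared concept
```

Here "cat" and "mouse" share no surface form but still come out related, because both activate the same concept in the index.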
  • 34. Experiment setup: Wikipedia. Wikipedia XML dump of 26 March 2006: 1,187,839 articles. Remove small concepts with fewer than 100 words or fewer than 5 incoming or outgoing links: 241,393 articles remain. Remove stop words and rare words, and apply stemming: 389,202 distinct words remain.
  • 35. Experiment setup: ODP. Open Directory Project (ODP, http://www.dmoz.org), April 2004. Hierarchy of over 400,000 concepts and 2,800,000 URLs. 20,700,000 distinct terms are used to represent ODP nodes as attribute vectors.
  • 36. Dataset. Word relatedness: WordSimilarity-353 collection [Finkelstein et al., 2002]; each pair has 13–16 human judgements, which were averaged to produce a single relatedness score. Document relatedness: a collection of 50 documents from the Australian Broadcasting Corporation’s news mail service [Lee et al., 2005]; each of the 1,225 document pairs has 8–12 human judgements.