Information Retrieval using Semantic Similarity


Published in: Education

  1. Seminar on Artificial Intelligence
     Information Retrieval Using Semantic Similarity
     Harshita Meena (100050020), Diksha Meghwal (100050039), Saswat Padhi (100050061)
  2. Overview ...
     ● “Semantics” & “Ontology” (Diksha)
       ● What is IR lacking?
       ● Semantics: What? And How?
       ● Ontologies and knowledge representation
     ● Semantic Similarity (Harshita)
       ● Semantic Similarity: What? And How?
       ● Path based semantic similarity measures
       ● Information content based similarity measures
     ● Information Retrieval (Saswat)
       ● VSM Revisited
       ● SSRM: IR with semantics
       ● Conclusion and further reading
  3. “Semantics” & “Ontology”
     What is IR (without semantics) lacking? “MEANING”
     ● Query: software
     ● Pool: application, program, package, freeware, shareware
     ● Result: No match!
     This is the motivation for looking at semantic rather than lexical similarity: every term in the pool is related to the query, yet none of them matches it as a string.
     The problem in information retrieval today is not a lack of data, but a lack of “structured” and “meaningful” organisation of data. Ontologies are attempts to organise information and empower IR.
  4. “Semantics” & “Ontology”
     Semantics: What? And How?
     ● “Semantics” captures the meaning of linguistic terms. Computers do not understand “meaning”, so the meaning of a term is instead represented by its links to other terms.
     ● An “ontology” formally represents knowledge as a set of concepts within a domain, together with the relationships between pairs of concepts. It can be used to model a domain and to support reasoning about entities.
     Formal definition (after Tom Gruber): an ontology is a formal, explicit specification of a shared conceptualization.
     ● formal: it should be machine readable
     ● explicit: the types of concepts and the constraints on them are explicitly defined
     ● shared: the ontology is agreed upon and accepted by a group
     ● conceptualization: an abstract model that consists of the relevant concepts and the relationships between them
  5. “Semantics” & “Ontology”
     Components of Ontologies
     ● Classes: abstract groups or collections of objects. They may contain individuals, other classes, or a combination of both. Classes can be extensional or intensional, and can subsume or be subsumed.
     ● Attributes: store information that is specific to the object they are attached to, such as its features or characteristics.
     ● Relationships: a relation is an attribute whose value is another object in the ontology. E.g. subsumption relations (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of) and meronym relations (part-of).
     Types of ontologies
     ● Domain ontology (or domain-specific ontology): models a specific domain, or part of the world.
     ● Upper ontology (or foundation ontology): a model of the common objects that are applicable across a range of domain ontologies.
  6. “Semantics” & “Ontology”
     Examples of Popular Ontologies
     ● WordNet: a lexical database for the English language, which superficially resembles a thesaurus. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.
     ● Medical Subject Headings (MeSH): a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings.
  7. “Semantics” & “Ontology”
     The Future: “Semantic Web”, OWL and RDF ...
     ● The Semantic Web is a collaborative movement led by the international standards body W3C. It aims at converting the current web, dominated by semi-structured documents, into an organised “web of data”.
     ● RDF (Resource Description Framework) is part of the W3C family of specifications and can be used as a general method for conceptual description or modeling of information.

     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
       <rdf:Description rdf:about="">
         <dc:title>Tony Benn</dc:title>
         <dc:publisher>Wikipedia</dc:publisher>
       </rdf:Description>
     </rdf:RDF>

     ● OWL is built on top of RDF. It is more expressive and supports greater machine interpretability than RDF.

     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
              xmlns:owl="http://www.w3.org/2002/07/owl#"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
       <owl:Ontology rdf:about="">
         <dc:title>The Example Plant Ontology</dc:title>
         <dc:description>An example ontology</dc:description>
       </owl:Ontology>
       <owl:Class rdf:about="">
         <rdfs:label>The plant type</rdfs:label>
         <rdfs:comment>The class of plant types.</rdfs:comment>
       </owl:Class>
     </rdf:RDF>
  8. Semantic Similarity
     An ontology is just a “structure”, without any weights on the edges. Semantic similarity measures exploit this structural information and try to quantify how similar two concepts in a given ontology are.
     Ontology based semantic measures can be classified as follows:
     ● Path Based Similarity Measures: use the shortest path between two concepts, their generality or specificity, and their relationships with other concepts.
     ● Information Content Based Similarity Measures: associate with each concept a quantity IC, which takes into account the probabilities of concepts in the ontology.
     ● Feature Based Similarity Measures (not discussed here).
  9. Semantic Similarity (Path Based)
     Wu & Palmer Measure:
     $$\mathrm{sim}_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}$$
     ● $N_1$ and $N_2$ are the numbers of IS-A links from $C_1$ and $C_2$, respectively, to their most specific common subsumer concept $C$.
     ● $H$ is the number of IS-A links from $C$ to the root of the ontology.
     ● The measure fits the intuition that concepts with greater depth are more similar (because they are more specific).
     Li Measure:
     $$\mathrm{sim}_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}$$
     ● Li combines the shortest path and the depth information of the ontology in a non-linear function.
     ● $L$ is the length of the shortest path between the two concepts; $\alpha$ and $\beta$ are scaling factors.
     ● $H$ is the same as in the Wu & Palmer measure.
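
The two path based formulas above can be tried out on a small tree. The following is a minimal sketch, not the presenters' code: the taxonomy `PARENT`, the helper names, the default values `alpha=0.2` and `beta=0.6`, and the choice to return 1.0 when both concepts are the root (where the Wu & Palmer formula gives 0/0) are all assumptions made for illustration.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent); "entity" is the root.
PARENT = {
    "software": "entity",
    "hardware": "entity",
    "application": "software",
    "operating_system": "software",
    "freeware": "application",
    "shareware": "application",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root] by following IS-A links."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def links(c1, c2):
    """Return (N1, N2, H) as defined on the slide."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = next(c for c in p1 if c in p2)  # most specific common subsumer
    n1, n2 = p1.index(common), p2.index(common)
    h = len(path_to_root(common)) - 1  # IS-A links from the subsumer to the root
    return n1, n2, h


def wu_palmer(c1, c2):
    n1, n2, h = links(c1, c2)
    if n1 + n2 + 2 * h == 0:
        return 1.0  # assumption: both concepts are the root, formula is 0/0
    return 2 * h / (n1 + n2 + 2 * h)


def li(c1, c2, alpha=0.2, beta=0.6):
    n1, n2, h = links(c1, c2)
    length = n1 + n2  # in a tree, the shortest path L passes through the subsumer
    depth_term = (math.exp(beta * h) - math.exp(-beta * h)) / (
        math.exp(beta * h) + math.exp(-beta * h)
    )
    return math.exp(-alpha * length) * depth_term


print(wu_palmer("freeware", "shareware"))
print(wu_palmer("freeware", "hardware"))
print(li("freeware", "shareware"), li("freeware", "operating_system"))
```

In this toy tree, siblings under a deep subsumer (freeware, shareware) score higher than pairs whose only common subsumer is the root, which is the behaviour both measures are designed to capture.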
  10. Semantic Similarity (Path Based)
      Leacock & Chodorow Measure:
      $$\mathrm{sim}_{L\&C}(C_1, C_2) = -\log \frac{L}{2H}$$
      ● This is close to the Wu & Palmer measure, except that it applies logarithmic smoothing and keeps only the depth term $2H$ in the denominator, with the path length $L$ in the numerator.
      ● As in the Li measure, $L$ is the length of the shortest path between concepts $C_1$ and $C_2$.
      ● $H$ is the number of IS-A links from $C$ to the root of the ontology. (In Leacock and Chodorow's original formulation, the denominator is twice the maximum depth of the taxonomy.)
      Mao Measure:
      $$\mathrm{sim}_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2\bigl(1 + d(C_1) + d(C_2)\bigr)}$$
      ● The Mao measure considers the generality of the concepts by taking into account their number of descendants.
      ● $L$ is the length of the shortest path between the two concepts, and $d(C)$ is the number of descendants of $C$.
      ● $\delta$ is a constant (usually chosen as 0.9).
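
As a quick check of the two formulas on this slide, here is a minimal sketch that takes the quantities L, H and d(C) directly as numbers. The function names, the use of the natural logarithm for Leacock & Chodorow, and the decision to raise an error where a formula is undefined (for example two leaf concepts in the Mao measure, where the logarithm is zero) are assumptions of this sketch, not something the slides specify.

```python
import math


def leacock_chodorow(path_length, depth):
    """-log(L / 2H) as written on the slide; natural log is an assumption."""
    if path_length <= 0 or depth <= 0:
        raise ValueError("the formula is undefined for L = 0 or H = 0")
    return -math.log(path_length / (2 * depth))


def mao(path_length, descendants_1, descendants_2, delta=0.9):
    """delta / (L * log2(1 + d(C1) + d(C2)))."""
    denominator = path_length * math.log2(1 + descendants_1 + descendants_2)
    if denominator == 0:
        raise ValueError("the formula is undefined for L = 0 or two leaf concepts")
    return delta / denominator


# Illustrative numbers only: a path of 2 links, subsumer at depth 2,
# and concepts with 1 and 2 descendants.
print(leacock_chodorow(path_length=2, depth=2))
print(mao(path_length=2, descendants_1=1, descendants_2=2))
```

Note how the Mao score drops as the concepts gain descendants: more general concepts are treated as less similar for the same path length.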
  11. Semantic Similarity (IC Based)
      The intuition behind information content is that more frequent terms are more general and hence provide less “information”:
      $$IC(C) = -\log p(C) = -\log \frac{\mathrm{freq}(C)}{\mathrm{freq}(\mathrm{root})}$$
      ● $\mathrm{freq}(C)$ is the frequency of concept $C$, and $\mathrm{freq}(\mathrm{root})$ is the frequency of the root concept of the ontology.
      ● The frequency of a concept includes the frequencies of the concepts it subsumes in the IS-A hierarchy.
      ● A concept $C$ is the most informative subsumer of two concepts $C_1$ and $C_2$ if it has the least probability among all their shared subsumers (and is thus the most informative). Its information content is written $IC_{mis}(C_1, C_2)$.
      Resnik Measure:
      $$\mathrm{sim}_{Resnik}(C_1, C_2) = IC_{mis}(C_1, C_2)$$
      The more information two terms share, the more similar they are.
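
The definitions on this slide can be followed step by step on a toy hierarchy. This is a minimal sketch under stated assumptions: the taxonomy `PARENT`, the corpus counts in `OWN_COUNT`, and all function names are invented for illustration and do not come from WordNet or MeSH.

```python
import math

ROOT = "entity"

# Hypothetical IS-A taxonomy (child -> parent).
PARENT = {
    "software": "entity",
    "hardware": "entity",
    "application": "software",
    "freeware": "application",
    "shareware": "application",
}

# Hypothetical corpus counts of each concept's own occurrences.
OWN_COUNT = {
    "entity": 0,
    "software": 4,
    "hardware": 8,
    "application": 2,
    "freeware": 1,
    "shareware": 1,
}


def subsumers(concept):
    """The concept itself plus all of its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def freq(concept):
    """Own count plus the counts of every concept it subsumes."""
    return sum(n for c, n in OWN_COUNT.items() if concept in subsumers(c))


def ic(concept):
    """IC(C) = -log(freq(C) / freq(root))."""
    return -math.log(freq(concept) / freq(ROOT))


def resnik(c1, c2):
    """IC of the most informative shared subsumer."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    return max(ic(c) for c in shared)


print(ic("software"), ic("freeware"))
print(resnik("freeware", "shareware"))
```

With these counts, "application" is the most informative subsumer of freeware and shareware, while freeware and hardware share only the root, whose information content is zero.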
  12. Semantic Similarity (IC Based)
      Jiang Measure:
      $$\mathrm{dist}_{Jiang}(C_1, C_2) = IC(C_1) + IC(C_2) - 2\,IC_{mis}(C_1, C_2)$$
      ● The Jiang measure considers the information content of each term in addition to the shared information content.
      ● It is an inverted measurement (a distance): the distance between two concepts is the amount of information needed to fully describe both concepts, excluding the information that is common to both of them.
      Lin Measure:
      $$\mathrm{sim}_{Lin}(C_1, C_2) = \frac{2\,IC_{mis}(C_1, C_2)}{IC(C_1) + IC(C_2)}$$
      ● The Lin measure also considers the information content of each term, but uses it differently from Jiang: it takes a ratio instead of a difference.
      ● Since $IC_{mis}(C_1, C_2) \le IC(C_1)$ and $IC_{mis}(C_1, C_2) \le IC(C_2)$, the similarity value is normalized between 0 and 1, with 1 reached for identical concepts.
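
To see how the Jiang and Lin measures use the same three quantities differently, here is a minimal sketch that takes the information content values as plain numbers. The function names, the example values, and the choice to return 1.0 from `lin` when both concepts have zero information content are assumptions for illustration.

```python
def jiang_distance(ic_1, ic_2, ic_mis):
    """dist = IC(C1) + IC(C2) - 2 * IC_mis(C1, C2); smaller means more similar."""
    return ic_1 + ic_2 - 2 * ic_mis


def lin(ic_1, ic_2, ic_mis):
    """sim = 2 * IC_mis(C1, C2) / (IC(C1) + IC(C2))."""
    if ic_1 + ic_2 == 0:
        return 1.0  # assumption: both concepts are the root, formula is 0/0
    return 2 * ic_mis / (ic_1 + ic_2)


# Illustrative IC values only.
print(jiang_distance(3.0, 2.0, 1.0), lin(3.0, 2.0, 1.0))

# A concept compared with itself: IC_mis equals its own IC.
print(jiang_distance(2.0, 2.0, 2.0), lin(2.0, 2.0, 2.0))
```

For a concept compared with itself, the Jiang distance is 0 and the Lin similarity is 1, which matches the normalization described on the slide.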
  13. Semantic Similarity
      Correlation with human judgements

      Method        Type   WordNet Ontology   MeSH Ontology
      Wu & Palmer   Path   0.74               0.67
      Li            Path   0.82               0.70
      Leacock       Path   0.82               0.74
      Resnik        IC     0.79               0.71
      Lin           IC     0.82               0.72
      Jiang         IC     0.83               0.71
  14. Information Retrieval
      SSRM: IR with semantics ... (0/3)
      VSM Revisited:
      ● Similarity in VSM is the cosine of the angle between the query and document vectors:
      $$\mathrm{sim}(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}}$$
      ● Each dimension corresponds to a separate term. $q$ and $d$ are n-dimensional vectors with a weight for each term.
      ● $q_i$ and $d_i$ are the weights of term $i$ in the query and in the document.
      ● The document term weight is $d_i = tf_i \cdot idf_i$.
      ● Specifically, I will talk about the SSRM algorithm (Semantic Similarity Retrieval Model), which modifies the query term weights to take semantic similarity into account.
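
The VSM similarity above can be sketched in a few lines, which also shows the lexical mismatch that motivates SSRM. The sparse dictionary representation, the function name `cosine`, and the example weights are assumptions for illustration; the weights stand in for tf-idf values rather than being computed from a corpus.

```python
import math


def cosine(q, d):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)


query = {"software": 1.0}
doc_exact = {"software": 2.0, "free": 1.0}    # contains the query term
doc_synonym = {"program": 2.0, "free": 1.0}   # contains only a related term

print(cosine(query, doc_exact))
print(cosine(query, doc_synonym))
```

The second document scores exactly 0 even though "program" is closely related to "software", because the two terms occupy different dimensions of the vector space.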
  15. Information Retrieval
      SSRM: IR with semantics ... (1/3)
      Query Re-weighting:
      ● A query can contain related (semantically similar) terms, e.g. Query: free scientific computing software
      ● We need to re-weight the query terms to stress the particular concept we are searching for:
      $$q_i' = q_i + \sum_{\substack{j \neq i \\ \mathrm{sim}(i,j) \ge t}} q_j \cdot \mathrm{sim}(i, j)$$
      ● $q_i$ and $q_i'$ are the old and new weights, respectively.
      ● $i$ and $j$ refer to different terms in the query.
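
The re-weighting rule can be sketched as follows. The similarity table `SIM`, the helper `sim`, and the threshold value are hypothetical; in practice the similarities would come from one of the ontology based measures on the earlier slides.

```python
# Hypothetical pairwise term similarities in [0, 1].
SIM = {
    frozenset(("software", "program")): 0.8,
    frozenset(("software", "free")): 0.1,
    frozenset(("program", "free")): 0.1,
}


def sim(a, b):
    """Symmetric lookup; identical terms have similarity 1."""
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)


def reweight(query, t):
    """q_i' = q_i + sum of q_j * sim(i, j) over other query terms with sim >= t."""
    return {
        i: q_i + sum(
            q_j * sim(i, j)
            for j, q_j in query.items()
            if j != i and sim(i, j) >= t
        )
        for i, q_i in query.items()
    }


query = {"free": 1.0, "software": 1.0, "program": 1.0}
print(reweight(query, t=0.5))
```

Here "software" and "program" reinforce each other because their similarity passes the threshold, while "free" keeps its original weight.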
  16. Information Retrieval
      SSRM: IR with semantics ... (2/3)
      Query Expansion:
      ● There may be new terms that are semantically similar to the query terms.
      ● We “expand” the query by adding new terms from the neighbourhood of each query term in the ontology.
      ● Adding such terms also affects the weights of existing terms:
      $$q_i' = \begin{cases}
        \displaystyle\sum_{\substack{j \neq i \\ \mathrm{sim}(i,j) \ge T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ is a new term} \\[2ex]
        q_i + \displaystyle\sum_{\substack{j \neq i \\ \mathrm{sim}(i,j) \ge T}} \frac{q_j}{n} \cdot \mathrm{sim}(i, j) & \text{if } i \text{ had weight } q_i
      \end{cases}$$
      ● $n$ is the number of hyponyms for each expanded term $j$.
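
A minimal sketch of the expansion rule follows. It reads the formula as: j ranges over the original query terms, and n is looked up per term j in a dictionary `n_hyponyms` (assumed to be at least 1 for every query term). That reading, the similarity table, the candidate list, and all names are assumptions for illustration.

```python
# Hypothetical similarities between a query term and its ontology neighbours.
SIM = {
    frozenset(("software", "freeware")): 0.8,
    frozenset(("software", "shareware")): 0.8,
    frozenset(("software", "hardware")): 0.2,
}


def sim(a, b):
    """Symmetric lookup; identical terms have similarity 1."""
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)


def expand_query(query, candidates, n_hyponyms, T):
    """Add candidate terms and adjust weights following the two-case formula."""
    expanded = {}
    terms = list(query) + [c for c in candidates if c not in query]
    for i in terms:
        boost = sum(
            q_j / n_hyponyms[j] * sim(i, j)
            for j, q_j in query.items()
            if j != i and sim(i, j) >= T
        )
        weight = query.get(i, 0.0) + boost  # new terms start from 0
        if weight > 0:
            expanded[i] = weight
    return expanded


query = {"software": 1.0}
candidates = ["freeware", "shareware", "hardware"]
print(expand_query(query, candidates, n_hyponyms={"software": 2}, T=0.7))
```

In this example the two hyponyms enter the query with a reduced weight, and "hardware" is left out because its similarity falls below the threshold T.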
  17. Information Retrieval
      SSRM: IR with semantics ... (3/3)
      Document Similarity:
      ● Once we have the expanded and re-weighted query vector and the tf-idf document vectors, we calculate the similarity between query $q$ and document $d$ as:
      $$\mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{sim}(i, j)}{\sum_i \sum_j q_i \cdot d_j}$$
      ● Properties:
        ● Symmetric.
        ● Normalized in [0, 1] (given term similarities in [0, 1] and non-negative weights).
        ● Consistent behaviour.
        ● Can be easily tweaked for document-document similarity.
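
The document similarity formula can be sketched directly as a double sum over query and document terms. The similarity table, the function names, and the example weights are assumptions for illustration.

```python
# Hypothetical pairwise term similarities in [0, 1].
SIM = {
    frozenset(("software", "program")): 0.8,
    frozenset(("software", "free")): 0.1,
}


def sim(a, b):
    """Symmetric lookup; identical terms have similarity 1."""
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)


def ssrm_similarity(q, d):
    """Sum of q_i * d_j * sim(i, j) over all pairs, divided by sum of q_i * d_j."""
    numerator = sum(
        q_i * d_j * sim(i, j) for i, q_i in q.items() for j, d_j in d.items()
    )
    denominator = sum(q_i * d_j for q_i in q.values() for d_j in d.values())
    return numerator / denominator if denominator else 0.0


query = {"software": 1.0}
doc_synonym = {"program": 2.0, "free": 1.0}
print(ssrm_similarity(query, doc_synonym))
```

The document shares no term with the query, so the cosine measure of VSM would give 0, yet here it scores about 0.57 because "program" is semantically close to "software".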
  18. Information Retrieval
      SSRM: At a glance (diagram slide; image not included in this transcript)
  19. Information Retrieval
      SSRM Implementation Notes:
      ● Quadratic time complexity in the number of terms (every query term is compared with every document term), as opposed to VSM.
      ● The similarity between every pair of terms can be cached in a hash table.
      ● It is expensive to expand and re-weight the document vectors as well, so only the queries are re-weighted and expanded. Expanding one of the two vectors should incorporate enough semantic information.
      ● The thresholds ($t$, $T$) need to be adjusted for optimal behaviour.
      ● Although the behaviour of SSRM is consistent, SSRM does not result in $\mathrm{sim}(d, d) = 1$, i.e. even an exact match does not give a similarity value of 1.
      ● I had proposed the following formula last summer, and the results on MeSH were quite satisfactory:
      $$\mathrm{sim}(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot \mathrm{maxsim}_i}{\sum_i \sum_j q_i \cdot d_j}, \qquad \mathrm{maxsim}_i = \max_j \mathrm{sim}(i, j)$$
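
The difference between plain SSRM and the maxsim variant is easiest to see by comparing a vector with itself. This is a minimal sketch under one reading of the formula, namely that the maximum in maxsim_i is taken over the terms j of the document; that reading, the similarity table, and all names are assumptions. The `lru_cache` decorator illustrates the note about caching pairwise similarities.

```python
from functools import lru_cache

# Hypothetical pairwise term similarities in [0, 1].
SIM = {frozenset(("software", "program")): 0.8}


@lru_cache(maxsize=None)  # cache each pairwise similarity after first use
def sim(a, b):
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.0)


def total_weight(q, d):
    """Shared denominator: sum of q_i * d_j over all term pairs."""
    return sum(q_i * d_j for q_i in q.values() for d_j in d.values())


def ssrm_similarity(q, d):
    numerator = sum(
        q_i * d_j * sim(i, j) for i, q_i in q.items() for j, d_j in d.items()
    )
    return numerator / total_weight(q, d)


def maxsim_similarity(q, d):
    # maxsim_i: best match for query term i among the document's terms.
    maxsim = {i: max(sim(i, j) for j in d) for i in q}
    numerator = sum(
        q_i * d_j * maxsim[i] for i, q_i in q.items() for j, d_j in d.items()
    )
    return numerator / total_weight(q, d)


doc = {"software": 1.0, "program": 1.0}
print(ssrm_similarity(doc, doc))
print(maxsim_similarity(doc, doc))
```

For this two-term vector, plain SSRM gives 0.9 for the vector against itself, because the cross pairs (software, program) pull the average down, while the maxsim variant under this reading gives 1.0.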
  20. Experimental Results
      ● IR on OHSUMED using MeSH
      ● IR on the web using WordNet
  21. Future ...
      Possible Issues
      ● Negation. Query: I like pizza. Match: I don't like pizza.
      ● Antonymy. Query: Slow runner. Match: Fast runner.
      ● Role Reversal. Query: Dog bites man. Match: Man bites dog.
      Further reading
      ● Groupwise Semantic Similarity
        ● Jaccard Index
        ● simLP, simUI, simGIC
      ● Statistical Semantic Similarity
        ● LSA: Latent Semantic Analysis
        ● NGD: Normalized Google Distance
        ● PMI: Pointwise Mutual Information
  22. References
      ● A comparative study of ontology based term similarity measures on PubMed Document Clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou] [2007].
      ● Information retrieval by semantic similarity [A. Hliaoutakis, G. Varelas, E. Voutsakis] [2006].
      ● Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy [Jay J. Jiang, David W. Conrath] [1997].
      Thank you!