Use of ontologies in natural language processing
Transcript

  • 1. Use of Ontologies in Natural Language Processing. Athman Hajhamou, Computer and Modeling Laboratory, USMBA, FSDM, Fès
  • 2. Summary: Limitations of classical approaches; Use of ontology; State of the art.
  • 3. Limitations of classical approaches: The huge number of documents available on the Web makes finding relevant ones a challenging task. Full-text search, still the most popular form of search and the one offered by the most widely used services such as Google, is very useful for retrieving documents, but it is normally not suitable for finding relevant documents on a specific topic that the user has not yet seen.
  • 4. Limitations of classical approaches: The major reasons why purely text-based search fails to find some of the relevant documents are the following. Vagueness of natural language: synonyms, homographs and inflected word forms can all fool algorithms that see search terms only as sequences of characters. High-level, vague concepts: vaguely defined abstract concepts such as the Kosovo conflict, the Industrial Revolution or the Iraq War are often not mentioned explicitly in relevant documents, so present search engines cannot find those documents.
  • 5. Limitations of classical approaches: Semantic relations, such as the partOf relation, cannot be exploited. For example, users who search for the Great Maghreb will not find relevant documents that mention only Rabat or Morocco. Time dimension: keyword matching is not adequate for handling time specifications. If we search for documents about the “XX century” using exactly this phrase, relevant resources containing character sequences such as 1945 or 1956 will not be found by simple keyword matching.
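The two failure cases on this slide can be illustrated with a minimal Python sketch. The tiny `PART_OF` table, the sample document and all helper names are hypothetical; they only stand in for real background knowledge.

```python
import re

# Hypothetical partOf facts, child -> parent (illustration only).
PART_OF = {"Rabat": "Morocco", "Morocco": "Great Maghreb"}

def parts_of(region):
    """Return the region plus everything that is transitively partOf it."""
    found = {region}
    changed = True
    while changed:
        changed = False
        for child, parent in PART_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def keyword_match(query, doc):
    """Plain keyword matching: the query must occur literally in the text."""
    return query.lower() in doc.lower()

def semantic_match(region, doc):
    """Match if the document mentions the region or any of its parts."""
    return any(keyword_match(name, doc) for name in parts_of(region))

def mentions_20th_century(doc):
    """Treat any four-digit year from 1901 to 2000 as a 20th-century hit."""
    years = (int(y) for y in re.findall(r"\b\d{4}\b", doc))
    return any(1901 <= y <= 2000 for y in years)

doc = "The conference was held in Rabat in 1956."
print(keyword_match("Great Maghreb", doc))   # False
print(semantic_match("Great Maghreb", doc))  # True
print(keyword_match("XX century", doc))      # False
print(mentions_20th_century(doc))            # True
```

The literal matcher misses the document in both cases, while the two knowledge-aware helpers find it.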
  • 6. Limitations of classical approaches: Although most present systems can successfully handle the various inflected forms of words using stemming algorithms, it seems that the many heuristics and ranking formulas based on text statistics, developed over the last decades of classical IR research, cannot master the other issues mentioned. One reason is that term co-occurrence, which most statistical methods use to measure the strength of the semantic relation between words, is not valid from a linguistic-semantic point of view.
  • 8. Limitations of classical approaches: Besides statistics based on term co-occurrence, another way to improve search effectiveness is to incorporate background knowledge into the search process. So far the IR community has concentrated on background knowledge expressed in the form of thesauri. A thesaurus defines a set of standard terms that can be used to index and search a document collection (a controlled vocabulary) and a set of linguistic relations between those terms. Thesauri thus promise a solution for the vagueness of natural language and, partially, for the problem of high-level concepts.
  • 9. Limitations of classical approaches: While intuitively one would expect significant gains in retrieval effectiveness from the use of thesauri, experience shows that this is usually not the case. One of the major causes is the “noise” in the relations between thesaurus terms. Linguistic relations such as synonymy are normally valid only between specific meanings of two words, but thesauri represent those relations on a syntactic level. Another big problem is that manually creating thesauri and annotating documents with thesaurus terms is very expensive. As a result, annotations are often incomplete or erroneous, which decreases search performance.
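The thesaurus "noise" described on this slide can be made concrete with a small sketch. Both lookup tables and the two sample documents are hypothetical. The first table attaches synonyms to a word string, as a thesaurus does. The second attaches them to a specific meaning.

```python
# Word-level thesaurus (hypothetical): synonyms are attached to the string.
WORD_THESAURUS = {"bank": {"shore", "lender"}}

# Sense-level table (hypothetical): synonyms are attached to one meaning.
SENSE_THESAURUS = {
    ("bank", "finance"): {"lender"},
    ("bank", "river"): {"shore"},
}

def expand_by_word(term):
    """Expand a term with every synonym of the string, whatever its sense."""
    return {term} | WORD_THESAURUS.get(term, set())

def expand_by_sense(term, sense):
    """Expand a term only with synonyms that hold for the given sense."""
    return {term} | SENSE_THESAURUS.get((term, sense), set())

def search(terms, docs):
    """Return ids of documents containing at least one of the terms."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if terms & set(text.lower().split()))

docs = {
    "d1": "the lender raised interest rates",
    "d2": "we walked along the shore",
}
print(search(expand_by_word("bank"), docs))              # ['d1', 'd2']
print(search(expand_by_sense("bank", "finance"), docs))  # ['d1']
```

A user interested in finance gets the unrelated document `d2` from the word-level expansion, which is the kind of noise the slide refers to.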
  • 10. Use of Ontology Ontologies form the basic infrastructure of the Semantic Web. As ontology we consider any formalism with a well-defined mathematical interpretation which is capable at least to represent a subconcept taxonomy, concept instances and user defined relations between concepts. 10
  • 11. Use of Ontology Such formalisms allow a much more sophisticated representation of background knowledge than classical thesauri. They represent knowledge on the semantic level, i.e., they contain semantic entities(concepts, relations and instances) instead of simple words, which eliminates the mentioned noise from the relations. They allow specifying custom semantic relations between entities, and also to store well-known facts and axioms about a knowledge domain (including temporal information). 11
  • 12. Use of Ontology Based on that, ontologies theoretically solve all of the mentioned problems of full text search. Unfortunately, ontologies and semantic annotations using them are hardly ever perfect for the same reasons that were described at thesauri. Indeed, presently good quality ontologies and semantic annotations are a very scarce resource. 12
  • 13. State of the Art: Ontologies as Background Knowledge to Explore Document Collections. Nathalie Aussenac-Gilles & Josiane Mothe, Institut de Recherche en Informatique de Toulouse
  • 14. Ontologies as Background Knowledge to Explore Document Collections: An alternative way to go beyond bags of words is to organise indexing terms into a more complex structure than "bags", such as a hierarchy or an ontology. Texts would then be indexed by concepts that reflect their meaning rather than by words treated as character lists, with all the ambiguity that they convey. Nathalie A. & Josiane M. promote an approach in which information search and exploration take place in a domain-dependent semantic context. This context is described through its controlled vocabulary, organized along hierarchies that are all extracted from a single, unifying domain ontology. Each hierarchy reveals a given point of view on the domain, that is to say a dimension.
  • 15. Ontologies as Background Knowledge to Explore Document Collections: In this approach, the ontology and the derived hierarchies provide the query language for users. Users can browse the concept hierarchies and select the terms they want to add to a query, and they can also explore the information space from different points of view, through the domain vocabulary and its structure. Given a domain, a user defines his or her own information space, composed of a selection of hierarchies or dimensions from the set of possible ones. This selection depicts the user's focus of interest and leads to the identification of the associated documents.
  • 16. Ontologies as Background Knowledge to Explore Document Collections: Dimensions and their visualization define a novel way to provide users with global views and knowledge of the document collection. A key component of this approach is that the domain ontology makes it possible to define a visual presentation of the entire collection, or of a sub-collection, based on multi-dimensional analysis, as is done in OLAP systems.
  • 17. Ontologies as Background Knowledge to Explore Document Collections
  • 18. Ontologies as Background Knowledge to Explore Document Collections
  • 19. Ontologies as Background Knowledge to Explore Document Collections: Strengths: with the help of the ontology, users should be able to express their needs more easily; documents can be seen under many dimensions (or points of view), which could be used to extract knowledge from their content; for the document categorization task, a concept from an ontology can be viewed as a category. Weaknesses: building an ontology is a complex and time-consuming task that experts (domain and ontology experts) often do manually; the evolution of domain knowledge is problematic, for example when new terms appear and other terms are no longer used.
  • 20. State of the Art: Ontological Profiles as Semantic Domain Representations. Geir Solskinnsbakk & Jon Atle Gulla, Norwegian University of Science and Technology
  • 21. Ontological Profiles as Semantic Domain Representations: Ontologies for query disambiguation or reformulation seem more promising, though there is a fundamental problem in comparing ontology concepts with query or document terms. Concepts are abstract notions that are not necessarily linked to a particular term. Sometimes a number of terms refer to the same concept, and sometimes a specific term is a realization of different concepts depending on the context. Using conceptual structures to index or retrieve document text requires something that bridges the conceptual and the real world. Research indicates that ontologies are of little use if they are not aligned with the documents indexed by the search application.
  • 22. Ontological Profiles as Semantic Domain Representations: Geir S. & Jon A. G. present an ontology enrichment approach that both bridges the conceptual and the real world and ensures that the ontology is well adapted to the documents at hand. The idea is to provide contextual concept characterizations that reveal how the concepts are referred to semantically in the document collection.
  • 23. Ontological Profiles as Semantic Domain Representations: An ontological profile is an extension of a domain ontology. The ontology is extended with semantically related terms, which are added as a vector for each concept of the ontology. In the ontological profile, each concept is therefore associated with a vector of semantically related terms (a concept vector). The terms are given weights that reflect the importance of the semantic relation between the concept and each term. The concept vectors typically contain terms that are synonyms of the concept.
  • 24. Ontological Profiles as Semantic Domain Representations
  • 25. Ontological Profiles as Semantic Domain Representations: The construction of these ontological profiles is based on three aspects of the content of the documents used. First, we apply statistical techniques, counting the frequency of the terms in the documents. Terms that co-occur with a concept more frequently are hypothesized to be more relevant to the concept than terms that co-occur less frequently. Second, we apply linguistic techniques, i.e. stemming, to collapse certain terms into a single form. Third, we use a proximity analysis of the text. The assumption behind the proximity analysis is that the closer together terms are found in the text, the more semantically related they are.
  • 26. Ontological Profiles as Semantic Domain Representations
  • 27. Ontological Profiles as Semantic Domain Representations: We give the highest weight to terms found in the same sentence as the concept name phrase (the highest semantic coherence). Terms found in the same paragraph as the concept are given a lower weight than sentence terms and a higher weight than document terms. The basis for the weight calculation is the term frequency of each term found in the relevant documents. Applying the familiar tf*idf score to the frequencies brings us closer to the final representation of the vectors. The idf factor gives more importance to terms that are found in few documents across the document collection.
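  • 28. Ontological Profiles as Semantic Domain Representations: The term frequencies of a concept vector are a weighted sum over the three kinds of assigned text: $vf_{i,j} = a \sum_{k \in D} f_{i,k} + b \sum_{k \in P} f_{i,k} + c \sum_{k \in S} f_{i,k}$, where $vf_{i,j}$ is the term frequency for term i in concept vector j, $f_{i,k}$ is the term frequency for term i in document vector k, D, P, and S are the possibly empty sets of relevant documents, paragraph documents and sentence documents assigned to j, and a = 0.1, b = 1.0, and c = 10.0 are the constant modifiers for documents, paragraph documents, and sentence documents, respectively.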
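  • 29. Ontological Profiles as Semantic Domain Representations: The weights follow the familiar tf*idf form: $tfidf_{i,j} = \frac{vf_{i,j}}{\max_l vf_{l,j}} \cdot \log \frac{N}{n_i}$, where $tfidf_{i,j}$ is the tfidf score for term i in concept vector j, $vf_{i,j}$ is the term frequency for term i in concept vector j, $\max_l vf_{l,j}$ is the frequency of the most frequently occurring term in concept vector j, N is the number of concept vectors, and $n_i$ is the number of concept vectors containing term i.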
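The two weighting steps defined on the last two slides can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Whitespace tokenisation replaces stemming, the natural logarithm is assumed for the idf factor, and the default values of `a`, `b` and `c` are an assumption about the intended modifiers that can be overridden. The function names and the sample texts are hypothetical.

```python
import math
from collections import Counter

def concept_term_frequencies(docs, paragraphs, sentences, a=0.1, b=1.0, c=10.0):
    """Weighted sum of term counts over the texts assigned to one concept.

    docs, paragraphs and sentences play the role of the sets D, P and S;
    a, b and c are the constant modifiers (default values are assumed).
    """
    vf = Counter()
    for weight, texts in ((a, docs), (b, paragraphs), (c, sentences)):
        for text in texts:
            for term, count in Counter(text.lower().split()).items():
                vf[term] += weight * count
    return vf

def tfidf_vectors(concept_vectors):
    """Map {concept: {term: frequency}} to {concept: {term: tfidf score}}."""
    n_vectors = len(concept_vectors)               # N
    containing = Counter()                         # n_i for each term
    for vector in concept_vectors.values():
        containing.update(vector.keys())
    scored = {}
    for concept, vector in concept_vectors.items():
        max_freq = max(vector.values(), default=0.0)
        scored[concept] = {
            term: (freq / max_freq) * math.log(n_vectors / containing[term])
            for term, freq in vector.items()
        } if max_freq else {}
    return scored

vectors = {
    "well": concept_term_frequencies(
        docs=["oil well drilling oil"], paragraphs=["oil well"], sentences=["oil"]),
    "pipeline": concept_term_frequencies(
        docs=["oil pipeline pump"], paragraphs=[], sentences=[]),
}
scores = tfidf_vectors(vectors)
print(scores["well"]["oil"])  # 0.0
```

In the toy data "oil" occurs in both concept vectors, so its idf factor is log(2/2) = 0 and it carries no weight, while "pump" occurs only in the "pipeline" vector and keeps a positive score.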
  • 30. Ontological Profiles as Semantic Domain Representations: Strengths: the approach based on ontological profiles is used as a tool for the semantic reformulation of queries on top of a standard vector-space search engine (Apache Lucene), with the reformulated query issued against the index. This lets the system hide from the user the fact that an ontology is used; the user only has to enter familiar keyword queries. Weaknesses: the concept name is treated as a phrase query into the three indexes, and all documents containing the phrase are assigned to the concept as relevant. Using the concept name as a phrase query poses a challenge: some concept names are artificial in their construction or are not used in the documents in the form given in the ontology. As a result, many concepts are not found when documents are assigned to concepts.
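A query reformulation step of this kind might look like the sketch below. The slides do not say how query keywords are mapped to concepts, so the exact-name match used here is an assumption, as are the function name, the `top_k` cut-off and the sample vector. The output uses the `term^boost` notation of the Lucene classic query syntax. Multi-word terms would additionally need quoting.

```python
def reformulate(query, concept_vectors, top_k=3):
    """Expand a keyword query with the top-weighted terms of matching concepts.

    A keyword matches a concept when it equals the concept name (assumption).
    Returns a string in `term^boost` form.
    """
    boosts = {}
    for word in query.lower().split():
        # Original keywords keep a boost of at least 1.0.
        boosts[word] = max(boosts.get(word, 0.0), 1.0)
        for concept, vector in concept_vectors.items():
            if concept.lower() != word:
                continue
            best = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for term, weight in best:
                boosts[term] = max(boosts.get(term, 0.0), weight)
    return " ".join(f"{term}^{weight:.2f}" for term, weight in boosts.items())

# Hypothetical concept vector with already computed weights.
concept_vectors = {"well": {"drilling": 0.8, "borehole": 0.6, "rig": 0.2, "oil": 0.1}}
print(reformulate("well cost", concept_vectors, top_k=2))
# well^1.00 drilling^0.80 borehole^0.60 cost^1.00
```

The user still types a plain keyword query; the expansion happens behind the scenes, which matches the strength listed on the slide.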
  • 31. State of the Art: An Ontology-Based Information Retrieval Model. David Vallet, Miriam Fernández & Pablo Castells, Universidad Autónoma de Madrid
  • 32. An Ontology-Based Information Retrieval Model: David V., Miriam F. & Pablo C. propose an ontology-based retrieval model meant for the exploitation of full-fledged domain ontologies and knowledge bases (KBs) to support semantic search in document repositories. In contrast to boolean semantic search systems, this approach returns full documents, rather than specific ontology values from a KB, in response to user information needs. The search system takes advantage of both the detailed instance-level knowledge available in the KB and topic taxonomies for classification. The approach includes an ontology-based scheme for the semi-automatic annotation of documents and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm and a ranking algorithm.
  • 34. An Ontology-Based Information Retrieval Model: The system requires that the knowledge base be constructed from three main base classes: DomainConcept, Taxonomy, and Document. DomainConcept should be the root of all domain classes that can be used (directly or after subclassing) to create instances that describe specific entities referred to in the documents. Document is used to create instances that act as proxies for documents from the information source to be searched. Taxonomy is the root for class hierarchies that are used merely as classification schemes and are never instantiated. These taxonomies are expected to be used as a terminology to annotate documents and concept classes, by using them as values of dedicated properties.
  • 35. An Ontology-Based Information Retrieval Model: The predefined base ontology classes described above are complemented with an annotation ontology that provides the basis for the semantic indexing of documents with non-embedded annotations. Documents are annotated with concept instances from the KB by creating instances of the Annotation class, provided for this purpose. Annotation has two relational properties, instance and document, by which concepts and documents are related to each other. Reciprocally, DomainConcept and Document have a multivalued annotation property. Annotations can be created manually by a domain expert, or semi-automatically. The subclasses ManualAnnotation and AutomaticAnnotation are used for these two cases, respectively.
  • 36. An Ontology-Based Information Retrieval Model: DomainConcept instances use a label property to store the most usual text form of the concept class or instance. This property is multivalued, since instances may have several lexical variants. Whenever the label of an instance is found in a document, an annotation is created between the instance and the document. Documents can be annotated with classes as well, by assigning labels to concept classes. The annotations are used by the retrieval and ranking module.
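The annotation scheme of the last three slides can be sketched with a few Python classes. The class names DomainConcept, Document, Annotation and AutomaticAnnotation come from the slides. Everything else is an assumption made for illustration: the field names, the `occurrences` counter and the naive case-insensitive substring matching. The Taxonomy root class and ManualAnnotation are left out to keep the sketch short.

```python
from dataclasses import dataclass, field

@dataclass
class DomainConcept:
    name: str
    labels: list = field(default_factory=list)  # multivalued label property

@dataclass
class Document:
    uri: str
    text: str

@dataclass
class Annotation:
    instance: DomainConcept   # the annotated concept instance
    document: Document        # proxy of the annotated document
    occurrences: int = 0      # hypothetical field: number of label hits

class AutomaticAnnotation(Annotation):
    """Annotation created by label matching rather than by a domain expert."""

def annotate(instances, documents):
    """Create an annotation whenever a label of an instance occurs in a document."""
    annotations = []
    for doc in documents:
        lowered = doc.text.lower()
        for inst in instances:
            hits = sum(lowered.count(label.lower()) for label in inst.labels)
            if hits:
                annotations.append(AutomaticAnnotation(inst, doc, hits))
    return annotations

madrid = DomainConcept("Madrid", ["Madrid"])
uam = DomainConcept("UAM", ["UAM", "Universidad Autónoma de Madrid"])
doc1 = Document("doc1", "The UAM is in Madrid. Madrid is the capital.")

for ann in annotate([madrid, uam], [doc1]):
    print(ann.instance.name, ann.document.uri, ann.occurrences)
# Madrid doc1 2
# UAM doc1 1
```

Substring counting over-matches in real text (a label can occur inside a longer word), so a tokeniser or phrase matcher would be needed in practice.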
  • 37. An Ontology-Based Information Retrieval Model: In the classic vector-space model, keywords appearing in a document are assigned weights reflecting that some words are better at discriminating between documents than others. Similarly, in this approach annotations are assigned a weight that reflects how relevant the instance is considered to be for the meaning of the document. Weights are computed automatically by an adaptation of the TF-IDF algorithm, based on the frequency of occurrence of the instances in each document.
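  • 38. An Ontology-Based Information Retrieval Model: The annotation weight is $w_{ij} = \frac{freq_{ij}}{\max_k freq_{kj}} \cdot \log \frac{N}{n_i}$, where $w_{ij}$ is the weight of instance $I_i$ for document $D_j$, $freq_{ij}$ is the number of occurrences of $I_i$ in $D_j$, $\max_k freq_{kj}$ is the frequency of the most repeated instance in $D_j$, $n_i$ is the number of documents annotated with $I_i$, and N is the total number of documents in the search space.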
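The quantities defined on this slide can be combined in the usual TF-IDF manner, as in the sketch below. The natural logarithm, the function name and the toy occurrence counts are assumptions.

```python
import math

def annotation_weights(freq, n_docs=None):
    """Map {document: {instance: occurrences}} to {document: {instance: weight}}.

    n_docs is N, the number of documents in the search space; it defaults to
    the number of documents in `freq`.
    """
    total = n_docs if n_docs is not None else len(freq)
    annotated = {}  # n_i: number of documents annotated with each instance
    for counts in freq.values():
        for inst in counts:
            annotated[inst] = annotated.get(inst, 0) + 1
    weights = {}
    for doc, counts in freq.items():
        top = max(counts.values(), default=0)  # most repeated instance in doc
        weights[doc] = {
            inst: (f / top) * math.log(total / annotated[inst])
            for inst, f in counts.items()
        } if top else {}
    return weights

freq = {
    "d1": {"Madrid": 2, "UAM": 1},
    "d2": {"Madrid": 1},
    "d3": {"Lisbon": 3},
}
weights = annotation_weights(freq)
print(weights["d1"]["UAM"] > weights["d1"]["Madrid"])  # True
```

In `d1` the instance "UAM" occurs less often than "Madrid" but annotates only one of the three documents, so its weight, 0.5 · log 3, ends up above that of "Madrid", 1.0 · log 1.5.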
  • 39. An Ontology-Based Information Retrieval Model: The system takes a formal RDQL query as input. This query could be generated from a keyword query, a natural language query, a form-based interface where the user explicitly selects ontology classes and enters property values, or a more sophisticated search interface. The RDQL query is executed against the knowledge base, which returns a list of instance tuples that satisfy the query. The documents annotated with these instances are then retrieved, ranked, and presented to the user.
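The retrieval and ranking step can be sketched as follows, under simplifying assumptions. The RDQL execution itself is not shown: the instances it returns are passed in directly. Each returned instance gets a query weight of 1.0, and documents are scored by cosine similarity between this query vector and their annotation weights, in the spirit of the vector-space model. The function name and the sample weights are hypothetical.

```python
import math

def rank(result_instances, doc_weights):
    """Rank documents annotated with the instances returned by the query.

    result_instances: iterable of instance names (query weight 1.0 each, assumed).
    doc_weights: {document: {instance: annotation weight}}.
    Returns [(document, score)] sorted by decreasing cosine similarity.
    """
    query = {inst: 1.0 for inst in result_instances}
    q_norm = math.sqrt(len(query))
    scores = {}
    for doc, weights in doc_weights.items():
        dot = sum(weights.get(inst, 0.0) * q for inst, q in query.items())
        d_norm = math.sqrt(sum(w * w for w in weights.values()))
        if dot > 0 and d_norm > 0:
            scores[doc] = dot / (d_norm * q_norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

doc_weights = {
    "d1": {"Madrid": 0.4, "UAM": 0.5},
    "d2": {"Madrid": 0.4},
    "d3": {"Lisbon": 1.1},
}
for doc, score in rank({"Madrid"}, doc_weights):
    print(doc, round(score, 3))
```

Documents without any of the returned instances, such as `d3` here, are not retrieved at all. Among the retrieved ones, `d2` ranks above `d1` because it is about "Madrid" only.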
  • 40. An Ontology-Based Information Retrieval Model
  • 41. An Ontology-Based Information Retrieval Model: Strengths: better recall when querying for class instances and when using class hierarchies and rules; better precision through query weights and structured semantic queries. Weaknesses: the degree of improvement achieved by this semantic retrieval model depends on the completeness and quality of the ontology, the KB, and the concept labels.
  • 42. State of the Art: Improving information retrieval effectiveness by using domain knowledge stored in ontologies. Gabor Nagypal, University of Karlsruhe, Germany
  • 43. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: The quality of the results that traditional full-text search engines provide is still not optimal for many types of user queries. In particular, the vagueness of natural language, abstract concepts, semantic relations and temporal issues are handled inadequately by full-text search. Ontologies and semantic metadata can provide a solution to these problems. The goal of this thesis is to examine and validate whether and how ontologies can help improve retrieval effectiveness in information systems, considering the inherent imperfection of ontology-based domain models and annotations. The work examines how ontologies can be optimally exploited during the information retrieval process, and proposes a general framework based on ontology-supported semantic metadata generation and ontology-based query expansion.
  • 44. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: This research evaluates the following hypotheses. Ontologies allow domain knowledge to be stored in a much more sophisticated form than thesauri; we therefore assume that a significant gain in retrieval effectiveness can be measured when ontologies are used in IR systems. The better (more precisely) an ontology models the application domain, the greater the gain in retrieval effectiveness. The negative effect of ontology imperfection on search results can be diminished by combining different ontology-based heuristics during the search process. It is well known that there is a trade-off between algorithm complexity and performance, and this also holds for ontologies. Still, the assumption of this approach is that by combining ontologies with traditional IR methods it is possible to provide results with acceptable performance.
  • 45. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: Background knowledge stored in the form of ontologies can be used at practically every step of the IR process. In this work, solutions are therefore provided for ontology-based query extension, ontology-supported query formulation and ontology-supported metadata generation (indexing). This leads to a conceptual system architecture in which the Ontology Manager component has a central role and is used extensively by the Indexer, Search Engine and GUI components.
  • 46. Improving information retrieval effectiveness by using domain knowledge stored in ontologies
  • 47. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: The information model defines how documents and the user query are represented in the system. The model used in this work represents the content of a resource as a weighted set of instances (a bag of ontology instances) from a suitable domain ontology (the conceptual part), together with a weighted set of temporal intervals (the temporal part). The representation of the conceptual part is practically identical to the information model used by classical IR engines built on the vector space model, with the difference that the vector terms are ontology instances instead of words in a natural language.
  • 48. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: Time, as a continuous phenomenon, has different characteristics from the discrete conceptual part of the information model. The first question concerning time is how to define similarity among weighted sets of time intervals. A possible solution under consideration is the temporal vector space model. The main idea of the model is that, if we choose a discrete time representation, the granules at the lowest level can be viewed as terms, and the vector space model then also applies to the time dimension.
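The idea of treating time granules as vector terms can be sketched as follows. Years are chosen as granules and every granule gets the same weight; both choices, like the function names and the sample intervals, are assumptions made for illustration.

```python
import math

def year_granules(start, end, weight=1.0):
    """Represent the closed interval [start, end] as {year: weight} terms."""
    return {year: weight for year in range(start, end + 1)}

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

query = year_granules(1901, 2000)   # "XX century"
doc_a = year_granules(1945, 1956)   # document about 1945 to 1956
doc_b = year_granules(1789, 1799)   # document about 1789 to 1799

print(len(query))                   # 100
print(cosine(query, doc_a) > 0)     # True
print(cosine(query, doc_b))         # 0.0
```

The overlapping interval gets a positive similarity without any keyword in common, and the number of terms per interval grows with the length of the interval and the fineness of the granules.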
  • 49. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: During query formulation we use the ontology only to disambiguate queries specified in textual form. Classical full-text search is run on the ontology labels, so users only have to choose the proper interpretation of each term. Query processing applies various ontology-based heuristics one by one to create separate queries, which are executed independently using a traditional full-text engine. The ranked results are then combined to form the final ranked result list. The combination of results is based on the belief network model, which allows various pieces of evidence to be combined using Bayesian inference.
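The slides do not give the structure of the belief network, so the sketch below swaps in a much simpler technique to show what combining evidence from several ranked lists can look like: a noisy-OR combination, which treats each heuristic's score as an independent probability that the document is relevant. The function name and the sample scores are hypothetical, and the scores are assumed to be normalised to the range [0, 1].

```python
def combine_noisy_or(result_lists):
    """Combine per-heuristic scores with a noisy-OR (stand-in for a belief network).

    result_lists: list of {document: score in [0, 1]}, one dict per heuristic.
    Returns [(document, combined belief)] sorted by decreasing belief.
    """
    miss = {}  # probability that no heuristic considers the document relevant
    for results in result_lists:
        for doc, score in results.items():
            miss[doc] = miss.get(doc, 1.0) * (1.0 - score)
    combined = {doc: 1.0 - m for doc, m in miss.items()}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

heuristic_results = [
    {"d1": 0.5, "d2": 0.2},   # e.g. results of the original textual query
    {"d1": 0.5, "d3": 0.6},   # e.g. results of an ontology-expanded query
]
for doc, belief in combine_noisy_or(heuristic_results):
    print(doc, round(belief, 2))
# d1 0.75
# d3 0.6
# d2 0.2
```

A document found by several heuristics, such as `d1`, ends up with a higher belief than any single heuristic assigned to it.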
  • 50. Improving information retrieval effectiveness by using domain knowledge stored in ontologies
  • 51. Improving information retrieval effectiveness by using domain knowledge stored in ontologies: Strengths: this work validates that the proposed solution significantly improves the retrieval effectiveness of information systems, and thus provides strong motivation for developing ontologies and semantic metadata; the gradual approach described allows a smooth transition from classical text-based systems to ontology-based ones. Weaknesses: a problem with the temporal vector space approach is the potentially huge number of time granules generated for long time intervals. For example, to represent the existence time of concepts such as the Middle Ages, potentially many tens of thousands of terms are needed if we use days as granules.