Tools for Ontology Building from Texts: Analysis and Improvement of the Results of Text2Onto



IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 11, Issue 2 (May-Jun. 2013), PP 101-117

Sonam Mittal (1), Nupur Mittal (2)
1 Computer Science, B.K. Birla Institute of Engineering & Technology, Pilani, Rajasthan, India
2 Computer Science, Ecole Polytechnique de l'Université de Nantes, France

Abstract: Building ontologies from texts is a difficult and time-consuming process. Several tools have been developed to facilitate this process. However, these tools are not mature enough to automate all the tasks needed to build a good ontology without human intervention. Among these tools, Text2Onto is one for learning ontologies from textual data. This case study aims at understanding the architecture and working principle of Text2Onto, analyzing the errors that Text2Onto can produce, and finding a solution to reduce human intervention as well as to improve the results of Text2Onto. Three texts of different lengths were used in the experiment. The quality of the Text2Onto results was assessed by comparing the entities extracted by Text2Onto with the ones extracted manually. Some causes of errors produced by Text2Onto were identified as well. As an attempt to improve the results of Text2Onto, its change discovery feature was used: a meta-model of the given text was fed to Text2Onto to obtain a POM, on top of which an ontology was built for the existing text. The meta-model ontology was intended to identify all the core concepts and relations, as done in the manual ontology, and the ultimate objective was to improve the hierarchy of the ontology. The use of a meta-model should help to better classify the concepts under the various core concepts.

Keywords: Ontology, Text2Onto

I. Introduction

In the current scenario, the use of domain ontologies has been increasing. To build such domain ontologies, the general method is to extract the ontology from textual resources. This involves processing a huge amount of text, which makes it a difficult and time-consuming task. In order to expedite the process and support ontologists in the different phases of the ontology building process, several tools based on linguistic or statistical techniques have been developed. However, these tools are not fully automated yet. Human intervention is required at some phases to validate the tools' results so as to produce a good outcome. Such human intervention is not only time-consuming but also error-prone. Therefore, minimizing the human effort needed for error correction is key to enhancing these tools.

Text2Onto is a framework for learning ontologies from textual data. It can extract different ontology components such as concepts, relations, instances, and hierarchies from documents. It also gives statistical values that help to understand the importance of those components in the text. However, users have to verify its results. We therefore studied this tool in order to assess how relevant its results are and to check whether they can be improved. For this purpose, first of all, the architecture and working principles of Text2Onto were studied. Then we performed some experiments. To assess the results, we mainly considered concepts, instances, and relations. We also observed the taxonomy; however, the detailed study revolved around these three components.

II. Literature Review

This section gives a brief overview of ontology and ontology building processes and sums up the papers [1], [3], [4], [5], [6], [7].
2.1 Ontology

An ontology is an explicit, formal (i.e., machine-readable) specification of a shared (accepted by a group or community) conceptualization of a domain of interest [2]. It should be restricted to a given domain of interest and therefore model concepts and relations that are relevant to a particular task or application domain. Ontologies are built to be reused or shared anytime, anywhere, and independently of the behavior and domain of the application that uses them. The process of instantiating a knowledge base is referred to as ontology population, whereas automatic support in ontology development is usually referred to as ontology learning. Ontology learning is concerned with knowledge acquisition.
2.2 Ontology life cycle

The ontology development process refers to the activities carried out to build ontologies from scratch [1]. In order to start the ontology development process, there is a need to plan the activities to be carried out and the resources to be used for them. Thus an ontology specification document is prepared in order to record the requirements and specifications of the ontology development process. The process of ontology building starts with the conceptualization of the acquired knowledge in a conceptual model in order to describe the problem and its solution with the help of intermediate representations. Next, the conceptual models are formalized into formal or semi-formal models using frame-oriented or Description Logic (DL) representation systems. The next step is to integrate the current ontology with existing ontologies. Though this is an optional step, reusing existing ontologies should be considered in order to avoid duplicating the effort of building them. After this, the ontology is implemented in a formal language like OWL or RDF. Once the ontology is implemented, it is evaluated to make a technical judgment with respect to a frame of reference. The ontology should also be documented to the best possible extent. Finally, effort is put into maintaining and updating the ontology. There are various ways to sequence these activities; the most common are the waterfall life cycle and the incremental life cycle.

III. Methontology

Methontology [1] is a well-structured methodology used to build ontologies from scratch. It follows a certain number of well-defined steps to guide the ontology development process. Methontology follows the order of specification, knowledge acquisition, conceptualization, implementation, evaluation, and documentation activities in order to carry out the ontology development process. It also identifies management activities like scheduling, control, and quality assurance, and some support activities like integration and evaluation.

3.1 Specification

The first phase according to Methontology is specification, where an ontology specification document, a formal or semi-formal document written in natural language (NL), records information such as the purpose of the ontology, the level of formality implemented in the ontology, the scope of the ontology, and the sources of knowledge. A good design of this document is one where every term is relevant, the document has partial completeness, and the consistency of all terms is ensured.

3.2 Knowledge Acquisition

The specification is followed by knowledge acquisition, an independent activity performed using techniques like brainstorming, interviews, formal questions, non-structured interviews, informal text analysis, formal text analysis, structured interviews, and knowledge acquisition tools.

3.3 Conceptualization

The next step is structuring the domain knowledge in a conceptual model. This is the conceptualization step, where a glossary of terms is built, relations are identified, the taxonomy is defined, the data dictionary is implemented, and tables of rules and formulas are made. The data dictionary describes and gathers all the useful and potentially usable domain concepts, their meanings, attributes, instances, etc. The table of instance attributes provides information about each attribute and about its values at the instance level.
Thus the result of this phase of Methontology is a conceptual model expressed as a set of well-defined deliverables which allow one to assess the usefulness of the ontology and to compare its scope and completeness with those of other ontologies.

3.4 Integration

Integration is an optional step that is used to accelerate the process of building an ontology by merging various already existing related ontologies. This leads to an inspection of the meta-ontologies and then to finding the best-suited libraries to provide term definitions. As a result, Methontology produces an integration document summarizing the meta-ontology, the names of the terms to be used from the conceptual model, and the name of the ontology from which each corresponding definition is taken. Methontology highly recommends the use of already existing ontologies.

3.5 Implementation

Implementation of the ontology is done using a formal language and an ontology development environment that incorporates a lexical and syntactic analyzer so as to avoid lexical and syntactic errors.
3.6 Evaluation

Once the ontology has been implemented, it is judged technically, which results in a small evaluation document describing the methods used to evaluate the ontology.

3.7 Documentation

Documentation should be carried out during all the above steps. It is the summing up of the steps, procedures, and results of each step in a written document.

IV. Ontology Learning Layers

Different aspects of Ontology Learning (OL) have been presented in the form of a stack in [6]. OL involves the processing of the different layers of this stack. It follows an order of identifying the terms (linguistic realizations of domain-specific concepts), finding their synonyms, categorizing them as concepts, defining concept hierarchies and relations, and describing rules in order to constrain the concepts. The different ontology components and the methods for extracting them are explained in detail in the following sections.

V. Ontology modeling components

Methontology conceptualizes ontologies with tabular and graphical intermediate representations (IRs). The components of such IRs are: concepts, relations between the concepts of the domain, instances (specializations of concepts), constants, attributes (properties of concepts in general and of instances in particular), and formal axioms and rules specified in formal or semi-formal notation using DL. These components are used to conceptualize the ontologies by performing certain tasks as proposed by Methontology.

5.1 Term

Terms are linguistic realizations of domain-specific concepts. Term extraction is a mandatory step for all aspects of ontology learning from text. The methods for term extraction are based on information retrieval, NLP research, and term indexing. The state of the art is mostly to run a part-of-speech tagger over the domain corpus and then to manually verify the terms, hence constructing ad-hoc patterns. In order to automatically identify only relevant terms, a statistical processing step can be used that compares the distribution of terms between corpora.

5.2 Synonym

Finding synonyms allows the acquisition of semantic term variants within and between languages and hence helps in term translation. The main implementation is to integrate WordNet for obtaining English synonyms. This requires word sense disambiguation algorithms to identify the synonyms according to the meaning of the word in the phrase. Clustering and related techniques can be another alternative for dynamic acquisition. Two main approaches [6] are:
1. Harris' distributional hypothesis: terms are similar in meaning to the extent to which they share syntactic contexts.
2. Statistical information measures defined over the Web.

5.3 Concept

The identification of a concept should aim to provide:
1. A definition of the concept.
2. The set of concept instances, i.e., its extension.
3. A set of linguistic realizations of the concept.
Intensional concept learning includes the extraction of formal and informal definitions. An informal definition can be a textual description, whereas a formal definition includes the extraction of concept properties and relations with other concepts. The OntoLearn system can be used for this purpose.

5.4 Taxonomy

Three main factors are exploited to induce taxonomies (a toy illustration of the first factor is sketched below):
1. Application of lexico-syntactic patterns to detect hyponymy relations.
2. Context of synonym extraction and term clustering, mainly using hierarchical clustering.
3. Document-based notions of term subsumption.
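As a rough illustration of the first factor, the classic Hearst patterns match phrasings such as "NP such as NP1, NP2" to propose hyponym pairs. The following is a minimal sketch under simplifying assumptions (plain-text input, a single pattern, single-word noun phrases); it is not the pattern set used by Text2Onto:

import re

# One classic Hearst pattern: "<hypernym> such as <hypo1>, <hypo2>, ...".
# Noun phrases are naively approximated by single words; a real extractor
# would use the noun phrases produced by a chunker such as GATE's.
WORD = r"[A-Za-z][\w-]*"
PATTERN = re.compile(rf"({WORD}) such as ({WORD}(?:, {WORD})*)")

def hearst_hyponyms(text):
    """Yield (hyponym, hypernym) candidate pairs for one pattern."""
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        for hyponym in m.group(2).split(", "):
            yield hyponym, hypernym

sentence = "The POM can be exported to languages such as OWL, RDFS, F-logic."
for hypo, hyper in hearst_hyponyms(sentence):
    print(hypo, "is-a", hyper)   # e.g. "OWL is-a languages"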
5.5 Relation

Relations represent a type of association between concepts of the domain. Text mining using statistical analysis, with more or less complex levels of linguistic analysis, is used for extracting relations.
Relation extraction is similar to the problem of acquiring selectional restrictions for verb arguments in NLP. The Automatic Content Extraction program is one program used for this purpose.

5.6 Rule

Rules are used to infer knowledge in the ontology. An important factor for rule extraction is learning lexical entailment for application in question answering systems.

5.7 Formal Axiom

Formal axioms are logical expressions that are always true and are used as constraints in the ontology. The ontologist must identify the formal axioms needed in the ontology and describe them precisely. Information like the name, a natural language description, and the logic expression should be identified for each formal axiom.

5.8 Instance

Relevant instances must be identified from the concept dictionary in an instance table. An NL tagger can be used to identify the proper nouns and hence the instances.

5.9 Constant

Constants are numeric values that do not change over time.

5.10 Attribute

Attributes describe the properties of instances and concepts. They can be instance attributes or class attributes accordingly. Ontology development tools usually provide predefined domain-independent class attributes for all the concepts.

VI. Ontology tools and frameworks

Several tools and frameworks have been developed to aid the ontologist in different steps of ontology building. Different tools are available for extracting ontology components from different kinds of sources like text, semi-structured text, dictionaries, etc. The scope of these tools varies from basic linguistic processing like term extraction and tagging to guiding the whole ontology building process. Some of these ontology tools and frameworks are discussed in the following sections. As the scope of this study is limited to Text2Onto, we discuss it in detail; other tools are presented briefly.

VII. Text2Onto

Text2Onto [7] is a framework for learning ontologies from textual data. It is a redesign of TextToOnto and is based on the Probabilistic Ontology Model (POM), which stores the learned primitives independently of a specific Knowledge Representation (KR) language. It calculates a confidence value for each learned object for better user interaction. It also updates the learned knowledge each time the corpus changes and avoids processing it from scratch. It allows for easy combination and execution of algorithms as well as for writing new algorithms.

7.1 Architecture and Workflow

The main components of Text2Onto are the algorithms, an algorithm controller, and the POM. The learning algorithms are initialized by a controller which triggers the linguistic preprocessing of the data. Text2Onto depends on the output of GATE. During preprocessing, it calls GATE applications to
i. tokenize the document (identifying words, spaces, tabs, punctuation marks, etc.),
ii. split sentences,
iii. tag parts of speech, and
iv. match JAPE patterns to find noun/verb phrases.
The algorithms then use the results of these applications. GATE stores the results in an object called an Annotation Set, which is a set of Annotation objects. An Annotation object stores the following information:
a. id - unique id assigned to the token/element
b. type - type of the element (Token, SpaceToken, Sentence, Noun, Verb, etc.)
c. features - a map of various information, such as whether the element is a stopword and the category (or tag) of the element (e.g., NN)
d. start offset - starting position of the element
e. end offset - ending position of the element
Text2Onto uses the 'type' property to filter the required entity and then uses the start and end offsets to find the actual word. For example, suppose our corpus begins with the following line:
Ontology evaluation is a critical task...
Then the information about the word 'task' is stored in an Annotation object with type 'Token', category 'NN', start offset 34 and end offset 38. Text2Onto uses the offset values to retrieve the exact word again.
After preprocessing the corpus, the controller executes the ontology learning algorithms in the appropriate order and applies the algorithms' change requests to the POM. The execution of the algorithms takes place in three phases: a notification phase, a computation phase, and a result generation phase. In the first phase, the algorithm learns about recent changes to the corpus. In the second phase, these changes are mapped to changes with respect to the reference repository. Finally, requests for POM changes are generated from the updated content of the reference repository. Text2Onto includes a Modeling Primitive Library (MPL) which makes the modeling primitives ontology-language independent.

7.2 POM

The POM (Probabilistic Ontology Model, also called Preliminary Ontology Model) is the basic building block of Text2Onto. It is an extensible collection of modeling primitives for different types of ontology elements or axioms, and it uses confidence and relevance annotations for capturing uncertainty. It is KR-language independent and can thus be transformed into any reasonably expressive knowledge representation language such as OWL, RDFS, or F-logic. The modeling primitives used in Text2Onto are as follows:
i. concepts (CLASS)
ii. concept inheritance (SUBCLASS-OF)
iii. concept instantiation (INSTANCE-OF)
iv. properties/relations (RELATION)
v. domain and range restrictions (DOMAIN/RANGE)
vi. mereological relations
vii. equivalence
The POM is traceable because, for each object, it also stores a pointer to those parts of the document from which the object was derived. It also allows multiple modeling alternatives to be maintained in parallel. Adding new primitives does not imply changing the underlying framework, which makes it flexible and extensible.

7.3 Data-driven Change Discovery

An important feature of Text2Onto is data-driven change discovery, which prevents the whole corpus from being processed from scratch each time it changes. When there are changes in the corpus, Text2Onto detects them and calculates POM deltas with respect to the changes. As the POM is extensible, Text2Onto modifies the POM without recalculating it for the whole document collection. The benefits of this feature are that document reprocessing time is saved and the evolution of the ontology can be traced.

7.4 Ontology Learning Algorithms/Methods

Text2Onto combines machine learning approaches with basic linguistic approaches for learning an ontology. The different modeling primitives in the POM are instantiated and populated by different algorithms. Before populating the POM, the text documents undergo linguistic preprocessing, which is initiated by the algorithm controller. Basic linguistic preprocessing involves tokenization, sentence splitting, syntactic tagging of all tokens by a POS tagger, and lemmatizing by a morphological analyzer or stemming by a stemmer. A rough sketch of how the resulting annotations can be used to recover the annotated text is given below.
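The following is a minimal sketch of the offset mechanism described in section 7.1, using a simplified Annotation structure; the names are illustrative and do not mirror GATE's actual Java API:

from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Simplified stand-in for a GATE annotation."""
    id: int
    type: str            # e.g. "Token", "SpaceToken", "Sentence"
    start: int           # start offset in the document
    end: int             # end offset in the document
    features: dict = field(default_factory=dict)  # e.g. {"category": "NN"}

def words_of_type(doc, annotations, ann_type, category=None):
    """Filter annotations by type (and optionally by POS category),
    then recover the surface form via the offsets."""
    for ann in annotations:
        if ann.type != ann_type:
            continue
        if category and ann.features.get("category") != category:
            continue
        yield doc[ann.start:ann.end]

doc = "Ontology evaluation is a critical task."
anns = [Annotation(1, "Token", 34, 38, {"category": "NN"})]
print(list(words_of_type(doc, anns, "Token", "NN")))  # ['task']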
The output of these steps is an annotated corpus, which is then fed to a JAPE transducer to match the particular patterns required by the ontology learning algorithms. The algorithms use certain criteria to evaluate the confidence of the extracted entities. The following sections present the techniques and criteria used by these algorithms to extract the different ontology components.

7.4.1 Concepts

Text2Onto comes with three algorithms for extracting concepts: EntropyConceptExtraction, RTFConceptExtraction and TFIDFConceptExtraction. Each looks for the type 'Concept' in the GATE results.
All of these algorithms filter the same type; the only difference is the criterion they use for the probability/relevance calculation. These algorithms use statistical measures such as TFIDF (Term Frequency Inverted Document Frequency), entropy, C-value, NC-value, and RTF (Relative Term Frequency). For each term, the values of these measures are normalized to [0, 1] and used as the corresponding probability in the POM. A small illustrative sketch of the frequency-based measures follows at the end of this subsection.

1. RTFConceptExtraction
It calculates the relative term frequency, which is obtained by dividing the absolute term frequency (the number of times a term t appears in the document d) by the maximum absolute term frequency in d (the number of occurrences of the term that appears most often in d):

    tf(t, d) = absolute term frequency / maximum absolute term frequency

2. TFIDFConceptExtraction
It calculates the term frequency-inverse document frequency, which is the product of TF (term frequency) and IDF (inverse document frequency). IDF is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

    tf-idf(t, d, D) = tf(t, d) x idf(t, D), where idf(t, D) = log(|D| / df(t))

|D| is the total number of documents and df(t) is the number of documents containing the term t.

3. EntropyConceptExtraction
It computes a combination of the C-value (an indicator of termhood) and the NC-value (contextual indicators of termhood).

C-value (a frequency-based measure sensitive to multi-word terms):

    C-value(a) = log2|a| * f(a)                                          if a is not nested
    C-value(a) = log2|a| * (f(a) - (1/|Ta|) * sum over b in Ta of f(b))  otherwise

where f(a) is the frequency of a and Ta is the set of terms which contain a.

NC-value (incorporating information from context words indicating termhood):

    weight(w) = t(w) / n

where t(w) is the number of times that w appears in the context of a term.
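As a rough illustration of the RTF and TF-IDF measures above, here is a sketch under simplifying assumptions (whitespace tokenization, no normalization to [0, 1]); this is not Text2Onto's code:

import math
from collections import Counter

def rtf(term, doc_tokens):
    """Relative term frequency: absolute frequency of `term`
    divided by the frequency of the most frequent term."""
    counts = Counter(doc_tokens)
    return counts[term] / max(counts.values())

def tf_idf(term, doc_tokens, corpus):
    """tf-idf(t, d, D) = tf(t, d) * log(|D| / df(t))."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return rtf(term, doc_tokens) * math.log(len(corpus) / df)

corpus = [
    "ontology learning extracts concepts from text".split(),
    "the tool extracts relations from text".split(),
    "ontology building is time consuming".split(),
]
doc = corpus[0]
print(rtf("ontology", doc))             # 1.0 (every term occurs once here)
print(tf_idf("ontology", doc, corpus))  # 1.0 * log(3/2), term occurs in 2 of 3 docs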
7.4.2 Instances

An algorithm called TFIDFInstanceExtraction is available in Text2Onto for the extraction of instances. It filters the "Instance" type from the GATE results and computes TFIDF as in TFIDFConceptExtraction.

7.4.3 General relations

General relations are identified using a linguistic approach. The algorithm SubcatRelationExtraction filters the types "TransitiveVerbPhrase", "IntransitivePPVerbPhrase", and "TransitivePPVerbPhrase" in the GATE results, which are obtained by shallow parsing, to identify the following syntactic frames:
• Transitive, e.g., love (subj, obj)
• Intransitive + PP-complement, e.g., walk (subj, pp(to))
• Transitive + PP-complement, e.g., hit (subj, obj, pp(with))
For each verb phrase, it finds the subject, object, and associated preposition (by filtering nouns and verbs from the sentence), then stems them and prepares the relation.

7.4.4 Subclass-of relations

Subclass-of relation identification involves several algorithms which use the hypernym structure of WordNet, match Hearst patterns, and apply linguistic heuristics. The results of these algorithms are combined through combination strategies. These algorithms depend on the results of the concept extraction algorithms. The relevance calculation of one of the algorithms is presented below:

1. WordNetClassificationExtraction
It extracts subclass-of relations among the extracted concepts by identifying the hypernym structure of the concepts in WordNet. If a is a subclass of b, the relevance is calculated as follows:

    relevance(a, b) = (number of synonyms of a for which b is a hypernym) / (number of synonyms of a)

7.4.5 Instance-of relations

Lexical patterns and context similarity are taken into account for instance classification. A pattern-matching algorithm similar to the one used for discovering mereological relations is also used for instance-of relation extraction.

7.4.6 Equivalence and equality

The algorithm calculates the similarity between terms on the basis of contextual features extracted from the corpus.

7.4.7 Disjointness

A heuristic approach based on lexico-syntactic patterns is implemented to learn disjointness. The algorithm learns disjointness from patterns like "NounPhrase1, NounPhrase2, ... (and/or) NounPhraseN".

7.4.8 Subtopic-of relations

Subtopic-of relations are discovered using a method for building concept hierarchies. There is also an algorithm for extracting this kind of relationship from previously identified subclass-of relations.

7.5 NeOn Toolkit

The NeOn Toolkit is an open-source, multi-platform ontology engineering environment that provides comprehensive support for the ontology engineering life cycle. It is based on the Eclipse platform and provides various plugins for different activities in ontology building. The following plugins are within the scope of this case study:

7.5.1 Text2Onto plug-in

This is a graphical front-end for Text2Onto that is available for the NeOn Toolkit. It enables the integration of Text2Onto into a process of semi-automatic ontology engineering.

7.5.2 LeDA plug-in

LeDA, an open-source framework for the automatic generation of disjointness axioms, has been implemented in this plug-in, which was developed to support both enrichment and evaluation of the acquired ontologies. The plug-in facilitates a customized generation of disjointness axioms for various domains by supporting both the training and the classification phase.

7.6 OntoCase

OntoCase is an approach to using ontology patterns throughout an iterative ontology construction and evolution framework. In OntoCase the patterns constitute the backbone of reusable solutions because they can be utilized directly as solutions to specific modeling problems. The central repository consists of a pattern catalogue, an ontology architecture, and other reusable assets. The OntoCase cycle consists of four phases: retrieval, reuse, evaluation and revision, and discovery of new pattern candidates. The first phase corresponds to input analysis and pattern retrieval; it constitutes the process of analyzing the input and matching the derived input representation to the pattern base to select appropriate patterns. The second phase includes pattern specialization, adaptation, and composition, and constitutes the process of reusing the retrieved patterns and constructing an improved ontology. The third phase concerns evaluation and revision of the ontology to improve its fit to the input and its quality.
The final phase includes the discovery of new pattern candidates or other reusable components as well as the storing of pattern feedback.
VIII. Learning disjointness axioms (LeDA)

LeDA is an open-source framework for learning disjointness [3] and is based on a machine learning classifier, Naive Bayes. The classifier is trained on a vector of feature values and manually created disjointness axioms (i.e., pairs of classes labeled 'disjoint' or 'not disjoint'). The following features are used in this framework:

Taxonomic overlap: the set of common individuals.

Semantic distance: the semantic distance between two classes c1 and c2 is the minimum length of a path of subsumption relationships between atomic classes that connects c1 and c2.

Object properties: this feature encodes the semantic relatedness of two classes c1 and c2 based on the number of object properties they share.

Label similarity: this feature gives the similarity between two classes based on a common prefix or suffix shared by their labels. The Levenshtein edit distance, Q-grams, and the Jaro-Winkler distance are taken into account to calculate label similarity in LeDA.

WordNet similarity: LeDA uses a WordNet-based similarity measure that computes the cosine similarity between vector-based representations of the glosses associated with the two synsets.

Features based on the learned ontology: from the already acquired knowledge, such as terminological overlap, classes, individuals, subsumption, and class membership axioms, further features are calculated, viz. subsumption, taxonomic overlap of subclasses and instances, and lexical context similarity.
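A minimal sketch of the classification setup described above: a Naive Bayes classifier trained on per-pair feature vectors with manual 'disjoint'/'not disjoint' labels. The feature values and the scikit-learn usage are illustrative assumptions, not LeDA's actual implementation:

from sklearn.naive_bayes import GaussianNB

# Each row: [taxonomic_overlap, semantic_distance, label_similarity]
# for one pair of classes; labels: 1 = disjoint, 0 = not disjoint.
# The values below are made up for illustration.
X_train = [
    [0.0, 6, 0.1],   # e.g. (Person, Publication)
    [0.8, 1, 0.7],   # e.g. (Author, Person)
    [0.0, 5, 0.2],   # e.g. (Tool, Domain)
    [0.6, 2, 0.5],   # e.g. (Method, Methodology)
]
y_train = [1, 0, 1, 0]

clf = GaussianNB().fit(X_train, y_train)

# Predict disjointness for a new, unseen pair of classes.
print(clf.predict([[0.1, 4, 0.15]]))  # e.g. [1] -> predicted disjoint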
IX. LExO for Learning Class Descriptions

LExO (Learning Expressive Ontologies) [3] automatically generates DL axioms from natural language sentences. It analyzes the syntactic structure of the input sentence and generates a dependency tree, which is then transformed into an XML-based format and finally into DL axioms by means of manually engineered transformation rules. However, this automated generation of DL axioms needs human intervention to verify that all of them are correct.

X. Relexo

Relexo (Relational Exploration for Learning Expressive Ontologies) is a tool used for the difficult and time-consuming phase of ontology refinement [4]. It not only supports the user in a stepwise refinement of the ontology but also helps to ensure the compatibility of a logical axiomatization with the user's conceptualization. It combines a method for learning complex class descriptions from textual definitions with the Formal Concept Analysis (FCA)-based technique of relational exploration. The LExO component assists the ontologist in the process of axiomatizing atomic classes; the exploration part helps to integrate newly acquired entities into the ontology. It also helps the user to detect inconsistencies or mismatches between the ontology and her conceptualization and hence provides a stepwise approximation of the user's domain knowledge.

XI. Alignment To Top-Level Ontologies

This is a special case of ontology matching where the goal is primarily to find correspondences between the more general concepts and relations of a top-level ontology and the more specific concepts and relations of the engineered ontology. Aligning an ontology to a top-level ontology might also be compared to automatically specializing or extending the top-level ontology. Methods like lexical substitution may be used to find clues as to whether or not a more general concept in one ontology is related to a more specific one in the other; the alignment can also exploit ontology engineering patterns. Determining that a pattern can be applied, and applying it, then provides a connection to the top-level ontology.

XII. Experiment

In order to evaluate the results of Text2Onto and improve them, some experiments were carried out. The objectives of the experiments were:
• To analyze the various algorithms and criteria used by Text2Onto for extracting different ontology components.
• To analyze the results produced by Text2Onto.
• To compare the components extracted by Text2Onto with the ones extracted manually.
• To analyze errors found in the ontology built by Text2Onto and identify their origin.
• To analyze Text2Onto's outcomes when a meta-model of the ontology is added as an additional input.
Details on the experimental data and the experimental protocol are presented in the following sections.
XIII. Experimental Data

The experiments were conducted on three individual texts. The first text, called 'Abstract' from here on, was a compilation of the abstracts of four different papers. The remaining texts are referred to as 'Text1' and 'Text2'. All of these texts were related to ontology building and ontology learning tools. Ontologies were built from these texts both manually and with Text2Onto.

XIV. Experimental Protocol

The experiments were performed in five phases. The first phase involved building an ontology manually from each of the three texts. The second phase was concerned with the development of an ontology using Text2Onto. In the third phase, the ontology built by Text2Onto was compared with the manual one. In the next phase, a meta-model of each text was fed to Text2Onto and the corresponding ontology was built again. Finally, the results were compared with the earlier ontologies. These phases are described in detail in the following section.

14.1 Experimental Work-flow

The following steps were carried out for each text:

1. Building the ontology manually
Methontology was followed to build the ontologies from the three texts manually. All the steps, like glossary building, meta-modeling, and taxonomy definition, were followed while building the ontologies from Abstract and Text2, whereas the ontology of Text1 was provided to us. The ontology was conceptualized in the following way:
1. POS tagging of all the terms in the document.
2. Identifying the concepts and relations from the validated terms.
3. Making the meta-model. The aim is to subsume all the accepted concepts under some of the core concepts.
4. Identifying the accepted terms (concepts) and their related core concepts, and finding their synonyms.
5. Defining the is-a hierarchy for the concepts and the identified core concepts.
6. Identifying the other binary relations.
7. Validating the meta-model.

2. Building the ontology using Text2Onto
This step involved the use of Text2Onto to build the same ontology automatically.

3. Analysis of the Text2Onto results
The analysis phase was itself done in two parts. First, the results of the different algorithms of Text2Onto were compared with each other in order to find the interesting criteria for the extraction of the different components. This was done for concept, instance, relation, and hierarchy extraction. The main criterion for the comparison was the relevance value. Second, a comparison and study of the differences between the results of the tasks performed in the previous two phases was carried out to estimate and comment on the quality of the ontology built by the tool. The comparison was very detailed, in the sense that all concepts, instances, relations, and hierarchies extracted by the two methods were compared. It was followed by the identification of the causes of the differences and of the errors/shortcomings in the performance of the tool.

4. Adding the meta-model to the ontology using Text2Onto
The idea was to observe whether Text2Onto gives better results when the ontology is built on top of its meta-model. For this, the meta-model built manually in the first phase was introduced into Text2Onto and the ontologies were built upon their corresponding meta-models. This process involved the following steps:
(a) Conversion of the meta-model into text
In order to get a POM of the meta-model, we converted the meta-model into text from which Text2Onto can extract the core concepts and the relations between them.
Details about the conversion process are given in Section XVI (Conversion of Meta-Model to Text).
(b) Obtaining the meta-model POM
The meta-model text was fed to Text2Onto to obtain a meta-model POM which contained all core concepts and the relations between them.
(c) Improving the ontology using the meta-model
Once the POM had been obtained from Text2Onto, the original text was added to it to build a new ontology combined with the meta-model.
5. Comparison of the ontologies built with and without the meta-model
In this phase, the ontology built in the second phase was compared with the one built using the meta-model. Relevance values, the identification of new components, and the hierarchies were considered in the comparison.

XV. Results And Observations

15.1 Comparison of the algorithms and criteria of Text2Onto

The algorithms and criteria used by Text2Onto for extracting ontology components were studied in detail so as to compare their performance. The comparison was done based on the relevance values computed by these algorithms.

15.1.1 Observations

Though the relevance values produced by the entropy criterion differ from those produced by the other algorithms, they preserve similar relations and relative values for the concepts. The same holds for combinations of one or more such evaluation algorithms. It was observed that the ranking of the extracted components is independent of the algorithm/criterion used, so we cannot say that one algorithm or criterion is superior to the others. We observed the same behavior in all three texts.

XVI. Conversion Of Meta-Model To Text

In order to try to improve the ontology built by Text2Onto, the meta-model is used and must be translated into text. As all concepts and relations of the meta-model should be identified when it is processed by the tool, the first attempt was to write a paragraph about the meta-model. This worked fine for most of the concepts, but very few relationships could be identified, some concepts were left out, and some extra concepts were included (those used in the paragraph merely to structure the meta-model translation). The next attempt was to write simple sentences consisting of two nouns (the concepts) related by a verb (the relation between the two concepts); we tried to use only the core concepts and relations from the text as much as possible. However, this also could not identify all the relations properly. Finally, a new algorithm was proposed to achieve the desired goal and to enhance the results of Text2Onto (see Section 17.1). Below are the translations of the meta-models for the various experimental texts.

16.1 Abstract Text

The meta-model of this text is given in Figure 1. For this meta-model, we used the following lines to construct the meta-model POM in Text2Onto:
A system is composed of methods. A method has method components. A tool implements methods. An algorithm is used by methods. An expert participates in ontology building step. Ontology building step uses resources. A resource is stored in data repository. A term is included in resources. Ontology building step is composed of ontology building process. Ontology has ontology components. A user community uses ontologies. Ontology describes domain.

Figure 1: Abstract-Text Meta Model
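The translation above was written by hand. A minimal sketch of how such (concept, relation, concept) triples could be serialized into simple sentences automatically (a hypothetical helper, not part of Text2Onto):

def triples_to_text(triples):
    """Turn (subject, relation, object) triples into one simple
    sentence per triple, the form Text2Onto handled best."""
    sentences = []
    for subj, rel, obj in triples:
        sentences.append(f"{subj.capitalize()} {rel} {obj}.")
    return " ".join(sentences)

meta_model = [
    ("a system", "is composed of", "methods"),
    ("a tool", "implements", "methods"),
    ("ontology", "describes", "domain"),
]
print(triples_to_text(meta_model))
# A system is composed of methods. A tool implements methods. Ontology describes domain.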
16.2 Text1

The meta-model of this text is given in Figure 2.

Figure 2: Text1 Meta Model

16.3 Text2

The meta-model of this text is given in Figure 3 and the corresponding meta-model text is given below:
Domain has ontology. Ontology is composed by ontology components. Ontology is built by methodology. Tool builds ontology. Activity is guided by methodology. Activity produces model. Representation is resulted by model. Tool supports activity. Organization develops tool. Methodology is developed by organization. Tool uses language.
Person uses tool. Person creates ontology.

Figure 3: Text2 Meta Model

16.4 Comparison of Manual and Automated Ontologies

This section presents the comparison of the two methods of ontology building, i.e., MANUAL and AUTOMATED with the tool Text2Onto. The aim of the comparison is to evaluate the process of ontology building by the tool and then to analyze the results in order to suggest improvements to the tool.

16.4.1 Manual Ontology - Abstract

The Abstract text was the shortest of all the texts. It had 536 terms in total, of which 34 terms were accepted as concepts and 9 as instances.

16.4.2 Automated Ontology - Abstract

The same text was fed to Text2Onto to automate the process of ontology building. As the relative importance of ontology components based on relevance values was found to be independent of the algorithm used, we could choose any algorithm from the available list. As we were extracting an ontology from a single document, the algorithms that use the TFIDF criterion were not interesting for us, so we did not choose them during the analysis. The evaluation algorithms used in Text2Onto gave relevance values to the identified concepts and other components. Text2Onto did not support writing the results to a separate file, so we added a method that saves the results in a separate Excel file for each execution of Text2Onto; this was also necessary for the later comparison phases. Text2Onto extracted 85 concepts, 14 individuals, and 3 general relations.

16.4.3 Comparison of manual and automated ontology - Abstract

The two ontologies were compared mainly on the basis of the identified concepts, instances, and relations. Of the 34 concepts extracted manually, only 26 matched the ones extracted by Text2Onto. Only 7 instances were common to both ontologies, and none of the relations were common to them. We observed that the manual ontology was better at identifying the concepts, because the ontology made by Text2Onto also included some irrelevant concepts. Another major problem was the identification of composite concepts: unlike in the manual ontology, not all composite concepts (consisting of more than one atomic word) were identified. The relations were not at all satisfactory.
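The concept-level comparison reduces to set overlap between the manually accepted terms and the terms extracted by the tool. A minimal sketch (hypothetical helper; the term lists are abbreviated for illustration):

def compare_ontologies(manual_terms, extracted_terms):
    """Report the overlap between manually accepted terms and
    tool-extracted terms, plus a recall-style ratio."""
    manual, extracted = set(manual_terms), set(extracted_terms)
    common = manual & extracted
    return {
        "common": sorted(common),
        "missed_by_tool": sorted(manual - extracted),
        "extra_from_tool": sorted(extracted - manual),
        "recall_of_manual": len(common) / len(manual),
    }

manual = ["ontology", "concept", "ontology component", "tool"]
automated = ["ontology", "concept", "tool", "order", "component"]
print(compare_ontologies(manual, automated))
# 3 of 4 manual concepts matched; the composite term "ontology component" was missed.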
The possible reasons for these differences are as follows:
1. The text was not consistent as a whole. It was basically a compilation of different texts and hence lacked cohesion between its paragraphs. There was thus a need to try a longer and better text in order to conclude anything significant.
2. The frequency of most of the terms (concepts and relations) was very low.

16.4.4 Manual ontology - Text1

For this ontology, there were 4807 terms after tokenization, of which 472 were nouns and 226 were verbs. After stemming, the number of nouns was reduced to 357, close to a 25% reduction compared with the original count.

16.4.5 Automated ontology - Text1

Text1 was fed to Text2Onto to build the ontology automatically. 406 concepts, 94 instances, and 16 relations were extracted by Text2Onto.

16.4.6 Comparison of manual and automated ontologies - Text1

Compared with the 357 terms of the manual ontology, Text2Onto extracted 406 terms. Among them, only 87 concepts were common to both. Some highly irrelevant terms were also included in the Text2Onto results on account of their high relevance values. On the other hand, some important composite terms were missing from the results of the automated ontology.

16.4.7 Manual ontology - Text2

Following the same procedure as above for building the manual ontology, there were 4761 terms in the knowledge base. 667 valid terms were refined from this knowledge base, of which ultimately 200 terms were accepted as concepts of the ontology.

16.4.8 Automated ontology - Text2

350 terms (concepts) were extracted from this text when it was run through Text2Onto. A lot of the concepts were insignificant and had to be rejected when the comparison was made.

16.4.9 Comparison of manual and automated ontologies - Text2

This automated ontology was better than the earlier ones, as it could identify many relations, and the is-a hierarchy was better than in the others.

16.4.10 Observations: relevance values and their roles

In order to assess the results of Text2Onto and the possibility of automating the process of ontology building, we examined the role of the relevance values for concepts in Text2Onto. The following observations were made:
• Most of the terms extracted by Text2Onto as concepts can be accepted based on their relevance values.
• The core concepts generally have very high relevance.
• Most of the terms with high relevance values are accepted.
• There are concepts which are always rejected despite their very high values. After studying many papers and previous works in this field, we found no general rule that can be applied to reject these terms automatically, but some corpus-specific rules can be written.
• There are concepts which are accepted despite their low values.
In order to automate the handling of the last two cases, we tried to find out more about these kinds of concepts. We observed that the rejected terms with high relevance values tend to occur in the same kind of pattern. For example, the concept 'order' generally appears as part of "in order to". Thus, predefining many such exclusion patterns can be one solution for rejecting terms despite their high relevance values.
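A toy sketch of such corpus-specific exclusion rules (the pattern list and threshold are made-up examples, not rules shipped with Text2Onto):

# Terms are dropped when they mostly occur inside a known fixed phrase,
# e.g. "order" inside "in order to".
EXCLUSION_PATTERNS = {
    "order": ["in order to"],
    "respect": ["with respect to"],
}

def keep_term(term, corpus_text, threshold=0.8):
    """Reject `term` if at least `threshold` of its occurrences
    fall inside one of its exclusion phrases."""
    total = corpus_text.count(term)
    if total == 0:
        return False
    inside = sum(corpus_text.count(p) for p in EXCLUSION_PATTERNS.get(term, []))
    return inside / total < threshold

text = "in order to build an ontology, terms are ranked in order to compare them"
print(keep_term("order", text))  # False: every occurrence is in "in order to"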
16.5 Analysis of errors

16.5.1 Identification of errors

The following errors were identified while comparing the ontologies built manually with the ones built using Text2Onto:
1. Some concepts were also identified as instances by Text2Onto, e.g., ontology, WSD.
2. Acronyms were not identified by Text2Onto, e.g., SSI, POM (a sketch of a possible acronym check follows this list).
3. Synonyms were not identified properly.
4. Very few relations were identified by Text2Onto, and most of them were not appropriate (interesting) at all.
5. The instance-of algorithm did not return the instances given by the instance algorithm.
6. Some verbs like extract and inspect, which we had considered as relations, were identified as concepts by Text2Onto.
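Regarding error 2, acronym resolution can often be approximated by matching the acronym's letters against the initials of a candidate expansion. A minimal sketch of such a check (a hypothetical helper, not a Text2Onto module):

def is_acronym_of(acronym, phrase):
    """True if `acronym` matches the initial letters of the
    words in `phrase` (hyphens treated as word breaks)."""
    words = phrase.replace("-", " ").split()
    initials = "".join(w[0] for w in words)
    return acronym.lower() == initials.lower()

print(is_acronym_of("POM", "probabilistic ontology model"))  # True
print(is_acronym_of("POM", "preliminary ontology model"))    # True
print(is_acronym_of("SSI", "structural semantic interconnections"))  # True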
16.5.2 Identification of causes of errors

After an in-depth study of the algorithms of Text2Onto, the following causes of errors were observed:
1. The POS tagger used by GATE tags some words incorrectly. For example, the verb extract was tagged as a noun.
2. Errors may also be due to grammatical mistakes in the corpus file.
3. In the case of the Abstract text, errors may also be due to its length and content. The text contained four paragraphs from different papers and hence had few common terminologies.
4. The algorithms that extract concepts and instances work independently. Thus, the identification of a term as both a concept and an instance is not handled in Text2Onto.
5. The SubcatRelationExtraction algorithm can extract relations from simple sentences only. The patterns it can identify are:
Subject + transitive verb + object
Subject + transitive verb + object + preposition + object
Subject + intransitive verb + preposition + object
It identifies as relations only those verbs which come with a singular subject (concept). For example, it can extract the relation build from "a tool builds ontology" but not from "tools build ontology".

XVII. Improvement Of Text2Onto Results

As the results of Text2Onto were not good compared with the manual ontology, we did two things to improve them. First, we added an algorithm to improve the relation extraction of Text2Onto. Second, we performed experiments on Text2Onto, adding the meta-model to the ontologies built above. The following section describes the added algorithm and the results and observations from the experiment.

17.1 Algorithm to improve Text2Onto results

The relations extracted by Text2Onto were not interesting at all. Moreover, we found it difficult to make Text2Onto extract all the relations from the meta-model text. So, we decided to add an algorithm to improve the result of relation extraction in Text2Onto. To extract more relations in order to make a better meta-model, we added two JAPE rules along with an algorithm to process them. The added JAPE rules identify sentences in the passive voice and sentences with more than one verb (an auxiliary verb followed by a main verb) plus a preposition, i.e., the following syntactic patterns:
• Subject + be-verb + main verb + "by" + object, e.g., "Ontology is built by experts"
• Subject + auxiliary verb + main verb + preposition + object, e.g., "Ontology is composed of components"
Though these patterns are similar to each other, we added two patterns instead of one in order to identify these grammatically distinct patterns separately. The new algorithm can find these patterns in both the meta-model and the ontology text. As a result, we could obtain relations that were not identified in the text earlier. The added JAPE expressions are as follows:

Rule: PassivePhrase
(
  ({NounPhrase} | {ProperNounPhrase}):object
  {SpaceToken.kind == space}
  ({Token.category == VBZ} | {Token.string == "is"}):auxverb
  {SpaceToken.kind == space}
  ({Token.category == VBN} | {Token.category == VBD}):verb
  {SpaceToken.kind == space}
  ({Token.string == "by"}):prep
  {SpaceToken.kind == space}
  ({NounPhrase} | {ProperNounPhrase}):subject
):passive
-->
:passive.PassivePhrase = {rule = "PassivePhrase"},
:verb.Verb = {rule = "PassivePhrase"},
:subject.Subject = {rule = "PassivePhrase"},
:object.Object = {rule = "PassivePhrase"},
:prep.Preposition = {rule = "PassivePhrase"}

Rule: MultiVerbsWithPrep
(
  ({NounPhrase} | {ProperNounPhrase}):subject
  {SpaceToken.kind == space}
  ({Token.category == VBZ} | {Token.category == VB}):auxverb
  {SpaceToken.kind == space}
  ({Token.category == VBN} | {Token.category == VBD}):verb
  {SpaceToken.kind == space}
  ({Token.category == IN}):prep
  {SpaceToken.kind == space}
  ({NounPhrase} | {ProperNounPhrase}):object
):mvwp
-->
:mvwp.MultiVerbsWithPrep = {rule = "MultiVerbsWithPrep"},
:verb.Verb = {rule = "MultiVerbsWithPrep"},
:subject.Subject = {rule = "MultiVerbsWithPrep"},
:object.Object = {rule = "MultiVerbsWithPrep"},
:prep.Preposition = {rule = "MultiVerbsWithPrep"}

These JAPE expressions are used by the GATE application to match the syntactic patterns. Using the new algorithm, we could extract more relations from the original text.

17.2 Enhancement of the Ontology using the Meta-Model

The main idea was to improve the results of Text2Onto so that the process of building the ontology can be automated. For this, first of all, the text was fed to Text2Onto and the shortcomings were identified. In order to overcome them, we then fed the meta-model to the tool so as to obtain a better extraction of concepts, relations, and taxonomy. The experiment was carried out for the three text documents. The results obtained from the text alone were compared with the results obtained from the meta-model plus the text, to assess the improvement of the Text2Onto results.

17.2.1 Observations

The following observations were made when the meta-model and the ontology text were used on the same POM to make the ontology:
1. All the core concepts were identified and their relevance was increased. (The core concepts were identified earlier as well.)
2. The core concepts which are not present in the text had greater values.
3. The relations from the meta-model were identified and included in the ontology. Due to the addition of more patterns, some more relations were identified from the text. However, the useful relations are limited to the core concepts.
4. The hierarchy does not seem to be improved by the algorithms VerticalRelationsConceptClassification and PatternConceptClassification. Rather, core concepts with composite terms are further classified by these algorithms; for example, Ontology component was classified under Component. We have not checked this with the WordnetConceptClassification algorithm yet, as it gives lots of irrelevant subclass-of relations.

From these behaviors, we can put forward the following ideas for making a meta-model:
• We can make the meta-model with terms not present in the text (point 2).
• If terms present in the text are used for making the meta-model, we can try to increase the frequency of the core concepts in the meta-model itself (point 1).
• We can avoid composite terms in the meta-model as much as possible (point 4).

XVIII. Conclusion

We studied the architecture and working of Text2Onto, a tool that extracts ontologies from textual input, and analyzed its results by conducting experiments with three texts. As part of the experiments, ontologies were built both manually and with the tool and were compared with each other. After a detailed analysis of the results, we reached the following conclusions:
1. The relevance measure cannot be a general measure for rejecting or accepting all terms. In the automated ontology, there are several terms that have high relevance values and are still rejected by the experts because they do not hold importance for the ontology. There are also terms which are accepted even with significantly low relevance values; this is especially common with the core concepts. Hence the idea of directly using relevance values for accepting or rejecting concepts needs further refinement.
2. The meta-model could not improve the ontology in terms of its is-a hierarchy. Though the meta-model increased the relevance values of the core concepts, the is-a hierarchy was not improved. Even with more extracted relations and properly identified core concepts from the meta-model, it could not help in making the hierarchy better: identifying the relations and concepts has no effect on the results of the subclass-of algorithms. As stated above, a few refinements are possible; they are suggested in the next section.

XIX. Future Work

From the study of Text2Onto and the outcome of the analysis of its results, we suggest the following future work and enhancements to Text2Onto:
1. Enhance the use of the meta-model to modify the is-a hierarchy of the ontology. After adding the corpus to the upper ontology (using the meta-model), we should increase the relevance values of the concepts that were identified only in the upper ontology, because those core concepts may not be frequent or very relevant in the corpus.
2. We can try to manually include the following kind of hierarchy in the ontology. Text2Onto uses the following idea while extracting relations: if A <is related to> B and C <is related to> D, then A <is related to> D and C <is related to> B as well. This kind of relation structure can be exploited to improve the hierarchy of concepts: if A <related to> B and C <related to> D, then C and D can be considered subclasses of A and B respectively. Though this idea may not be applicable to all relations, we can enhance the meta-model significantly for some relations with the same name.
3. Another algorithm can be added in which some "unwanted" domain concepts can be predefined and hence prevented from being included in the ontology. This task will require human interaction before starting to build the ontology, because the "interestingness" of the concepts depends significantly on the domain. A similar approach can be followed for the "infrequent" but "significant" concepts of a particular domain. These two approaches could let us use the relevance measure as a significant criterion for accepting or rejecting a term. Hence the problem of the difference in concepts between the manual and automated ontologies could be overcome.
4. As the algorithms are executed separately, some terms are identified as both concepts and instances. A feature (or post-processing step) can be included so that a term is listed either as a concept or as an individual, but not both. Post-processing is also required to remove unnecessary or irrelevant subsumption relations. Synonyms can be taken into account to improve the results of the subsumption algorithm.
5. A module can be added to identify acronyms. For example, from the text, POM and "probabilistic ontology model" should be identified as one term.

References
[1] Mariano Fernandez, Asuncion Gomez-Perez, and Natalia Juristo. Methontology: from ontological art towards ontological engineering. 1997.
[2] Tom Gruber. What is an ontology? 1992.
[3] J. Volker. Prototype for learning networked ontologies. Deliverable D3.8.1 of the NeOn project, 2009.
[4] Johanna Volker and Eva Blomqvist. Evaluation of methods for contextualized learning of networked ontologies. Deliverable D3.8.2 of the NeOn project, 2008.
[5] O. Corcho, M. Fernandez-Lopez, A. Gomez-Perez, and A. Lopez-Cima. Building legal ontologies with Methontology and WebODE. Pages 142-157, 2003.
[6] P. Buitelaar, P. Cimiano, and B. Magnini. Ontology learning from text: an overview. In Ontology Learning from Text: Methods, Evaluation and Applications, pages 3-12, 2005.
[7] P. Cimiano and J. Volker. Text2Onto - a framework for ontology learning and data-driven change discovery. 2005.