Nature-inspired methods for the Semantic Web


Methods inspired from nature for the Semantic Web.


Claudiu Mihăilă and Magdalena Jitcă
Faculty of Computer Science, "Al.I. Cuza" University of Iași,
16, G-ral Berthelot Street, 700483 Iași, Romania
{claudiu.mihaila, magdalena.jitca}

Abstract. Significant research effort has recently been directed towards uncertainty representation and reasoning in ontologies for the Semantic Web. This work reports on contributions that use nature-inspired methods in several Semantic Web domains, such as information retrieval and extraction, clustering, and personalisation. It also briefly describes the attempts at modelling uncertainty.

Key words: Semantic Web, methods inspired from nature, soft computing, Web mining, uncertainty modelling

1 Introduction

In the context of an ever-expanding World Wide Web (WWW), more than 100 million registered domains [1], over 25 billion indexed pages [2], and more than one trillion unique URLs [3] have been reported. The variety of information available on the Web has led researchers in multiple directions. Two of the most important concern the difference between human- and machine-understandable information, and information uncertainty. Until the past few years, Semantic Web models included little explicit information about uncertainty representation and processing, because of concerns about the scalability and computational complexity of such an approach. Much research interest focuses on techniques for extracting incomplete, partial or uncertain knowledge, as well as on handling uncertainty when representing extracted information using ontologies.

This report provides an overview of the contributions to this research area regarding the development or improvement of currently available Semantic Web tools and models by means of soft computing. It also presents work dealing with the representation of uncertain knowledge and with reasoning in the presence of uncertainty.

In the near future, Semantic Web systems are expected to integrate a consistent set of the available soft computing techniques, including uncertainty representations, statistical measures, fuzzy rules or belief networks for transmission across the Web.
In the first part of the report, we describe the uses of nature-inspired methods in the Web and then in the Semantic Web. In the second part, we describe the attempts at modelling uncertainty.

2 Current use of nature-inspired methods in the Web

Due to the vastness and diversity of the Web, it has become impossible to create software that covers it completely and correctly understands the information it contains. The lack of structure and patterns, together with the large amount of data, has led researchers to develop nature-inspired methodologies, which can, in most cases, find optimal or near-optimal solutions to NP-complete problems.

Methods inspired from nature are used in various Web domains. For example, SnapAd.com uses genetic algorithms to produce advertisements. The service begins with a base population of ad variations and, by applying the genetic algorithm, selects their best-performing characteristics in order to create a better-performing advertisement.

Other works, such as [4, 5], use genetic algorithms to determine clusters of similar users in social networks. These algorithms use fitness functions that measure the number of intra- and inter-group connections, together with variation operators that considerably reduce the space of possible solutions.

In addition, nature-inspired methods have been successfully used in search engines [6], information retrieval [7], and question answering [8] systems.

3 Nature-inspired methods in the Semantic Web

Web mining is the area of data mining which deals with the analysis and extraction of interesting knowledge from the World Wide Web. However, problems are very likely to arise when working with large amounts of mixed, poorly tagged and constantly changing information. According to [9], the main problems concern handling context-sensitive queries, summarisation, deduction, personalisation and learning. Fig. 1 depicts the subtasks of web mining, which are discussed below along with the problems they might raise.

Fig. 1. Web mining subtasks
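The genetic-algorithm loop referred to in Section 2 (a population of candidates, a fitness function, and variation operators) can be sketched as follows. This is only a minimal illustration, not the method used by SnapAd.com or by [4, 5]: the bit-string encoding, the toy `fitness` function (number of 1-bits), the tournament size and all parameter values are assumptions made for the example.

```python
import random


def fitness(candidate):
    # Toy objective (assumption): the number of 1-bits in the candidate.
    return sum(candidate)


def evolve(length=20, pop_size=30, generations=60, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Base population of random bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the best candidate survives unchanged.
        next_gen = [max(population, key=fitness)[:]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

In a real application the bit string would encode, for instance, an ad variation or a partition of users into groups, and `fitness` would be replaced by a measure such as click performance or the ratio of intra- to inter-group connections.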
Information retrieval

The issues which may occur during the task of information retrieval (IR) are related to the uncertainty and the accuracy of the user queries, as well as to the deduction and decision capabilities of the system. Several fuzzy logic approaches, which address the formulation of queries in relation to the relevance of the resulting documents, are surveyed in [9]. The results show that systems based on fuzzy Boolean IR models would be most suitable for representing both the document contents and the information needs.

Artificial neural networks (ANN) also provide a convenient method of knowledge representation for IR applications, as their learning ability eases the task of implementing adaptive systems. The system in [10] first encodes the initial knowledge base, and then constantly refines it by means of the neural networks. The advantage of this approach is that the correctness of the initial information does not directly influence the output, as this information is improved at each step by extracting rules from the knowledge-based NNs.

The genetic algorithms (GA) that have been used for this purpose assign so-called relevance coefficients to the HTML tags, which are deduced from the training text set. As regards the sub-task of query optimisation, GAs have been used for reweighting the document indexing without having to expand the queries [11].

A novel approach using evolutionary algorithms in a distributed environment is reported in [12]. Its aim is to determine to which information sources the queries should be sequentially sent. By combining a query sampling method and an evolutionary method, the resource descriptions are retrieved and integrated optimally. The process of ontological mediation with query-based sampling is depicted in Fig. 2 [13]. While the crawlers sample the resource descriptions of the information sources, the mediator conducts the process of ontological mediation for the integration of the obtained ontologies into a single large one [14].

Fig. 2. A whole process of ontological mediation with query-based sampling. [13]

Moreover, because the crawlers continue to obtain semantic information from the sources, the ontologies evolve over time. This process is achieved by employing a genetic algorithm within the mediator, which determines the best mapping between the obtained semantic substructures and the estimated local ontology. The reported experiments demonstrate the scalability of the entire contextual mediation.

Another technique that can be used for approximate information retrieval is rough sets (RS) theory [9]. It assumes that the set of relevant documents cannot be determined exactly and represents it by its "upper" and "lower" approximations. The lower approximation is the most specific set, containing the documents that are definitely relevant to the searched item, while the upper approximation is the most general set, containing the documents that may possibly be relevant. This concept can further be used to improve the efficiency of IR systems by implementing a dynamic and focused search based on these approximations.

Information extraction

Information extraction (IE) is the task of identifying specific fragments of a single document that represent its core semantic content. The most effective IE methods found so far involve working with wrappers, which are procedures for extracting information from web resources. However, wrappers have the drawback of being specific to a certain resource, hence they cannot be applied to every available web resource.

This performance can be improved by using NNs with a boosted wrapper induction (BWI) technique [15]. By using the AdaBoost algorithm, BWI repeatedly reweights the training examples so that subsequent patterns handle training examples missed by previous rules. The results of the learning process are comparable to the ones obtained with the HMM technique for learning and then extracting the information [16].

Another approach is that of Inductive Logic Programming [17], in which logical rules are learned in order to identify phrases to be extracted from a document [18].

Clustering

Clustering is an important issue when dealing with web documents, as it supports tasks such as measuring relevance or speed, obtaining browsable summaries, or working with overlapping data.
However, there are still some unresolved problems regarding efficient clustering, which arise from the nature of web data itself.

A fuzzy clustering technique for web log data mining, based on an algorithm for clustering user sessions, is presented in [9]. It analyses the structure of a given website and its URLs in order to compute the degree of similarity between two user sessions.

The ability of NNs to model complex nonlinear functions can also be used for this task [9], for example in classifying web pages, as well as user patterns, in both supervised and unsupervised manners.

Another soft computing method used for document clustering is RS theory, within which variable precision and tolerance relations are significant for this task. In particular, rough mereology has been used for mining multimedia objects, as well as web graphs or semantic structures [19].

An evolutionary approach for the conceptual clustering of semantic knowledge bases is presented in [20]. The method can be applied to multi-relational knowledge bases, where it exploits, effectively and most importantly in a language-independent way, a semi-distance dissimilarity measure defined over the space of individual resources. Such clusterings of semantically annotated resources are of great interest because they can define new emerging concepts (concept formation), which can induce new concept definitions or a refinement of existing ones (ontology evolution). The evolutionary algorithm developed in [20] extends distance-based clustering procedures that employ medoids as cluster prototypes. It remains stable across multiple repetitions, converging towards clusterings of comparable quality with generally the same number of clusters, and it avoids being trapped in local minima. Furthermore, the work could be extended in order to create hierarchies of clusters of specific granularity.

Personalisation

Personalisation involves using technology to accommodate the differences between individuals. In this context, it means that the retrieved content and the search results should match the users' preferences and interests. The most effective way of learning user profiles is by using training data collected from several users or systems.

"Syskill and Webert", an agent which learns user profiles using the Bayesian classifier, is introduced in [9]. As an extension, it can be used to determine whether a user would be interested in a similar page. This decision is made by analysing the HTML source of a page, which requires that the page has been retrieved beforehand.

An improved way of obtaining useful, high-quality "aggregate user profiles" from patterns is given in [21]. This approach relies on two techniques that cluster both user transactions and page views in order to obtain overlapping aggregate profiles, which can later be used by recommender systems for real-time personalisation.
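The idea of learning a user profile with a Bayesian classifier can be illustrated with a small naive Bayes sketch over the words of rated pages. It is not the implementation of "Syskill and Webert": the binary word features, the Laplace smoothing, the labels "interesting" and "other", and the toy training pages are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(pages):
    """pages: list of (set_of_words, label) pairs rated by the user."""
    label_counts = Counter(label for _, label in pages)
    word_counts = defaultdict(Counter)  # label -> word -> pages containing it
    vocabulary = set()
    for words, label in pages:
        vocabulary |= words
        for word in words:
            word_counts[label][word] += 1
    return label_counts, word_counts, vocabulary


def predict(model, words):
    """Return the most probable label for a page given its set of words."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for word in vocabulary:
            # Laplace-smoothed probability that a page with this label
            # contains the word (Bernoulli model, an assumption).
            p = (word_counts[label][word] + 1) / (n + 2)
            score += math.log(p if word in words else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical pages rated by one user.
rated = [
    ({"ontology", "owl", "reasoning"}, "interesting"),
    ({"ontology", "rdf"}, "interesting"),
    ({"football", "scores"}, "other"),
    ({"football", "transfer"}, "other"),
]
profile = train(rated)
print(predict(profile, {"owl", "rdf"}))
```

The learned counts play the role of the user profile: a new page must first be retrieved and reduced to its set of words before `predict` can estimate whether the user would be interested in it.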
3.1 Uncertainty modelling

The issue of uncertainty on the Semantic Web is still a challenging research field, as this domain deals with imprecise information from different applications, each with its special knowledge representation needs (e.g., multimedia processing, face recognition, GPS systems). To deal with uncertainty in the Semantic Web and its applications, many researchers have proposed extending OWL and the Description Logic (DL) formalisms with special mathematical frameworks.

A probabilistic method based on Bayesian networks (BN) is proposed in [22] to represent and compute the overlap in concept hierarchies. The overlap between a pair of concepts (selected vs. referred) is a numeric value in the [0, 1] range that indicates how well a data item matches the query concept. It is 0 for disjoint concepts and 1 when the referred concept is subsumed by the selected one. Based upon the possible relations between concepts, a graph notation is used for representing the degree of overlap in the concept hierarchy. The goal of this approach is to represent the overlap between concepts from a taxonomic structure without requiring from the user any prior knowledge of probability theory or BNs.
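One simple set-based reading of this overlap value, assuming that each concept is available as a finite set of instances, is the fraction of the referred concept that is covered by the selected concept. The function name and the toy sets below are hypothetical; [22] computes such values with a Bayesian network rather than by enumerating instances.

```python
def overlap(selected, referred):
    """Share of the referred concept's instances that also belong to the
    selected concept (a value in the [0, 1] range)."""
    if not referred:
        raise ValueError("the referred concept has no instances")
    return len(selected & referred) / len(referred)


# Hypothetical instance sets for three concepts.
europe = {"Iasi", "Berlin", "Lisbon", "Istanbul"}
romania = {"Iasi"}
asia = {"Istanbul", "Tokyo"}

print(overlap(europe, romania))  # referred concept subsumed by the selected one
print(overlap(europe, asia))     # partial overlap
print(overlap(romania, asia))    # disjoint concepts
```

Under this reading the measure is asymmetric: `overlap(europe, romania)` and `overlap(romania, europe)` generally differ, which matches the distinction between the selected and the referred concept.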
A probabilistic framework for modelling uncertainty in Semantic Web ontologies based on Bayesian networks has been developed in [23]. Its goal is to convert any OWL ontology into a BN by using probabilistic extensions to description logics. The translated BN is semantically consistent with the original ontology and satisfies all the given probabilistic constraints. The drawback of this approach is that the probabilistic information must be added to the ontology by the human modeller, a task which requires knowledge of probability theory. This framework, called BayesOWL, is currently at version 1.0, and it is available for download [footnote 2] as a Java extension.

Footnote 2: ~ypeng/BayesOWL/

More recently, a World Wide Web Consortium (W3C) Incubator Group on Uncertainty Reasoning for the World Wide Web was created in order to describe situations where uncertainty reasoning would significantly improve information extraction, to identify methodologies which can be applied to these cases, and to develop a standardised representation of uncertainty [24]. The most commonly used approaches to uncertainty for the WWW that the group identified are probabilistic theories (e.g., BN), fuzzy logic, and belief functions. After analysing 16 use cases, the group developed an uncertainty ontology and concluded that uncertainty comes either from data or from reasoning.

4 Conclusions

In this report, we have summarised the achievements obtained with soft computing methodologies in the context of the Semantic Web and briefly described their principles. We have then briefly introduced uncertainty modelling and given an overview of some approaches.

Many important aspects still remain open for future research. Specifically, there is a need for scalable formalisms to support uncertainty and vagueness in ontology languages, and for implementations of these formalisms.

References

1. DomainTools, LLC: Domain Counts & Internet Statistics. Accessed 10 January 2010.
2. de Kunder, M.: The size of the World Wide Web. Accessed 10 January 2010.
3. Alpert, J., Hajaj, N.: We knew the web was big... (25 July 2008) Accessed 10 January 2010.
4. Pizzuti, C.: Community detection in social networks with genetic algorithms. In: GECCO '08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, New York, NY, USA, ACM (2008) pp. 1137–1138
5. Lipczak, M., Milios, E.: Agglomerative genetic algorithm for clustering in social networks. In: GECCO '09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, New York, NY, USA, ACM (2009) pp. 1243–1250
6. Picarougne, F., Monmarché, N., Oliver, A., Venturini, G.: Geniminer: Web mining with a genetic-based (2002)
7. Xu, Y., Deli, Y., Yu, L.: Efficient annealing-inspired genetic algorithm for information retrieval from web-document. In: GEC '09: Proceedings of the first ACM/SIGEVO Summit on Genetic and Evolutionary Computation, New York, NY, USA, ACM (2009) pp. 1017–1020
8. Figueroa, A.G., Neumann, G.: Genetic algorithms for data-driven web question answering. Evolutionary Computation 16(1) (2008) pp. 89–125
9. Pal, S.K., Talwar, V., Mitra, P.: Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Transactions on Neural Networks 13 (2002) pp. 1163–1177
10. Shavlik, J., Towell, G.G.: Knowledge-based artificial neural networks. Artificial Intelligence 70(1/2) (1994) pp. 119–165
11. Yang, J.J., Korfhage, R.R.: Query modification using genetic algorithms in vector space models. International Journal of Expert Systems 7(2) (1994) pp. 165–191
12. Jung, J.J.: An evolutionary approach to query-sampling for heterogeneous systems. Expert Systems with Applications 37(1) (2010) pp. 226–232
13. Jung, J.J.: Ontological framework based on contextual mediation for collaborative information retrieval. Information Retrieval 10(1) (2007) pp. 85–109
14. Noy, N.F., Musen, M.A.: Prompt: Algorithm and tool for automated ontology merging and alignment. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, AAAI Press / The MIT Press (2000) pp. 450–455
15. Freitag, D., Kushmerick, N.: Boosted wrapper induction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, AAAI Press / The MIT Press (2000) pp. 577–583
16. Bikel, D.M., Schwartz, R., Weischedel, R.M.: An algorithm that learns what's in a name. Machine Learning 34(1-3) (1999) pp. 211–231
17. Muggleton, S., ed.: Inductive Logic Programming. Academic Press, New York, NY (1992)
18. Freitag, D.: Toward general-purpose learning for information extraction. In: Proceedings of the 17th international conference on Computational linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1998) pp. 404–408
19. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. Int. J. Approx. Reasoning 15(4) (1996) pp. 333–365
20. Fanizzi, N., d'Amato, C., Esposito, F.: Evolutionary conceptual clustering based on induced pseudo-metrics. International Journal on Semantic Web & Information Systems 4(3) (2008) pp. 44–67
21. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Discovery and evaluation of aggregate usage profiles for web personalization. Data Min. Knowl. Discov. 6(1) (2002) pp. 61–82
22. Holi, M., Hyvönen, E.: Modeling uncertainty in semantic web taxonomies. Springer-Verlag, Berlin (2006)
23. Ding, Z., Peng, Y.: A probabilistic extension to ontology language OWL. In: HICSS '04: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track 4, Washington, DC, USA, IEEE Computer Society (2004) p. 40111.1
24. W3C Incubator Group Report: Uncertainty Reasoning for the World Wide Web. (31 March 2008) Accessed 10 January 2010.