
Nature Inspired Methods for the Semantic Web

Monica Macoveiciuc and Constantin Stan
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

Abstract. The Semantic Web is a vision of information that is understandable by computers. Although there is great exploitable potential, we are still in “Generation Zero” of the Semantic Web, since there are few compelling real-world applications. The heterogeneity, the volume of data and the lack of standards are problems that could be addressed through nature inspired methods. The paper presents the most important aspects of the Semantic Web, as well as its biggest issues; it then describes some methods inspired from nature - genetic algorithms, artificial neural networks, swarm intelligence - and the way these techniques can be used to deal with Semantic Web problems.
Introduction

The World Wide Web is a universal medium for information and data exchange. Exploiting the huge amount of knowledge distributed on the Web is a significant challenge. Humans can understand the information, but it takes great effort to find and combine data from such a large number of sources; on the other hand, computers can easily browse through millions of pages in no time, but they are not capable of understanding the content. The Semantic Web is a new paradigm for the Web in which the semantics of information is explicitly defined, making it possible for machines to understand and satisfy the requests of people and other machines to use Web resources [1]. In other words, the Semantic Web is a vision of information that is understandable by computers. It contains a set of design principles and a variety of enabling technologies. Some of the elements are expressed in formal specifications, while others are still to be rigorously described.
The ontology is a key aspect of the Semantic Web, although it does not have a universally accepted definition. It is described as “a formal specification of a shared conceptualization” [2]. There is no commonly agreed ontology that every data provider would rely on; the information is heterogeneous and distributed. Existing reasoning techniques may not be able to deal with the different ontologies describing the same piece of knowledge, with the high number of instances, with the lack of maintenance, the unreliability of the network, or the variety in quality of the information available on the Web. Given this context, soft computing has an important role in coping with knowledge, and methods inspired from nature might be able to suggest interesting solutions for these problems.
This paper presents nature inspired techniques that can address some of the main issues of the Semantic Web.
Genetic algorithms, swarm intelligence or neural networks could represent viable solutions for overcoming problems such as ontology alignment, concept classification, RDF query path optimization etc.
Semantic Web

Advanced information management is the main benefit brought by the Semantic Web vision. One should stop browsing documents and start performing concrete queries. New knowledge should be inferred from the existing facts. The potential advantages of these achievements are multiple:
– information can be located based on its meaning;
– information from different sources can be combined, summarized and presented to the user in an improved format;
– information can be integrated across different sources.

1 Technologies

Semantic Web technologies can be considered in terms of layers, each of them resting on and extending the functionality of the layers beneath it. The hierarchy of the most important languages and technologies is described in the famous “Layer Cake” diagram [3].

Semantic Web Layer Cake

The core technologies are RDF (Resource Description Framework) and RDFS (RDF Schema). RDF is a markup language for describing information and resources on the web. Any object that is uniquely identifiable by a URI (Uniform Resource Identifier) is considered a resource. Resources have properties (attributes or characteristics).
The RDF model is a collection of facts, represented by statements (triples). Each triple consists of a subject, a predicate and an object. The most common representation of a triple is the graph-based one: subject-predicate-object is seen as a node-arc-node link. The statements are unambiguous and have a uniform structure; each concept is defined in a dedicated space on the web. For example, the statement “Jane is Tom’s mother” can be expressed in RDF as:
<rdf:Description rdf:about="http://example.org/Jane">
  <!-- the example.org URIs are illustrative placeholders -->
  <s:isWoman>Jane</s:isWoman>
  <s:hasChild rdf:resource="http://example.org/Tom"/>
</rdf:Description>

In order to describe general statements about classes or groups of objects, we use RDF Schema, or RDFS. RDFS provides a basic object model, while RDF refers to specific objects. The statement above can be described in RDFS as “A woman is someone’s mother”.
RDF and RDFS allow us to describe aspects of a domain, but the modeling primitives are too restrictive to be of general use. The taxonomic structure of the domain, the restrictions and constraints cannot be described through this model. It is also not possible to reason over inference rules. All these limitations are overcome with the use of ontologies. Ontologies provide a common understanding of a domain of interest. The specification is formal, which means that computers can perform reasoning about it. OWL (Web Ontology Language) is a family of ontology languages, and it is the W3C specification for creating Semantic Web applications. OWL builds upon RDF and RDFS and defines hierarchies and relationships between resources. Semantic Web ontologies consist of a taxonomy and a set of inference rules from which machines can draw logical conclusions. A taxonomy is a system of classification that groups resources into classes and sub-classes based on their relationships and shared properties.
The top layers of the Layer Cake are very important in the context of Semantic Web application deployment. The trust layer deals with authentication and reliability of data and services, through the use of digital signatures, ratings by certification agencies, recommendations by trusted agents etc. The proof layer allows applications to give proof of their conclusions, and it includes the actual deductive process, validation etc.
Several refinements have been proposed for the Semantic Web Layer Cake. One of them, suggested by Sir Tim Berners-Lee in 2006, includes new features, such as:
– Rule Interchange Format (RIF).
It is a language for representing rules and for linking rule-based systems; the formalisms are being extended in order to encapsulate probabilistic, temporal and causal knowledge.
– RDF Extraction. GRDDL (“Gleaning Resource Descriptions from Dialects of Languages”) is a language that identifies when an XML document contains data compatible with RDF and is capable of extracting that data.
– Database Support for RDF. Oracle provides support for RDF and OWL databases; for the moment, the focus is on storage rather than inferencing capabilities. There are various open source projects that offer solutions for storage - such as Jena - as well as query languages for RDF (SPARQL being the most important).
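One of the technologies mentioned above, SPARQL, can be illustrated with a short query; the prefix and property names below reuse the hypothetical s: vocabulary from the earlier RDF example and are not taken from the paper:

```sparql
PREFIX s: <http://example.org/schema#>

# Find every resource that has a child, together with that child
# (illustrative vocabulary, not a standard one).
SELECT ?parent ?child
WHERE {
  ?parent s:hasChild ?child .
}
```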
Revised Semantic Web Layer Cake

2 Current Problems

Although the Semantic Web vision has great potential, for more than a decade it has been “a kind of academic exercise rather than a practical technology” [4]. One of the main reasons is the lack of a common understanding of what the Semantic Web can offer and, more particularly, of the role of ontologies. RDF and OWL can be confusing and complicated to understand for less technical people. There is a huge amount of information that needs to be annotated in order to be processed and inferred over, and the two possible solutions for this are both hard to put into practice: either an automatic process should apply an algorithm that takes a piece of text and produces RDF, or people should manually annotate existing documents. The first approach - an intelligent algorithm - is unlikely, since having such an algorithm would make RDF and OWL seem deprecated. Manual annotation is inefficient and prone to error.
One of the biggest issues of the Semantic Web is that it seems to be scattered into small pieces. The existing initiatives and applications focus on small domains, and access to the Semantic Web seems limited from the perspective of the average user.
However, there are already a wide range of applications in existence or under development. Some typical areas seem to offer great potential (although not fully exploited for the moment) for the development of such applications.
1. E-Science. These kinds of applications involve large data collections that require computationally intensive processing. The participants are usually distributed
across the world. A representative project is the Gene Ontology (GO) [5]. GO is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The Human Genome Project, finalized in 2003, is probably the most famous e-science project.
2. Travel Information Systems. There are efforts in the direction of building XML based specifications which would allow the interchange of information between companies. The benefits would be major for the users, since they would be able to easily plan a whole trip - accommodation, transportation etc. The big issue for the moment is the absence of an agreed ontology for this domain.
3. Digital Libraries. Over the past years, institutions such as universities, libraries and museums have made their large inventories of materials available online. Although they have the same goal, the implementations of these systems are totally different. It is difficult for one institution to access another one’s catalogues. One solution for this problem is the use of ontologies and of ontology mapping techniques, which would help achieve semantic interoperability.
4. Health Care. This domain stands to gain tremendous benefit from the adoption of Semantic Web technologies, as it depends on the interoperability of information from many domains and processes for efficient decision support.
At present, the Semantic Web is increasingly used by small and large businesses. Oracle (RDF management platform), IBM, Adobe (tool for adding RDF-based metadata to most of their file formats), Software AG and Yahoo! are the most important corporations that have already started working with these technologies and are already selling tools, as well as complete business solutions. In August 2008, Microsoft bought Powerset, a semantic search engine, for a reported $100 million.
There are also open source applications, such as Protege [6] and Kowari [7], that provide building blocks for application development, making it more cost effective to develop Semantic Web products.
Nature Inspired Methods in the Context of the Semantic Web

The vast amount, variety and heterogeneity of the data involved in the Semantic Web vision make it sometimes difficult for applications to deal with it, turning many real world problems into NP-hard problems. Nature inspired reasoning might be able to address and solve some of these issues.
Natural computing finds its source of inspiration in biological phenomena and in social behaviors, mainly of insects and birds. Such algorithms are able to find acceptable results for NP-hard problems within a reasonable amount of time, rather than guarantee the optimal solution. The most important methods inspired from nature include genetic algorithms, neural networks, particle swarm and ant colony optimization.

3 Genetic Algorithms

Genetic Algorithms (GAs) model techniques used by simple biological systems. These systems use reproduction to produce offspring that can better survive in their environment. Genetic algorithms use reproduction operators (mutation and crossover) and strategies (’survival of the fittest’) inspired from these realities, in order to improve the quality of solutions to a particular problem. The advantage of GAs compared to other algorithms and methods is that they make only a few assumptions about the underlying fitness landscape and, therefore, they perform well in many different problem categories.
These algorithms proceed according to a simple scheme:
1. a population of random individuals is created;
2. each individual is tested in order to determine its utility as a solution;
3. a fitness value is assigned to each individual, based on the previous evaluation;
4. a selection process filters out the individuals with low fitness and allows those with good fitness to enter the mating pool with a higher probability;
5. a reproduction process creates offspring by combining or varying the solution candidates;
6.
if the termination criterion is met, the evolution stops; otherwise, the process continues from step 2.
Genetic algorithms can be viable solutions for different problems that the Semantic Web is confronted with, such as RDF query path and ontology alignment optimization, Semantic Web service composition etc.
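The scheme above can be sketched in a few lines of Python; the toy fitness function (counting 1-bits) and all parameter values are illustrative, not taken from the paper:

```python
import random

random.seed(42)

def fitness(individual):
    # Toy utility measure: number of 1-bits (stand-in for a real evaluation).
    return sum(individual)

def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05):
    # 1. a population of random individuals is created
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 2-3. each individual is evaluated and assigned a fitness value
        scored = sorted(population, key=fitness, reverse=True)
        # 4. selection: only the fitter half enters the mating pool
        pool = scored[:pop_size // 2]
        # 5. reproduction: one-point crossover plus bit-flip mutation
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(pool, 2)
            cut = random.randrange(1, length)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if random.random() < mutation_rate else bit
                     for bit in child]
            offspring.append(child)
        # elitism keeps the best individual, so quality never degrades
        population = [scored[0]] + offspring[:pop_size - 1]
        # 6. a real termination criterion would be checked here
    return max(population, key=fitness)

best = evolve()
```

With elitism, the best fitness in the population is non-decreasing across generations, which is one simple way to make step 6's termination check well-behaved.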
The possibility of querying large amounts of data from different, heterogeneous sources in an efficient way is an unsolved problem at the moment. In this context, an interesting research field is the determination of query paths - the order in which the parts of a query are evaluated. This order plays a major role in the execution time of the query, so a good algorithm for determining the query path can contribute to quick, efficient querying. Genetic algorithms have already been tested, with some success, on problems in this field. The Iterative Improvement algorithm, followed by Simulated Annealing - also known as ’Two-Phase Optimization’ - addresses the optimal determination of query paths.
An RDF query can be seen as a chain of subject-predicate-object triples. It can be visualized as a tree, in which the leaf nodes represent the inputs and the internal nodes are relational algebra operations. The nodes in such a query can be ordered in many different ways, all of them producing the same result, but with different execution times. In these conditions, the challenge consists in determining the order in which the nodes should be placed, in order to optimize the response time.
It is not difficult to identify the solution space of the problem as the set of all possible RDF trees. A population can be created by randomly selecting some of these trees (the chromosomes). A simple mutation operator would switch the order of two random nodes (triples) in a chromosome. A crossover operator would pick some of the nodes from a chromosome, conserving their order, and put them together with the missing nodes taken from a second chromosome (also conserving the order from that second chromosome). The fitness function is calculated based on the execution time. Long execution times are not desirable for a GA in an RDF query execution environment, therefore the stopping condition should also consider (or be complemented with) a time limit.
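A minimal sketch of these two operators, representing a chromosome as an ordered list of triple identifiers (the identifiers and helper names are illustrative):

```python
import random

def mutate(path):
    """Switch the order of two random nodes (triples) in a chromosome."""
    path = path[:]
    i, j = random.sample(range(len(path)), 2)
    path[i], path[j] = path[j], path[i]
    return path

def crossover(parent_a, parent_b):
    """Pick some nodes from parent_a, conserving their order, and append
    the missing nodes in the order they appear in parent_b."""
    picked = set(random.sample(parent_a, len(parent_a) // 2))
    from_a = [t for t in parent_a if t in picked]
    from_b = [t for t in parent_b if t not in picked]
    return from_a + from_b
```

Both operators preserve the multiset of triples, so every offspring remains a valid ordering of the same query.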
Another interesting problem is ontology alignment optimization. At the moment, there is no generally agreed standard when it comes to ontologies. The diversity of data makes it even less likely that such a standard will be possible in the near future - standards often do not fit the specific needs of all the participants in a potential standardization process, and it is very difficult and expensive for many organizations to reach an agreement. Thus, ontology alignment is a key aspect in making knowledge exchange possible in the context of the Semantic Web.
Many attempts have been made to solve this issue using different combinations of matchers, such as string normalization or similarity, data type comparison, linguistic methods, inheritance analysis, graph mapping, taxonomy analysis etc. A solution involving genetic algorithms would be able to cope with huge amounts of data, without requiring human intervention. There are two difficult tasks when defining the problem from the GA point of view: the content of a tentative solution should be encoded in a string of values, and a good fitness function should be provided (a similarity measure function between two ontologies).
Genetics for Ontology ALignments (GOAL) [8] is a software tool for optimizing ontology matching functions. GOAL defines the alignment evaluation process based on four goals: optimizing the precision, optimizing the recall, optimizing the f-measure or reducing the number of false positives. A chromosome is defined through a method that converts a bit representation to a set of floating-point numbers in the real range [0, 1]. The fitness function consists of selecting one of the parameters retrieved by an alignment evaluation. The parameters are:
– precision - the fraction of the returned correspondences that are correct;
– recall - the fraction of the correct correspondences that are returned;
– f-measure - the harmonic mean of precision and recall;
– false positives - relationships which have been provided although they are false.
The algorithm has its limitations, but it has managed to find the optimal solution for different instances of the ontology mapping problem in an efficient way.
Semantic Web service composition consists in finding web services (available in a repository) that are able to accomplish a certain task. The task is defined in the form of a composition request that contains a set of available input parameters and a set of wanted output parameters. The parameters are not explicit values, but concepts from an ontology describing the semantics of the values. A sequence of services is called a composition. If the input parameters given in the request are provided, the services from this sequence can be subsequently executed and will finally produce the desired output parameters. For a genetic algorithm approach, one needs to find a way of representing a web service sequence as a chromosome.
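The evaluation parameters used by GOAL can be made concrete with a short sketch; the correspondence sets below are invented for illustration:

```python
def precision(found, reference):
    # fraction of the returned correspondences that are correct
    return len(found & reference) / len(found)

def recall(found, reference):
    # fraction of the correct correspondences that are returned
    return len(found & reference) / len(reference)

def f_measure(found, reference):
    # harmonic mean of precision and recall
    p, r = precision(found, reference), recall(found, reference)
    return 2 * p * r / (p + r)

# Hypothetical alignments: pairs of matched concept names.
reference = {("Person", "Human"), ("City", "Town"), ("Car", "Automobile")}
found = {("Person", "Human"), ("City", "Town"), ("Dog", "Cat")}
```

Here two of the three returned correspondences are correct, so precision, recall and f-measure all come out to 2/3.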
A simple solution is to use strings of service identifiers, which can be processed by standard genetic algorithms. Considering that the chromosomes can have variable length, the normal GA operators can be modified in order to make the search more efficient. A mutation operation, for instance, either deletes the first service from a sequence or adds a promising service to the sequence. The other standard GA operations can be easily applied.

4 Neural Networks

An artificial neural network (ANN) is a system loosely modeled on the human brain, an emulation of the biological neural system. It consists of an interconnected group of artificial neurons. The information is processed using a connectionist approach to computation. Generally, an ANN is an adaptive system, changing its structure according to the information that flows through the network during the learning phase. ANNs can be used to model complex relationships between inputs and outputs or to find patterns in data.
In the context of the Semantic Web, artificial neural networks can be used in the
process of ontology mapping. The heterogeneity among different ontologies is one of the biggest issues in this field nowadays. Web applications are developed by different parties that design their own ontologies, according to their own views of the world. Many approaches have been proposed in order to deal with this heterogeneity, but each of them has its drawbacks. A centralized ontology is very unlikely, so the efforts are now focused on distributed solutions: trying to match the individual ontologies, and possibly reuse one another’s work as well. Most of the existing techniques are either rule-based or learning-based, but both categories have their disadvantages.
A different approach combines rule-based and learning-based solutions, integrating machine learning techniques, such that the weights of a concept’s semantic aspects can be learned from training examples, instead of being pre-defined. In the real world, a common problem is the lack of instance data - either in quantity or quality. This method avoids the problem, because the learning process is carried out at the schema level, instead of the instance level. Artificial neural networks are a good solution for the learning process, for many reasons: instances are represented by attribute-value pairs; the target function output is real-valued; fast evaluation of the learned target function is preferable. ANNs are also known to perform well in the presence of noisy data. If the ontologies are to be learned from uncontrolled data, such as real existing web pages, the handling of noise becomes a real issue.
Another interesting approach to the problem of ontology mapping is the use of interactive activation and competition (IAC) neural networks to search for a global optimal solution that best satisfies the ontology constraints. An IAC neural network consists of a number of competitive nodes connected to each other.
Each of these nodes represents a hypothesis, while the connection between two nodes is a constraint between their hypotheses. The connection can be either positive (activation) - if the hypotheses support each other - or negative (competition). Each connection has a weight, which is proportional to the strength of the constraint. The activation of a node is determined by several sources:
– the initial activation;
– the input from its adjacent nodes;
– its bias;
– the external input.
The characteristics of ontology mapping and the mechanisms of the IAC network have common properties. The constraints in ontology mapping can be interactive or competitive between mapping hypotheses. Before applying a neural network based algorithm for learning, a preliminary mapping is made, which estimates both the linguistic and the structure information of the ontologies. This prior knowledge can be seen as the external input or the bias of a node in the IAC network.
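A single synchronous update of such a network can be sketched as follows; the update rule and parameter values follow the classic interactive activation model and should be read as an illustration, not as the exact formulation used for ontology mapping:

```python
def iac_step(activations, weights, external, a_max=1.0, a_min=-0.2,
             rest=-0.1, decay=0.1):
    """One synchronous update of an IAC network.

    activations: current activation of each node (each a hypothesis)
    weights[i][j]: constraint weight from node j to node i
                   (positive = support, negative = competition)
    external: external input / bias per node
    """
    new = []
    for i, a in enumerate(activations):
        # only positively activated neighbors propagate their signal
        net = external[i] + sum(w * max(act, 0.0)
                                for w, act in zip(weights[i], activations))
        if net > 0:
            delta = net * (a_max - a) - decay * (a - rest)
        else:
            delta = net * (a - a_min) - decay * (a - rest)
        new.append(min(a_max, max(a_min, a + delta)))
    return new
```

Iterating this step lets mutually supporting hypotheses reinforce each other while competing ones suppress each other, which is how the network settles toward a constraint-satisfying mapping.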
5 Swarm Intelligence

Swarm intelligence is another approach to problem solving that takes inspiration from the social behaviors of insects and other animals. In particular, ant colony optimization is one of the most successful techniques. Ant colony optimization (ACO) is inspired by the ants that deposit pheromone on the ground in order to mark a favorable path that should be followed by other members of the colony. A similar mechanism has been transposed into an algorithm for solving optimization problems.
Semantic Web reasoning systems deal with growing amounts of distributed, dynamic resources. Swarm intelligence could be used in order to implement an RDF graph traversal algorithm. Among the main properties of swarms are adaptiveness, robustness and scalability. These correspond to three concepts - no central control, locality and simplicity. Thus, the combination of reasoning and swarm intelligence can be a viable solution for obtaining reasoning performance by basic means.
A model of a decentralized system implies the traversal of a graph in order to calculate the deductive closure of the graph with respect to the RDFS semantics. The role of swarm intelligence is to reduce the computational cost. In order to calculate the RDFS closure over an RDF graph, a set of rules needs to be applied repeatedly to the triples in the graph. In the metaphor of ants, each insect represents one of these rules, which might be (partially) instantiated. Ants communicate with each other only locally and indirectly. Whenever the condition of an ant’s rule matches the node the ant is on, it locally adds the newly derived triple to the graph. Only the active reasoning rules are moving in the network, not the data, which minimizes network traffic, since schema data is far less numerous than instance data. Having some transition capabilities between graph boundaries, the method converges towards the closure.
This model has been successfully implemented and the results are described in [9].
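The rule-as-ant idea can be sketched with a single RDFS rule (rdfs9: an instance of a subclass is also an instance of the superclass); the traversal below is a drastic simplification of the decentralized model, and all names are illustrative:

```python
# The ant carries one RDFS rule, rdfs9:
#   (x rdf:type C1) and (C1 rdfs:subClassOf C2)  =>  (x rdf:type C2)
def rdfs9_ant(graph):
    """Apply rdfs9 wherever the rule's condition matches a triple,
    adding newly derived triples until nothing new is produced."""
    derived = True
    while derived:
        derived = False
        for (s, p, o) in list(graph):
            if p == "rdf:type":
                for (s2, p2, o2) in list(graph):
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new = (s, "rdf:type", o2)
                        if new not in graph:
                            graph.add(new)
                            derived = True
    return graph

graph = {
    ("ex:Jane", "rdf:type", "ex:Mother"),
    ("ex:Mother", "rdfs:subClassOf", "ex:Woman"),
    ("ex:Woman", "rdfs:subClassOf", "ex:Person"),
}
closure = rdfs9_ant(graph)
```

In the full swarm model the loop is replaced by many ants walking the (distributed) graph independently, each adding derivations locally; the sequential sweep above only shows what a single rule-ant computes.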
Conclusions

The Semantic Web vision comes with the promise of a world in which a common understanding of the meaning of data can help humans and computers cooperate. However, it takes great effort to put the revolutionary ideas into practice, since it is very difficult to agree upon standards and, afterwards, to update the existing resources according to the potential standards. For the moment, the Semantic Web seems to be scattered into small pieces, being available only on a small scale and for very specific domains. On the other hand, there is a huge amount of knowledge that can be exploited through automated processing and adapted in order to be used. In this context, methods inspired from nature seem to have the potential to address the currently unresolved problems of the Semantic Web. These methods can deal with large amounts of data and can be used to build highly scalable applications. Since there is no perfect - and therefore no optimal - solution for these problems, concepts such as genetic algorithms, artificial neural networks or swarm intelligence might be able to provide good results.
This paper presented some ideas for applying nature inspired methods in order to deal with the Semantic Web’s challenges. The main aspects of the Semantic Web have been described, as well as its evolution during the past decade. The areas of interest and some (potential) applications have been presented, and the most important problems have been introduced and explained. Finally, the paper presented the way methods inspired from nature can address the problems of the Semantic Web. Three of the most important techniques - genetic algorithms, swarm intelligence, artificial neural networks - have been briefly described, along with the efforts of applying them to problems such as ontology mapping, RDF query path optimization, RDF graph traversal and ontology alignment optimization.

References

[1] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web.
Scientific American, May 2001.
[2] Tom Gruber. What is an Ontology? ontology.html
[3] Semantic Web Layer Cake.
[4] Alex Iskold. Semantic Web: Difficulties with the Classic Approach.
[5] Gene Ontology Project.
[6] Protege.
[7] Kowari.
[8] Jorge Martinez-Gil, Enrique Alba, Jose F. Aldana Montes. Optimizing Ontology Alignments by Using Genetic Algorithms. Proceedings of Nature inspired Reasoning for the Semantic Web (NatuReS), 2008.
[9] Kathrin Dentler, Stefan Schlobach, Christophe Guéret. Semantic Web Reasoning by Swarm Intelligence. Vrije Universiteit Amsterdam.
[10] Human Genome Project. Genome/home.shtml
[11] Riccardo Leardi. Nature Inspired Methods in Chemometrics: Genetic Algorithms and Artificial Neural Networks. Data Handling in Science and Technology, Volume 23. Elsevier, 2003.
[12] Alexander Hogenboom, Viorel Milea, Flavius Frasincar, Uzay Kaymak. Genetic Algorithms for RDF Query Path Optimization. Proceedings of Nature inspired Reasoning for the Semantic Web (NatuReS), 2008.
[13] Thomas Weise, Steffen Bleul, Diana Comes and Kurt Geihs. Different Approaches to Semantic Web Service Composition. WowKiVS, 2009.
[14] Neural network.
[15] John Cardiff. The Evolution of the Semantic Web. Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland.