Semantic Web mining using nature inspired optimization methods



In this paper, nature inspired methods are proposed for solving problems in the field of Semantic Web mining, namely the clustering of Web resources based on their metadata, as well as the automatic classification of Web pages.

Published in: Education, Technology

Diana Andreea Gorea, Lucian Bentea
Faculty of Computer Science, “A.I. Cuza” University, Iași, Romania

Abstract. In this paper, nature inspired methods are proposed for solving problems in the field of Semantic Web mining, namely the clustering of Web resources based on their metadata, as well as the automatic classification of Web pages.

1 Introduction

This paper proposes the use of nature inspired methods for solving the problem of RDF clustering, as well as that of the automatic classification of Web pages. The most promising methods that the authors found are those belonging to the Ant Colony Optimization (ACO) framework. While this paper does not aim to give an introduction to ACO, the interested reader can refer to [3] for further information.

The paper is organized as follows. Section 2 describes efficient heuristics in two different cases: when the number of clusters is predetermined, and when it is unknown and is part of the solution. By clustering Semantic Web resources, it is possible to find representatives for a set of similar resources and thus reduce the size of large ontologies. This would also bring insight into the main concepts that an ontology contains. Section 3 summarizes the paper [6] and discusses how ACO heuristics can be used to find classification rules for Web pages. Section 4 draws the conclusions and suggests subjects for further research.

2 Clustering of Semantic Web data

The data clustering problem refers to grouping a set of data into several nonempty subsets whose members are considered similar with respect to some similarity measure. In the context of Semantic Web data, which can be represented through RDF graphs, the clustering problem becomes that of grouping individuals in the graph.
An individual, also called an instance in [5], is a single resource node together with some of its neighbouring nodes, forming a subgraph that is relevant to that resource node. Several instance extraction methods are proposed in [5]: Immediate Properties, Concise Bounded Description (CBD), or Depth Limited Crawling. The optimal method to use depends on the type of data to be processed (e.g. RDF data converted from a relational database, FOAF documents, etc.) and on the structure of its associated RDF graph. The same criterion holds when choosing the optimal similarity measure; the authors of [5] also propose three distance measures: one based on feature vectors (denoted simFV), one based on conceptual graphs, inspired by the similarity measure of conceptual graphs introduced in [10], and an ontology based measure (denoted simOnt).

2.1 Predetermined number of clusters (the ACOC algorithm)

Assuming that a set $\Omega := \{X_1, X_2, \ldots, X_m\}$ of individuals is extracted from an RDF graph $G$, and without giving an explicit formula for the above similarity measures, the RDF data clustering problem can be formally described as the following discrete optimization problem. Let sim be a similarity measure, e.g. simFV or simOnt above. Also let $n \ge 1$ be the predetermined number of clusters into which the data is to be grouped and denote by $C_1, C_2, \ldots, C_n \in \Omega$ the variables to be determined as the centers of the clusters. By defining the variable $w_{ij}$ through

\[ w_{ij} := \begin{cases} 1, & \text{the individual } X_i \text{ belongs to cluster } j, \\ 0, & \text{otherwise,} \end{cases} \tag{1} \]

for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, the aim is to

\[ \text{Maximize } \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \, \mathrm{sim}(X_i, C_j), \tag{2} \]

such that each individual belongs to only one cluster,

\[ \sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{3} \]

and there are no empty clusters,

\[ \sum_{i=1}^{m} w_{ij} \ge 1, \quad j = 1, \ldots, n. \tag{4} \]

To the best of the authors' knowledge, there is no proof of NP-hardness for this general clustering problem. The most recent result on this subject is the article [8], which proves that the clustering problem known as the k-means problem is NP-hard in the restricted case of points in the plane.
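To make the formulation (1)-(4) concrete, the following minimal Python sketch evaluates the objective (2) for a given choice of cluster centers and checks constraint (4). It is only an illustration under stated assumptions: individuals are modelled as plain sets of features, and the `jaccard` function is a toy stand-in for a similarity measure, not the simFV or simOnt measures of [5]. The function names are hypothetical.

```python
def jaccard(a, b):
    """Toy similarity in [0, 1]; an assumed stand-in, NOT simFV or simOnt from [5]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign(individuals, centers, sim=jaccard):
    """Return w as a list where w[i] is the index j of the center most similar to X_i.

    Storing exactly one cluster index per individual enforces constraint (3).
    Ties are broken in favour of the lowest cluster index.
    """
    return [max(range(len(centers)), key=lambda j: sim(x, centers[j]))
            for x in individuals]


def objective(individuals, centers, w, sim=jaccard):
    """Value of (2): the sum of sim(X_i, C_j) over all assigned pairs."""
    return sum(sim(x, centers[j]) for x, j in zip(individuals, w))


def no_empty_cluster(w, n):
    """Constraint (4): every cluster index 0..n-1 is used at least once."""
    return set(w) == set(range(n))


# Four toy individuals, two of which are chosen as cluster centers.
X = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
C = [X[0], X[2]]
w = assign(X, C)
score = objective(X, C, w)
feasible = no_empty_cluster(w, len(C))
```

For fixed centers, assigning each individual to its most similar center maximizes (2) subject to (3); the hard part of the problem, which heuristics such as ACOC address, is the choice of the centers themselves.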
However, as is the case with most discrete optimization problems, clustering of RDF data is computationally expensive and approximate solution methods are preferred.

One of the most promising algorithms for solving the previous optimization problem is Ant Colony Optimization for Clustering (ACOC), introduced in [7], which is an alternative to the classic k-means algorithm, known to have several drawbacks. The numerical results in [7] show that, on several test cases, ACOC obtains the best results among various approximation methods, including the k-means algorithm. It also achieves this with the highest convergence rate, therefore requiring only a few iteration steps to detect the optimum. Since ACOC is part of the Ant Colony Optimization framework, the idea is to have several ants “foraging” for the optimum, thus avoiding premature convergence due to local optima. Apart from using the idea of pheromone trails, each node to be explored also carries a heuristic value, representing the estimated global gain from picking that node; this is used to accelerate the convergence of the algorithm. Eventually, ants are grouped into clusters and a solution to the original RDF clustering problem can be obtained through a decoding algorithm.

2.2 Variable number of clusters (from SSCFL to RDF clustering)

When the number of clusters is not predetermined, but each cluster can hold only a limited number of individuals, the previous problem can be formulated as a Single Source Capacitated Facility Location (SSCFL) problem, which can be described as follows. Consider several facilities (e.g. medical or telecommunications facilities) that are installed at different locations in a city. These facilities provide goods to a number of customers, whose demands are known beforehand. Each facility comes with the necessary logistics to create a physical network that would allow customers to connect to the facility. However, each facility only provides a fixed amount of resources to the customers who connect to it. The available amount of resources corresponding to a facility is called its capacity; hence the adjective capacitated in the name of this optimization problem.
The question is which of the facilities to open and which customers should be assigned to each open facility, so that the total costs of opening the facilities and of creating the physical networks are minimized, while making sure that each customer's demand is satisfied by exactly one facility.

In Figure 1, a solution to a particular SSCFL problem is represented. The customers are the light green rounded rectangles, while the facilities are the light red circles. The arrows denote assignment relations: the tip of the arrow points to the facility to which the customer is assigned. The number on each facility node designates its capacity, while the number on each customer node represents its demand. Notice that the given solution is feasible, i.e. the total demand of the customers assigned to a facility does not exceed its maximum capacity and no customers are left unassigned. Also, in this case, it was decided that three facilities (having capacities 1, 6, 10) remain closed.

Fig. 1. Solution to a particular SSCFL problem

In order to adapt the SSCFL problem to RDF clustering, the customers correspond to the individuals that need to be grouped and the facilities represent the centers of the clusters, which can be activated or not. Thus, consider the variable $w_{ij}$ defined as in the previous subsection and let $y_i \in \{0, 1\}$ be the Boolean variable specifying whether the i-th facility is to be opened or not, for all $i$. Also, denote by $\alpha_i$ the cost of opening the i-th facility, which is the same as the cost of taking the individual $X_i$ to be a cluster center, and by $\alpha_{ij}$ the cost of assigning the i-th customer to the j-th facility, for all $i, j$ with $1 \le i \le m$ and $1 \le j \le n$. In the case of RDF data clustering, the costs $\alpha_{ij}$ represent the opposite of the similarity measure between the individual $X_i$ and the cluster center $C_j$ and they are given by:

\[ \alpha_{ij} = -\mathrm{sim}(X_i, C_j), \quad i = 1, \ldots, m, \quad j = 1, \ldots, n. \tag{5} \]

Provided that the facilities (the potential cluster centers) have corresponding capacities $u_1, u_2, \ldots, u_m \in \mathbb{R}_+$, the aim of this adapted SSCFL problem is then to

\[ \text{Minimize } \sum_{i=1}^{m} \alpha_i y_i + \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_{ij} w_{ij}, \tag{6} \]

subject to the following constraints:

- each customer is assigned to exactly one facility (each individual $X_i$ is assigned to exactly one cluster),

\[ \sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{7} \]

- provided that a facility is open (a cluster center is activated), the total demand of the customers assigned to it (the demand of a group of individuals to belong to the corresponding cluster) cannot exceed its capacity; also, a customer cannot be assigned to a facility that is closed (an individual cannot be represented by a cluster center that is not activated),

\[ \sum_{i=1}^{m} d_i w_{ij} \le u_j y_j, \quad j = 1, \ldots, n, \tag{8} \]

where $d_i$ denotes the demand of the i-th customer,
- a customer can either be assigned or not to a facility (an individual can either be included or not in a group),

\[ w_{ij} \in \{0, 1\}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n, \tag{9} \]

- facilities can either be open or closed (cluster centers can either be activated or not),

\[ y_i \in \{0, 1\}, \quad i = 1, \ldots, m. \tag{10} \]

Note: Before carrying on, notice that in a solution to this problem there may be individuals that remain ungrouped, which is not necessarily a drawback. On the contrary, this may provide more realistic solutions to the clustering problem.

The previous integer programming problem is proven in [9] to be NP-hard and therefore heuristic solution techniques are needed to handle its complexity. A survey of the more recent heuristics is given in [1], where the methods of Tabu Search, Simulated Annealing and Genetic Algorithms are compared with respect to their efficiency under different parameters. An alternative solution based on Genetic Algorithms is the subject of [2], in which two special crossover operators are defined, guaranteeing the feasibility of the approximations. Also, the Particle Swarm Optimization algorithm described in [11] and the Ant Colony Optimization algorithm in [13] have the potential to be adapted to the RDF clustering problem.

3 Web page classification using Ant Colony Optimization

The Semantic Web is a combination of data from different sources integrated in a common format, as opposed to the original Web, which concentrated mainly on the exchange of documents. It also has a format that connects data to objects from the real world. As a result, an information seeker may move from one database to another simply because the two are linked by sharing knowledge about the same thing [12]. However, all of this is produced by humans, so subjectivity must be taken into account, together with the errors that may occur in the placement, content or classification of knowledge.
For Web pages without user contributions (such as portfolio sites or advertising pages), the quality of the content lies solely in the hands of the site owner, who may or may not be aware of the mistakes. Once other users appear who have the right to upload, tag or write content, keeping the information as accurate as possible becomes considerably harder. The study [6] shows how general Web content can be classified using an Ant Colony algorithm, together with the results obtained. We present this study and try to connect its findings with what is known to apply to the Semantic Web as well.
3.1 Preprocessing

The challenge when dealing with Web pages is that developers do not always follow a standardized way of creating them. There are several reasons for this: design implementation issues that may require certain tricks (fully Flash-based sites have no <h1> tag), lack of interest or knowledge in applying the standards, missing or badly chosen <meta> tags (too many, or unrelated to the page content), and generic <title> tags (all pages have the same title). At least regarding meta tags, things started to improve once everyone realised the advantages of being well ranked by search engines. This generated more attention to the content of those tags and a very high interest in SEO (search engine optimisation). In general, this would not be an issue for the Semantic Web, because its formats are standardized and not yet very widespread, so that, in theory at least, exceptions from the rules are few.

The contents of Web pages can be filtered using text preprocessing methods, to obtain a smaller set of relevant words to search for and a more human-like understanding of the given text. The most difficult requirement for these methods is to handle homographs well, i.e. words that share the same spelling but have different meanings [14]. Examples are stalk (part of a plant) and stalk (to follow or harass a person), or left (opposite of right) and left (past tense of leave) [15].

For the study, WordNet (a lexical database that offers some relationships between words [4]) was used to filter the information. From it, the authors of the study selected:

- the morphological preprocessor (which combines words like make, made, making into the single word make), to reduce the number of words to search in;
- the identification of all nouns in the text, as they may offer relevant search information. Note, however, that nouns may have the same spelling as verbs (a large number of examples can be found in [16]);
- the word's lexical family.
If the text has words like roof, window and door, they may all relate to house. This technique is double-edged: for some associated words the result may be a link that does not really exist (this is especially the case for homographs), while in other cases (such as the one described above) it may bring a significant increase in efficiency.

As far as the Semantic Web is concerned, all three methods may offer interesting alternatives for the end results:

- The morphological processor is an interesting option, as a word written in natural language may be linked to another, and only the latter is relevant. However, a word like left may not remain the same after this processing step, but become leave. With this in mind, it is probably a good idea to keep both forms when dealing with the Semantic Web.
- The distinction between nouns and verbs is not so relevant when searching for a word in the Semantic Web, but it becomes significant in terms of SPARQL queries. Here, however, the way the syntax is formed already indicates which one is the noun and which is the verb.
- The connections between different types of words are relevant only if multiple words are searched for at the same time; some common denominators may then be used to provide results that match as many of the given items as possible.

For both search types, the end result should be a list of search words. For Web mining it should only contain the most relevant words, while for the Semantic Web it should contain first the words obtained by joining the semantics, then the morphologically obtained values (if any) and finally the words themselves. This may seem an unnecessary overhead, but it may help the end user to better understand the results given, with the first entries being the most relevant.

3.2 Algorithm

The Ant-Miner algorithm is a variation of the Ant Colony paradigm, used in data mining. In the beginning, it initialises the training set with all available training cases (Web pages) and creates an empty rule list. In a Repeat-Until loop, one classification rule at a time is discovered. First, all trails are initialised with the same quantity of pheromone (giving them the same chance of being selected) and an inner loop lets the ants select the best option. Each ant selects the path to follow based on the paths followed by the previous ants, which are marked by pheromone traces: the higher the amount of pheromone, the better the path. In the second step, the irrelevant terms are removed, and in the third step the pheromone values are updated. The inner loop continues until a stopping condition is fulfilled (e.g. a maximum number of paths has been generated). After the inner loop finishes, the highest-quality rule is chosen and added to the list of discovered rules. All training cases that satisfy the rule are removed from the training set, which ensures that the next run of the inner loop works on fewer cases than the previous one. The outer loop continues its execution until a stopping criterion is satisfied (e.g. the number of uncovered cases falls below a given maximum). The algorithm returns the list of discovered rules.
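The loop structure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in [6]: training cases are dictionaries of categorical attributes with a hypothetical "label" key, the rule quality is assumed to be sensitivity times specificity, the problem-dependent heuristic term is omitted so that only pheromone guides the ants, and all function and parameter names are invented for this example.

```python
import random
from collections import Counter


def covers(rule, case):
    """A rule is a list of (attribute, value) terms; it covers a case matching all of them."""
    return all(case[a] == v for a, v in rule)


def majority_class(cases):
    return Counter(c["label"] for c in cases).most_common(1)[0][0]


def quality(rule, label, cases):
    """Assumed quality measure: sensitivity * specificity (0.0 when undefined)."""
    tp = sum(1 for c in cases if covers(rule, c) and c["label"] == label)
    fp = sum(1 for c in cases if covers(rule, c) and c["label"] != label)
    fn = sum(1 for c in cases if not covers(rule, c) and c["label"] == label)
    tn = len(cases) - tp - fp - fn
    if tp + fn == 0 or fp + tn == 0:
        return 0.0
    return (tp / (tp + fn)) * (tn / (fp + tn))


def construct_rule(cases, terms, pheromone, rng, min_cases):
    """One ant adds terms, chosen with probability proportional to pheromone."""
    rule, used = [], set()
    while True:
        candidates = [t for t in terms if t[0] not in used
                      and sum(covers(rule + [t], c) for c in cases) >= min_cases]
        if not candidates:
            return rule
        term = rng.choices(candidates, weights=[pheromone[t] for t in candidates])[0]
        rule.append(term)
        used.add(term[0])


def prune(rule, cases):
    """Step two: drop terms as long as the rule quality does not decrease."""
    def score(r):
        label = majority_class([c for c in cases if covers(r, c)])
        return quality(r, label, cases), label
    q, label = score(rule)
    while len(rule) > 1:
        shorter = max(([u for u in rule if u != t] for t in rule),
                      key=lambda r: score(r)[0])
        q2, label2 = score(shorter)
        if q2 < q:
            break
        rule, q, label = shorter, q2, label2
    return rule, label, q


def ant_miner(cases, attributes, n_ants=20, max_uncovered=0, min_cases=1, seed=0):
    rng = random.Random(seed)
    cases, rules = list(cases), []
    while len(cases) > max_uncovered:                       # outer Repeat-Until loop
        terms = sorted({(a, c[a]) for c in cases for a in attributes})
        pheromone = {t: 1.0 for t in terms}                 # equal initial pheromone
        best = None
        for _ in range(n_ants):                             # inner loop: one path per ant
            rule, label, q = prune(
                construct_rule(cases, terms, pheromone, rng, min_cases), cases)
            for t in rule:                                  # step three: reinforce used terms
                pheromone[t] += pheromone[t] * q
            total = sum(pheromone.values())
            for t in pheromone:                             # normalising acts as evaporation
                pheromone[t] /= total
            if best is None or q > best[2]:
                best = (rule, label, q)
        rules.append(best[:2])
        cases = [c for c in cases if not covers(best[0], c)]  # remove covered cases
    return rules


# Toy training set: attribute values stand in for preprocessed <title> and <meta> words.
pages = [
    {"title": "sport", "meta": "football", "label": "Sport"},
    {"title": "sport", "meta": "tennis", "label": "Sport"},
    {"title": "news", "meta": "politics", "label": "News"},
    {"title": "news", "meta": "economy", "label": "News"},
]
rules = ant_miner(pages, ["title", "meta"])
```

The returned list is ordered: a new page would be classified by the first rule whose terms it satisfies. The stopping criteria here (a fixed number of ants per rule, a maximum number of uncovered cases) are illustrative choices only.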
3.3 Experiment

The study took into account the <meta> and <title> contents of the BBC site. This site was chosen because of its high coding standards and its very well structured information, which improved the chance of finding very good connections between the <meta> tags and the content.

4 Conclusions and further research

This paper shows how nature inspired optimization methods can be more efficient than classical, exact methods when implementing Semantic Web mining algorithms. Among these, the Ant Colony Optimization metaheuristic appears to be one of the most promising solution techniques. As future work, the ideas described in the previous sections need to be implemented and thoroughly tested, as nature inspired methods have rarely been used in the context of mining the Semantic Web. Such an implementation would then allow the clustering of resources based on their associated metadata, e.g. their FOAF description, the microformat information they contain, etc.

References

1. Arostegui, M.A., Jr., Kadipasaoglu, S.N., Khumawala, B.M., An empirical comparison of Tabu Search, Simulated Annealing, and Genetic Algorithms for facilities location problems, International Journal of Production Economics, Vol. 103, No. 2, 742-754, 2006.
2. Cortinhal, M.J., Captivo, M.E., Genetic Algorithms for the Single Source Capacitated Location Problem: a Computational Study, in the Proceedings of the 4th Metaheuristics International Conference, 355-359, Porto, Portugal 2001.
3. Dorigo, M., Stützle, T., Ant Colony Optimization, MIT Press, 2004.
4. Fellbaum, C. (Ed.), WordNet - an electronic lexical database, MIT, 1998.
5. Grimnes, G.A., Edwards, P., Preece, A., Instance based Clustering of Semantic Web Resources, in the Proceedings of the 5th European Semantic Web Conference, LNCS 5021, Springer-Verlag, pp. 303-317, 2008.
6. Holden, N., Freitas, A.A., Web Page Classification with an Ant Colony algorithm, in the Proceedings of the 8th International Conference on Parallel Problem Solving from Nature, LNCS 3242, Springer-Verlag, pp. 1092-1102, 2004.
7. Kao, Y., Cheng, K., An ACO-Based Clustering Algorithm, in the Proceedings of the Ant Colony Optimization and Swarm Intelligence Conference, LNCS 4150, pp. 340-347, 2006.
8. Mahajan, M., Nimbhorkar, P., Varadarajan, K., The Planar k-means Problem is NP-hard, in the Proceedings of the 3rd International Workshop on Algorithms and Computation, LNCS 5431, pp. 274-285, 2009.
9. Mirchandani, P.B., Francis, R.L., Discrete location theory, New York: Wiley, 1990.
10. Montes-y-Gómez, M., Gelbukh, A., López-López, A., Comparison of Conceptual Graphs, in Lecture Notes in Artificial Intelligence, Volume 1793, Springer-Verlag, pp. 548-556, 2000.
11. Sevkli, M., Guner, A.R., A Continuous Particle Swarm Optimization Algorithm for the Uncapacitated Facility Location Problem, in the Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006, 316-323, Brussels, Belgium 2006.
12. The official W3C Semantic Web Activity page.
13. Venables, H., Moscardini, A., An Adaptive Search Heuristic for the Capacitated Fixed Charge Location Problem, in the Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006, 348-355, Brussels, Belgium 2006.
14. The wapedia page on homographs.
15. The wapedia page on homonyms.
16. Words that can be used both as nouns and verbs, http://www.dailywritingtips.com/careful-with-words-used-as-noun-and-verb/