International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 5, September - October (2013), pp. 84-90
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

GENETIC ALGORITHM WITH A RANKING BASED OBJECTIVE FUNCTION AND INVERSE INDEX REPRESENTATION FOR WEB DATA MINING

Suresh Subramanian (1), Dr. Sivaprakasam (2)
(1) Research Scholar, Karpagam University, Coimbatore, India
(2) Department of Computer Science, Sri Vasavi College, Erode, India

ABSTRACT

The internet has become part of human life, and the information available on the World Wide Web (WWW) has increased drastically. The WWW is a potential resource for any kind of Information Retrieval (IR); however, extracting relevant information has become a major issue for everyone. This paper describes the viability of using evolutionary algorithms in web mining to provide the best results for a user query. The proposed Genetic Algorithm with a Ranking Based Objective Function determines the documents which best match the user query based on relevance, combined with the inverse indexing model. Modifying the fitness function presented in GAHWM shows a significant improvement in determining the relevant files, and the execution time is reduced.

Keywords: Genetic Algorithm, Inverted Index, Information Retrieval, Web Mining.

1. INTRODUCTION

Perhaps one of the most used features of the internet is the search engine. Since the World Wide Web contains a lot of information, this service helps users find the most relevant pages for their query. Web Mining is a new learning area developed especially for this.
It collects different data across the internet and learns its contents for further processing. The data collected can be used for searching, indexing, information retrieval, and application services, among many others.

The Genetic Algorithm (GA) is a search algorithm based on evolutionary theory. It is an example of a stochastic search algorithm which simulates natural selection as described in the theory of Charles Darwin. A genetic algorithm is composed of individuals, or chromosomes, each of which represents a solution. Each solution is evaluated through a fitness function which determines how well it solves the problem.
Studies on evolutionary algorithms show that they can be applied to Information Retrieval, since retrieval can be seen as an optimization problem. An optimization problem does not require a solution that is absolutely correct, but one that is close to the correct solution.

A short description of preprocessing is presented in Section 2, followed by the Genetic Algorithm process in Section 3. Genetic operators are discussed in Section 4, Section 5 discusses the implementation of the proposed function, and the comparative results are discussed in Section 6. Section 7 provides recommendations for future improvement and Section 8 presents the conclusion.

2. PREPROCESSING

To optimize the searching algorithm, the documents are first modeled into a data structure from which the method can easily read them and determine whether they contain keywords that match the user input [1]. The basic web searching method involves the terms in a user query and a database containing the files. To avoid the load of searching each document individually, a model is used to represent these documents.

Inverse index data mapping is a form of data structure in which the contents of a file are mapped to the filename itself. Each word in this collection is called a term, and for each term we maintain a list, called the inverted list, of all the documents in which the word appears [2]. For this application, inverse-index mapping is used: each unique term found in the documents is mapped to the files which contain it. The map also contains the frequency of the term in the document and its weight as determined by Table 1. This model allows the algorithm to determine all the files which contain a term in constant time [3].

Parameter    Weight
Title        6
Headers      5
Anchor       4
Bold         3
Body         1

Table 1. Parameters used for indexing and their weights

Figure 1. Inverted-index model data structure for documents
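
The mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the names `ZONE_WEIGHTS` and `build_inverted_index`, the whitespace tokenizer, and the rule that a term found in several zones of one document keeps the highest zone weight are hypothetical choices. The paper only states that the term frequency and a weight from Table 1 are stored.

```python
from collections import defaultdict

# Zone weights taken from Table 1.
ZONE_WEIGHTS = {"title": 6, "headers": 5, "anchor": 4, "bold": 3, "body": 1}


def build_inverted_index(documents):
    """Map each term to {doc_id: {"freq": count, "weight": zone weight}}.

    `documents` maps a file name to {zone name: text}. Zone names must be
    keys of ZONE_WEIGHTS (an unknown zone raises KeyError).
    """
    index = defaultdict(dict)
    for doc_id, zones in documents.items():
        for zone, text in zones.items():
            weight = ZONE_WEIGHTS[zone]
            # Assumption: naive lower-casing and whitespace tokenization.
            for term in text.lower().split():
                entry = index[term].setdefault(doc_id, {"freq": 0, "weight": 0})
                entry["freq"] += 1
                # Assumption: keep the highest zone weight seen for this term.
                entry["weight"] = max(entry["weight"], weight)
    return dict(index)
```

With a hash-based map such as this one, looking up a single query term, for example `index.get("database", {})`, takes constant time on average, which is the property the text relies on.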

3. GENETIC ALGORITHM

3.1 Data Representation

Data representation is an encoding which represents a solution to the problem. Each chromosome contains a set of genes, and each gene holds a reference to a document number. In the proposed method, chromosomes have variable lengths which are randomly generated, and the length may change with every operation done on the chromosome. The genes of a chromosome are sorted according to the fitness value of each document: the first document in the list has the highest fitness value and the last document has the lowest [3].

Figure 2. Sample chromosome generated by the algorithm

3.2 Initialization

The first step of the algorithm is to initialize the first generation: a population of 50 chromosomes is randomly generated. This population size is kept until the solution is found. Each chromosome must have at least five unique, randomly selected genes, and its length must not exceed the total number of documents.

3.3 Evaluation

Based on the work of Cao, Xu, and their colleagues, the ranking of documents is of great importance in document retrieval. The documents are ranked according to their relevance to the query, which makes the returned documents more accurate with respect to the user request [4]. The fitness function evaluates whether each chromosome is suited to be a solution. The function is a modified version of the fitness function presented in GAHWM [1]. The added functionality is based on the GA experiments by Chang and Kwok, which improve the function by adding a ranking method on each chromosome, placing the most relevant file at the top of the list.
The added utility function allows the algorithm to modify the fitness value based on its rank. Using this function, the weight of the document decreases as it goes down the list [5].

$$F(c) = \frac{1}{N} \times \sum_{j=1}^{L} \left( f(d_j) \times \sum_{i=j}^{N} \frac{1}{i} \right) \qquad (1)$$

$$f(d_j) = \sum_{i=1}^{K} w_i \qquad (2)$$

$$w_i = \frac{K_i}{K} \times \frac{F_{ij}}{F_j} \times \frac{1}{t_j} \times h_{ij} \times \log\left(\frac{T}{T_i}\right) \times \log\frac{N}{df_i} \qquad (3)$$
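
As a rough sketch of how Equations (1) and (3) might be evaluated, the Python below computes the term weight and the rank-discounted chromosome fitness. It rests on several assumptions that the paper does not state: the function names are hypothetical, L is taken to equal N (the number of documents in the chromosome), the logarithms are natural logarithms, and the document scores f(d_j) are sorted in descending order before the rank discount is applied.

```python
import math


def term_weight(k_i, k, f_ij, f_j, t_j, h_ij, t, t_i, n, df_i):
    """Equation (3): weight of one query term in one document.

    Assumption: natural logarithm; the paper does not name the base.
    """
    return ((k_i / k) * (f_ij / f_j) * (1 / t_j) * h_ij
            * math.log(t / t_i) * math.log(n / df_i))


def rank_weight(j, n):
    """Inner sum of Equation (1): sum of 1/i for i = j..n (j is a 1-based rank)."""
    return sum(1.0 / i for i in range(j, n + 1))


def ranked_fitness(doc_scores):
    """Equation (1), assuming L = N = len(doc_scores).

    `doc_scores` holds the f(d_j) values of the documents in one chromosome.
    """
    ranked = sorted(doc_scores, reverse=True)  # most relevant document first
    n = len(ranked)
    if n == 0:
        return 0.0
    total = sum(score * rank_weight(j, n)
                for j, score in enumerate(ranked, start=1))
    return total / n
```

Because `rank_weight` shrinks as the rank j grows, a document contributes less to F(c) the further down the list it sits, which matches the behaviour described in the text.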

Variable    Description
i           Term in the document
j           Current document
w_i         Weight of term in the document
K           Total number of terms in the user query
N           Total number of documents in the chromosome
t_j         Unique terms in the current document
d_j         Current document in the chromosome
f_ij        Frequency of the term in the document
F_j         Total number of terms in the document
h_ij        Weight of the term in the document
T           Total frequency of terms in the index database
T_i         Total frequency of the current term in the index database
c           Current chromosome
K_i         Frequency of term i in the user query

Table 1: Terms and variables description

3.4 Breeding

To set up the next generation, 10 chromosomes are randomly chosen from the current population, and the two fittest of them are chosen as parents for the next generation. This is repeated until the required number of chromosomes is reached.

3.5 End

Since there is no assurance that the GA will reach the optimal solution in a finite amount of time, the algorithm stops after a specified number of generations [6]. The algorithm returns the chromosome with the highest fitness value in the current population. The fittest chromosome is considered found when its genes no longer change through the end of the execution.

4. OPERATIONS

4.1 Crossover

The algorithm uses a "cut and splice" approach for the crossover operation. Each parent chromosome selected from the current population has its own, separately chosen crossover point. This approach suits the crossover operation because the parent chromosomes have varying lengths, which the operator exploits.
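
The breeding step of Section 3.4 and the cut-and-splice crossover can be sketched as follows. This is an illustrative Python sketch rather than the authors' Java code: `select_parents`, `cut_and_splice` and `dedupe` are hypothetical names, `rng` is assumed to be a `random.Random` instance, each cut point is assumed to leave at least one gene in both the head and the tail, and duplicates are assumed to be resolved by keeping the first occurrence of each gene.

```python
import random


def select_parents(population, fitness, rng, tournament_size=10):
    """Draw `tournament_size` random chromosomes and return the two fittest."""
    contenders = rng.sample(population, tournament_size)
    contenders.sort(key=fitness, reverse=True)
    return contenders[0], contenders[1]


def dedupe(genes):
    """Drop repeated document numbers, keeping the first occurrence."""
    return list(dict.fromkeys(genes))


def cut_and_splice(parent_a, parent_b, rng):
    """Cross two variable-length chromosomes (each needs at least 2 genes).

    Every parent gets its own cut point, so the children may differ in
    length from both parents.
    """
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    # Child 1: head of the first parent joined to the tail of the second.
    child_1 = dedupe(parent_a[:cut_a] + parent_b[cut_b:])
    # Child 2: head of the second parent joined to the tail of the first.
    child_2 = dedupe(parent_b[:cut_b] + parent_a[cut_a:])
    return child_1, child_2
```

Passing the random generator in explicitly, for example `cut_and_splice(a, b, random.Random(42))`, keeps runs reproducible, which helps when comparing fitness functions on the same data.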
Cut and splice is used because the genes have specific positions in the chromosome [7]. The head of the chromosome also plays an important role in the fitness value, as it holds the most relevant genes. Each parent is divided at its crossover point into a head and a tail, and two children are produced for each pair of parents. The first child is produced by joining the head of the first parent to the tail of the second parent, and the second child by joining the head of the second parent to the tail of the first parent. Duplicate genes found in a child chromosome are discarded [7].

4.2 Mutation

For the mutation function, a random gene is selected from a chromosome and removed. A new gene is then randomly generated from the search space and is then examined for
uniqueness. If the new gene does not already appear in the chromosome, it is added to the list of genes; otherwise the process is repeated. Mutation occurs in each child with a probability of 0.05 percent.

4.3 Modifier Operator

Since each solution can contain documents that are significantly unrelated to the user query, a modifier operator is added. This operator removes any document from a chromosome if its fitness value is below a threshold value. The threshold value is equal to eighty percent of the average fitness value of the chromosome.

5. IMPLEMENTATION

The algorithm is developed in the Java programming language. The dataset is a collection of web pages from different universities, taken from the World Wide Web Knowledge Base Project and containing 8276 files. The program is developed on the Windows Vista platform and is executed in the Java Eclipse SDK. Fifty chromosomes are used for every generation, and the chromosomes are populated until generation fifty. The sample query used in the algorithm is the string "topics to be covered in database systems".

6. RESULTS

This section discusses the results of the proposed algorithm compared with GAHWM. The fitness function for GAHWM is

$$F(c) = \sum_{j=0}^{L} \left( f(d_j) \right) \qquad (1)$$

$$f(d_j) = \sum_{i=1}^{K} w_i \qquad (2)$$

$$w_i = \frac{K_i}{K} \times \frac{F_{ij}}{F_j} \times \frac{1}{t_j} \times h_{ij} \times \log\left(\frac{T}{T_i}\right) \times \log\frac{N}{df_i} \qquad (3)$$

The performance of both algorithms is measured in terms of recall and precision. Recall is the share of all documents relevant to the user query that are actually retrieved. Precision is the share of the retrieved documents that are relevant.
Both are formulated as follows:

$$\mathit{Recall} = \frac{\mathit{Relevant\ Retrieved}}{\mathit{Relevant}}$$

$$\mathit{Precision} = \frac{\mathit{Relevant\ Retrieved}}{\mathit{Retrieved}}$$

A document is said to be relevant if it contains a number of terms greater than or equal to the terms in the user query.
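
The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The short Python sketch below is only an illustration: the function name `precision_recall` and the convention of returning 0.0 when a denominator is empty are assumptions, not part of the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumption: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, retrieving documents 1 to 4 when documents 2, 4 and 6 are relevant gives two hits, so precision is 2/4 and recall is 2/3. A system that returns nearly every file can therefore reach high recall with very low precision.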

                    GAHWM                 GAHWM with ranking
8276 Files (Chromosomes of length 8276)
Execution Time      41.3548078 seconds    25.6544668 seconds
Recall              0.99                  0.78
Precision           0.0113                1.0
4512 Files (Chromosomes of length 4512)
Execution Time      18.2564585 seconds    11.4751356 seconds
Recall              0.96                  0.63
Precision           0.0189                1.0

Table 2: Comparative Results of the fitness functions used in the algorithm

7. RECOMMENDATION

Further improvements to the fitness function may be made, such as adding proximity parameters. Such a parameter would measure the distance between the terms in a document and may contribute greatly to evaluating the fitness value. For real-life datasets, the name of the file and the URL may also hold a high value for the fitness. Further tests on different parameter values in the algorithm, such as the size of the population and the number of parents for selection, may also improve the search performance, as different values may lead to different results.

8. CONCLUSION

This paper introduces a new method of evaluating the fitness value of an HTML document. A genetic algorithm is used as the search method, and the new method uses a ranking system to determine the relevant files. The solution shows a noticeable improvement in precision, returning only files that are relevant to the user query, and an improvement in execution time as well. The recall of the new method is lower than that of GAHWM, but it is still able to return a good number of documents. Based on these results, the new fitness function performs better than GAHWM in precision and execution time, at the cost of lower recall.

9. REFERENCES

[1] Al-Dallal, A., & Shaker, R. (2009a). Genetic algorithm in web search using inverted index representation. Proceedings from the 5th IEEE GCC Conference & Exhibition. Kuwait: GCC.
[2] Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. Inverted Indexes for Phrases and Strings. SIGIR'11, July 24-28, 2011, Beijing, China.
[3] Al-Dallal, A., & Shaker, R. (2009b). Genetic algorithm based mining for HTML document. Retrieved from http://wwwis.win.tue.nl/bnaic2009/papers/junk/bnaic2009_submission_87.pdf
[4] Cao, Y., Xu, J., Liu, T-Y., Li, H., Huang, Y., & Hon, H-W. (2006). Adapting Ranking SVM to Document Retrieval. Retrieved from http://research.microsoft.com/en-us/people/tyliu/cao-et-al-sigir2006.pdf
[5] Fan, W., Fox, E., Pathak, P., & Wu, H. (2004a). The effects of fitness functions on genetic programming-based ranking discovery for web. Journal of the American Society for Information Science and Technology, 55 (7), 628-636.
[6] Bokar, P., & Patil, L. (2013). Web information retrieval using genetic algorithm-particle swarm optimization. International Journal of Future Computer and Communication, 2 (6), 595-599.
[7] Hutt, B., & Warwick, K. (2003). Synapsing Variable Length Crossover: Biologically Inspired Crossover for Variable Length Genomes. Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Roanne, France. 198-215.
[8] Mashagbal, E., Mashagbal, F., & Nassar, M. (2011). Query optimization using genetic algorithms in the vector space model. International Journal of Computer Science, Issue 8 (3), 457-450.
[9] Vizine, André., de Castro, L., & Gudwin, R. (2005). An Evolutionary Algorithm to Optimize Web Document Retrieval. Retrieved from http://www.dca.fee.unicamp.br/~gudwin/ftp/publications/Kimas05-2.pdf
[10] Sathya, S., & Simon, P. (2009). Review on Applicability of Genetic Algorithm to Web Search. International Journal of Computer Theory and Engineering, Vol. 1, (4), 450-455.
[11] Joshi, A., & Todwal, S. (2003). Evolutionary Machine Learning for Web Mining. Retrieved from http://www-cs-students.stanford.edu/~amrutaj/work/papers/tencon03.pdf
[12] Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, "Research on Classification Algorithms and its Impact on Web Mining", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495 - 504, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[13] Priti Bhardwaj and Rahul Johari, "Routing in Delay Tolerant Network using Genetic Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 2, 2013, pp. 590 - 597, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[14] Mousmi Chaurasia and Dr. Sushil Kumar, "Natural Language Processing Based Information Retrieval for the Purpose of Author Identification", International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010, pp. 45 - 54, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
[15] Prakasha S, Shashidhar Hr and Dr. G T Raju, "A Survey on Various Architectures, Models and Methodologies for Information Retrieval", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print: 0976-6367, ISSN Online: 0976-6375.