ACEEE Int. J. on Network Security, Vol. 01, No. 03, Dec 2010

An Advanced Optimal Web Intelligent Model for Mining Web User Usage Behavior using Genetic Algorithm


With the continued growth and proliferation of Web services
and Web-based information systems, the volume of user data
has reached astronomical proportions. Analyzing such data
with Web Usage Mining can help to determine the visiting
interests or needs of web users. This type of analysis
involves the automatic discovery of meaningful patterns,
representing the fine-grained navigational behavior of
visitors, from a large collection of semi-structured web log
data. Because web logs are non-linear and complex, the
existing conventional mining techniques fail in the pattern
discovery process and yield only locally optimal solutions.
Obtaining globally optimal solutions requires web
intelligent models such as Genetic Algorithms.
The present paper introduces the Advanced Optimal Web
Intelligent Model, IOG, which builds on the granular
computing nature of Genetic Algorithms. The IOG model is
designed to work on semantically enhanced content data, on
which it works more efficiently than on normal data. The
parameters of the IOG fitness function generate balanced
weights that represent the significance of characteristics
and yield page vectors with dissimilar characteristics. The
evaluation function of IOG improves page quality and limits
the execution time to a specified number of iterations,
since it considers practical measures such as page, link
and mean quality. The genetic operators are intended to
capture the stickiness among the characteristics of a page
vector. By integrating all of the above, the web intelligent
model IOG moves towards intelligence and significantly
improves the investigation of web user usage behavior.


Prof. V.V.R. Maheswara Rao (1), Dr. V. Valli Kumari (2) and Dr. K.V.S.V.N Raju (2)
(1) Shri Vishnu Engineering College for Women, Bhimavaram, India. Email: mahesh_vvr@yahoo.com
(2) College of Engineering, Andhra University, Visakhapatnam, India. Email: vallikumari@gmail.com, kvsvn.raju@gmail.com
© 2010 ACEEE. DOI: 01.IJNS.01.03.220

Index Terms: Web Usage Mining, Genetic Algorithm, Pattern Discovery

I. INTRODUCTION

Over the last decade, with the continued increase in the usage of the WWW, web mining has been established as an important area of research. Whenever web users visit the WWW, they leave abundant information in the web log, which is structurally complex, heterogeneous, high dimensional and incremental in nature. Analyzing such data can help to determine the browsing interests of web users. To do this, web usage mining focuses on investigating the potential knowledge in the browsing patterns of users and on finding the correlations between the pages under analysis. The main goal of web usage mining is to capture, model and analyze the web log data in such a way that the usage behavior of web users is discovered automatically.

The overall web usage mining process can be divided into four interdependent stages, as shown in Fig. 1: data collection, pre-processing, pattern discovery and pattern analysis.

Fig. 1. Web mining subtasks.

Due to the dynamic and complex nature of web log data, existing systems have difficulty handling the newly emerging problems in all phases of web usage mining, in particular pattern discovery. To proceed towards web intelligence and reduce the need for human intervention, it is necessary to incorporate and embed artificial intelligence into web mining tools. Soft computing methodologies seem to be good candidates for achieving this intelligence.

Genetic algorithms are examples of evolutionary computing methods and are optimization-type algorithms for both supervised and unsupervised techniques. A Genetic Algorithm (GA) is an elegantly simple, yet extremely powerful, way to predict and describe complex objective functions, such as fitness and evaluation functions, in a dynamic environment. Moreover, a GA employs a set of operators that mimic the concept of survival of the fittest by regenerating recombinations of the algorithm in response to a calculated difference between desired solution states.

The goal of the present paper is to deploy the Intelligent Optimal Genetic Algorithm, IOG, in order to find optimized solutions for investigating web user usage behavior. The important task in any intelligent mining application is the preparation of a suitable target data set to which mining and statistical algorithms can be applied. The present IOG research framework therefore concentrates first on modelling the web log data, because of its complex nature. The main thrust of the data modelling is to generate semantically enhanced content data. The integration of semantic content with IOG can potentially
provide a better understanding of the underlying relationships among the pages.

The IOG fitness function is designed in such a way that it generates dissimilar page vectors, each of which has similar characteristics. In addition, the IOG evaluation function is designed to identify the quality of the optimized population, as the population-averaged mean for page vectors, by considering both the link quality and the page quality of each vector. The IOG framework uses the fitness, operator and evaluation functions collectively, which helps significantly in investigating web user usage behavior.

The present paper is organized as follows. Related work is described in Section 2. Section 3 introduces an overview of the proposed work. Section 4 presents the experimental analysis. Section 5 draws the conclusions. Finally, Section 6 contains the acknowledgements.

II. RELATED WORK

Over the past two decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. Analyzing biological sequence data is an interdisciplinary effort involving Biology, Computer Science, Mathematics and Statistics.

Genetic algorithms have been shown to be an effective tool for data mining and pattern discovery [3], [4], [5], [6], [7]. Sankar K. Pal and Varun Talwar [1] introduced several soft computing methodologies and explicitly expressed the relevance of genetic algorithms in the web log scenario. Minaei-Bidgoli and Punch [2] used genetic algorithm optimization through feature selection. In this method, the error rate of the data mining techniques is better under local optimization than under global optimization.

The present IOG framework concentrates on assigning weights to the characteristics of the initial population, which minimizes the error rate in identifying the relations among pages.

Data analysis tools used earlier in web mining were mainly based on statistical techniques such as regression and estimation. Recently, GAs have been gaining the attention of researchers for solving certain web mining problems that require handling large data sets in a computationally efficient manner.

III. OVERVIEW OF THE PROPOSED WORK

Web usage mining is the task of applying data mining techniques to extract useful patterns from web log data in various stages, as shown in Fig. 2. These patterns can be used to investigate interesting characteristics of web users. Because web log data is generally complex and grows non-linearly, web miners need soft computing tools such as Genetic Algorithms in order to obtain globally optimal information intelligently. A GA employs a set of operators that mimic the concept of survival of the fittest by regenerating recombinations of the algorithm in response to a calculated difference between desired solution states over a period of time.

Fig. 2. Web Usage Mining Architecture

To investigate the usage behavior of web users, one has to extract the potential patterns from the sessions generated by the pre-processing stage of the web log. The data cannot be given directly as input to the genetic algorithm; the pre-processed session data therefore has to be modeled in a form suitable for the genetic environment. The data modelling generates semantically enhanced content data, which improves the performance of the GA significantly.

3.1. Web log Data Modeling for IOG:

Web log pre-processing results in a set of n pageviews, P = {p1, p2, ···, pn}, and a set of m user transactions, T = {t1, t2, ···, tm}, where each ti in T is a subset of P. Pageviews are semantically meaningful entities to which mining tasks are applied. Each transaction t can be conceptually viewed as an l-length sequence of ordered pairs

$t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle$    (1)

where each $p_i^t = p_j$ for some j in {1, 2, ···, n}, and $w(p_i^t)$ is the weight associated with pageview $p_i^t$ in transaction t, representing its significance. In web usage mining tasks the weights are either binary, representing the existence or non-existence of a pageview in the transaction, or a function of the duration of the pageview in the user's session. In the case of time durations, it should be noted that the time spent by a user on the last pageview in the session is usually not available. Hence, in the proposed framework, weights are assigned using the binary representation.
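As an illustration of Eq. (1), the following minimal sketch turns a pre-processed session into a sequence of (pageview, weight) pairs, using either binary or duration-based weights. The function name `to_transaction`, the session format (a list of `(page, seconds)` tuples) and the sample data are assumptions made for this example; they are not taken from the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical session format: ordered (pageview id, seconds spent) pairs.
# The duration of the last pageview is None because it is usually
# not available from the web log.
Session = List[Tuple[str, Optional[float]]]


def to_transaction(session: Session, binary: bool = True) -> List[Tuple[str, float]]:
    """Return the l-length sequence of (pageview, weight) pairs of Eq. (1)."""
    transaction = []
    for page, seconds in session:
        if binary:
            weight = 1.0  # the pageview exists in the transaction
        else:
            # Duration weighting; an unknown duration is set to 0.0 here,
            # which is an illustrative choice, not a rule from the paper.
            weight = float(seconds) if seconds is not None else 0.0
        transaction.append((page, weight))
    return transaction


session = [("A", 15.0), ("B", 5.0), ("D", None)]
binary_t = to_transaction(session)
timed_t = to_transaction(session, binary=False)
print(binary_t)
print(timed_t)
```

With binary weights every visited pageview contributes equally, which sidesteps the missing duration of the last pageview.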
For association rule mining, the ordering of pageviews in a transaction is not relevant, so one can represent each user transaction as a vector over the n-dimensional space of pageviews. The transaction vector t (a bold-face lower-case letter represents a vector) is given by:

$\mathbf{t} = \langle w_{p_1}^t, w_{p_2}^t, \ldots, w_{p_n}^t \rangle$    (2)

where each $w_{p_i}^t = w(p_j^t)$ for some j in {1, 2, ···, n}, if $p_i$ appears in the transaction t, and $w_{p_i}^t = 0$ otherwise. Thus, conceptually, the set of all user transactions can be viewed as an m×n User-PageView Matrix, denoted by UPVM. An example of a hypothetical user-pageview matrix is depicted in Fig. 3. In this example, the weight for each pageview is the amount of time (e.g., in seconds) that a particular user spent on the pageview.

Fig. 3. An example of a hypothetical UPVM

Association or sequential pattern mining techniques can be applied to a UPVM such as the one in this example to obtain patterns, and these patterns are in turn used to find important relationships among pages based on the navigational patterns of users in the site.

As noted earlier, it is also possible to integrate other sources of knowledge, such as semantic information from the content of web pages, with the web usage mining process. Generally, the characteristics extracted from the content of web pages reflect the behavior of the web user. Each pageview p can be represented as an r-dimensional characteristics vector, where r is the total number of characteristics extracted from the site into a global dictionary. The vector, denoted by p, can be given by:

$\mathbf{p} = \langle w_p(c_1), w_p(c_2), \ldots, w_p(c_r) \rangle$    (3)

where $w_p(c_j)$ is the weight of the j-th characteristic in pageview p, for 1 ≤ j ≤ r. For the whole collection of pageviews in the site, one can represent an n×r PageView-Characteristic Matrix PVCM = {p_1, p_2, …, p_n}.

The integration process involves the transformation of user transactions in the UPVM into "content-enhanced" transactions containing the semantic characteristics of the pageviews. The goal of such a transformation is to represent each user session as a vector of characteristics rather than as a vector over pageviews. In this way, a user's session reflects not only the pages visited, but also the significance of the various characteristics that are relevant to the user's interaction. While, in practice, there are several ways to accomplish this transformation, the most direct approach involves mapping each pageview in a transaction to one or more content characteristics. The range of this mapping can be the set of characteristics. Conceptually, the transformation can be viewed as the multiplication of the UPVM with the PVCM. The result is a new matrix TCM = {t'_1, t'_2, …, t'_m}, where each t'_i is an r-dimensional vector over the set of characteristics. Thus, a user transaction can be represented as a content characteristics vector, reflecting the user's interests.

As an example of content-enhanced transactions, consider Fig. 4, which shows a hypothetical matrix of user sessions (UPVM) as well as an index for the corresponding Web site, conceptually represented as a term-pageview matrix (TPVM). Note that the transpose of this TPVM is the pageview-characteristic matrix (PVCM). The UPVM simply reflects the pages visited by users in various sessions. The TPVM, on the other hand, represents the concepts that appear in each page. For simplicity, the weights are assumed to be binary.

Fig. 4 a. Example of a UPVM
Fig. 4 b. Example of a TPVM

The corresponding Characteristics-Enhanced Transaction Matrix CETM (derived by multiplying the UPVM and the transpose of the TPVM) is depicted in Fig. 5. The resulting data is well suited to serve as the initial population of the proposed IOG.
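The multiplication of the UPVM with the transpose of the TPVM can be sketched in a few lines of plain Python. The matrices below are made-up binary examples (they are not the values of Fig. 4), and the helper names `transpose` and `matmul` are introduced only for this illustration.

```python
from typing import List

Matrix = List[List[int]]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns (TPVM -> PVCM)."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a is m x n, b is n x r, the result is m x r."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical UPVM: 3 user sessions (rows) over 4 pageviews A-D (columns).
upvm = [
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]

# Hypothetical TPVM: 3 characteristics (rows) over the same 4 pageviews.
tpvm = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]

pvcm = transpose(tpvm)     # 4 pageviews x 3 characteristics
cetm = matmul(upvm, pvcm)  # 3 sessions x 3 characteristics

for row in cetm:
    print(row)
```

Each row of `cetm` is one user session expressed over characteristics; with binary inputs, an entry counts how many of the visited pages carry that characteristic.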
Fig. 5. The Characteristics-Enhanced Transaction Matrix from the matrices of Fig. 4

3.2. Intelligent Optimal Genetic Algorithm IOG:

GAs are search algorithms that start from an initial collection of pages (a population) representing possible solutions to the problem. Each set of page characteristics is called a chromosome and has an associated value, called the fitness function (ff), that contributes to the generation of new populations by means of genetic operators. Every position in a chromosome is called a gene, and its value is called the allelic value. This value may vary over an assigned allelic alphabet. Often, the allelic alphabet is {0, 1}. At each generation, the algorithm uses the fitness function values to evaluate the survival capacity of each set of page characteristics, using simple operators to create a new set of artificial creatures (a new population) that tries to improve on the current ff values by using pieces of the old ones. Evolution is interrupted when no significant improvement of the fitness function can be obtained.

3.2.1. Proposed IOG Genetic Algorithm:

01: Define the Evaluation function F.
02: t ← 0 (iteration No = 0, pop size = 0).
03: Initialize P(t).
04: Evaluate P(t) (pages from the modeled data).
05: Generate an offspring page O.
06: t ← t+1 (new population).
07: Select P(t) from P(t-1).
08: Crossover P(t).
09: Mutation P(t).
10: Evaluate P(t).
11: Go To 5 (while the termination condition, the number of iterations, is not met).
12: Sort P(t) (sort the pages given to the user in descending order according to their support).
13: End Condition: P(t) gives the output to the user.

3.2.2. IOG – Fitness Function:

In the fitness function, dc indicates the number of characteristics included in each chromosome and nc indicates the number of chromosomes. The population thus contains np = dc · nc pages. The fitness function computed for each chromosome is expressed as a positive value that is higher for "better" chromosomes and is thus to be maximized. It is composed of three terms. The first term is the sum of the scores of the pages in chromosome C, i.e.,

$t_1(C) = \sum_{p_i \in C} score(p_i)$

where score(p_i) is the original score given to page p_i as previously described. This term considers the positive effect of having as many pages with as high a score as possible in a chromosome, but it also rewards chromosomes with many pages. This drawback is balanced by the second term of the fitness function. Let ID be the ideal dimension of a chromosome; the ratio

$t_2(C) = \frac{np}{\mathrm{abs}(|C| - ID) + 1}$

constitutes the second term of the fitness function. It reaches its maximum np when the dimension of C is exactly equal to the ideal dimension ID, and it decreases rapidly when the number of pages contained in chromosome C is less than or greater than ID.

The chromosomes present in the initial population are characterized by the highest possible variability with respect to the page vectors to which their pages belong. The evolution of the population may alter this characteristic, creating chromosomes with high fitness in which the pages belong to the same vector and are very similar to each other. Moreover, it may not be guaranteed that pages belonging to different vectors are different in the vectorized space, as this depends both on the nature of the data and on the quality of the initial vectorization process. For this reason, a third term of the fitness function is introduced, which directly measures the overall dissimilarity of the pages in the chromosome. Let D(p_i, p_j) be the Euclidean distance between the vectors representing pages p_i and p_j. Then

$t_3(C) = \sum_{p_i, p_j \in C,\ i \neq j} D(p_i, p_j)$

is the sum of the distances between the pairs of pages in chromosome C and measures the total variability expressed by C. The final form of the fitness function for chromosome C is then

$ff(C) = \alpha \cdot t_1(C) + \beta \cdot t_2(C) + \gamma \cdot t_3(C)$

where α, β and γ are parameters that depend on the magnitude of the initial score and of the vectors that represent the pages. In particular, α, β and γ are chosen so that the contributions given by t_1(C), t_2(C) and t_3(C) are balanced. Additionally, they may be tuned to express the relevance attributed to the different aspects represented by the three terms. The goal of the GA is to find, by means of the genetic operators, a chromosome C* such that

$ff(C^*) = \max_{i = 1, \ldots, nc} ff(C_i)$

3.2.3. IOG Evaluation Function:

The fitness function ff that evaluates web pages is a mathematical formulation of the user query and numerous evaluation functions. First, let us define the following in their simplest forms for practical considerations.
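Before those definitions, the three-term fitness function of Section 3.2.2 and the loop of Section 3.2.1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: unit values for α, β and γ, counting each unordered pair once in t3, fitness-proportional selection, one-point crossover, single-gene mutation, an even population size, keeping the best chromosome seen so far, and all names and sample data are choices made here for illustration.

```python
import math
import random
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Chromosome = List[str]  # a chromosome is a list of page ids


def fitness(chrom: Chromosome, score: Dict[str, float],
            vec: Dict[str, Sequence[float]], ideal_dim: int, n_pages: int,
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """ff(C) = alpha*t1(C) + beta*t2(C) + gamma*t3(C)."""
    t1 = sum(score[p] for p in chrom)                  # total page score
    t2 = n_pages / (abs(len(chrom) - ideal_dim) + 1)   # peaks when |C| == ID
    # Assumption: each unordered pair of pages is counted once.
    t3 = sum(math.dist(vec[a], vec[b]) for a, b in combinations(chrom, 2))
    return alpha * t1 + beta * t2 + gamma * t3


def evolve(pages: List[str], ff: Callable[[Chromosome], float], dim: int = 2,
           pop_size: int = 6, iterations: int = 30, p_cross: float = 0.5,
           p_mut: float = 0.1, seed: int = 0) -> Chromosome:
    """Select, cross over, mutate and re-evaluate for a fixed iteration budget."""
    rng = random.Random(seed)
    pop = [rng.sample(pages, dim) for _ in range(pop_size)]  # initial population
    best = max(pop, key=ff)
    for _ in range(iterations):
        # Selection: parents drawn with probability proportional to fitness.
        parents = rng.choices(pop, weights=[ff(c) for c in pop], k=pop_size)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):  # pop_size assumed even
            a, b = list(a), list(b)
            if dim > 1 and rng.random() < p_cross:      # one-point crossover
                cut = rng.randrange(1, dim)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            offspring += [a, b]
        for child in offspring:                         # occasional mutation
            if rng.random() < p_mut:
                child[rng.randrange(dim)] = rng.choice(pages)
        pop = offspring
        best = max(pop + [best], key=ff)                # keep the best seen so far
    return best


# Hypothetical scores and 2-dimensional page vectors.
score = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.5}
vec = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 1.0), "D": (1.0, 0.0)}


def ff(chrom: Chromosome) -> float:
    return fitness(chrom, score, vec, ideal_dim=2, n_pages=len(score))


example = ff(["A", "B"])  # t1 = 3, t2 = 4, t3 = 5
best = evolve(list(score), ff)
print(example, best, ff(best))
```

Because the second term depends only on the chromosome length, a fixed-length encoding like the one above makes it constant; it starts to matter once chromosomes of varying length are allowed.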
1) Link quality F(L)

F(L) = # …    (4)

2) Page quality F(P)

F(P) = …    (5)

where m is the total number of links per page.

3) Mean quality function Mq

$M_q = \frac{F_{max}(P) + F_{min}(P)}{2}$    (6)

where Fmax(P) and Fmin(P) are, respectively, the maximum and minimum values of the page qualities after applying the GA. It should be noted that the upper value of Fmax(P) is m*n, and the least value of Fmin(P) is zero. Hence, the upper limit of Mq is (m*n)/2. Application of the GA to web pages will increase some page qualities and decrease others.

3.2.4. IOG – Operators:

The genetic algorithm uses crossover and mutation operators to generate the offspring of the existing population. Before the genetic operators are applied, parents are selected for evolution to the next generation. The crossover and mutation operators are then used to produce the next generation. The probability of deploying the crossover and mutation operators can be changed. The genetic operators work iteratively and are:
• Reproduction, where individual characteristics are copied according to their fitness function values (the higher the value of a characteristic, the higher its probability of contributing to one or more offspring in the next generation)
• Crossover, in which the members reproduced in the new mating pool are mated randomly and, afterward, each pair of characteristics undergoes a cross change
• Mutation, which is an occasional random alteration of the allelic value of a chromosome that occurs with small probability

3.2.5. IOG – End condition:

The GA needs an end condition to end the generation process. If no sufficient improvement is found in two or more consecutive generations, one can stop the GA process. In other cases, one can use a time limit as the criterion for ending the process.

IV. EXPERIMENTAL STUDY

The proposed IOG has been implemented in a standard environment. A number of dissimilar characteristic vectors are given as input to the IOG algorithm. These vectors are generated by the web log data modeling of IOG, considering a web log that consists of more than 3000 pages.

A) The IOG model is compared with the standard web mining techniques with respect to population mean quality. The experimental results indicate a noticeable improvement in IOG performance over the standard mining techniques, as shown in Fig. 6.

Fig. 6. Comparison of population quality of GA w.r.t. standard DM technique (mean quality vs. number of iterations; series: Standard DM Technique, GA Results)

B) The IOG model is compared with the standard web mining techniques with respect to execution time. The experimental results show that the IOG takes less time than the standard mining techniques, as shown in Fig. 7.

Fig. 7. Comparison of execution of GA w.r.t. standard DM technique

C) The IOG was run for different crossover probabilities (Pc = 0, 0.25, 0.5, 0.75, 1). The averaged mean quality values are tabulated for ten different enhanced contents, as shown in Table 1.

Table 1. The values of averaged mean quality

D) The IOG model was run for different crossover probabilities (Pc = 0, 0.25, 0.5, 0.75, 1) and for different numbers of iterations. It shows high performance at the average mean Pc, as in Fig. 8.
Fig. 8. Population-averaged mean quality for different numbers of iterations at 500 pages (mean quality vs. number of iterations; series: Pc = 0, 0.25, 0.5, 0.75, 1)

E) The IOG was run for different crossover probabilities (Pc = 0, 0.25, 0.5, 0.75, 1) and for different numbers of iterations. The IOG executes in significantly less time at the average mean Pc, as shown in Fig. 9.

Fig. 9. Execution of GA for different numbers of iterations at 500 pages

F) In addition, when the standard analysis algorithms are applied to the collective output (desired patterns) generated by IOG, the web user usage interests are identified, as shown in Fig. 10.

Fig. 10. Analysis of web user usage interests

V. CONCLUSION

The present model has shown experimentally the relevance of web intelligent methodologies in the identification of the desired patterns. Associating the balanced weights generated by the IOG model worked significantly better than the technique of characteristic selection. According to the results, the proposed technique is more effective in improving the page quality and mean quality and in reducing the execution time. Hence the proposed IOG model helps to investigate the usage behavior of web users in any application domain.

VI. ACKNOWLEDGEMENTS

The authors record their acknowledgements to the authorities of Shri Vishnu Engineering College for Women, Bhimavaram, Andhra University, Visakhapatnam, and Acharya Nagarjuna University, Guntur, for their constant support and cooperation.

REFERENCES

[1] Sankar K. Pal, Varun Talwar, "Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions", IEEE Transactions on Neural Networks, vol. 13, no. 5, September 2002.
[2] Behrouz Minaei-Bidgoli, William F. Punch, "Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System".
[3] Freitas, A.A., "A survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery". See: ww.pgia.pucpr.br/~alex/papers. A chapter of: A. Ghosh and S. Tsutsui (Eds.), "Advances in Evolutionary Computation", Springer-Verlag, 2002.
[4] Jain, A. K., Zongker, D., "Feature Selection: Evaluation, Application, and Small Sample Performance", IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, Feb. 1997.
[5] Falkenauer, E., "Genetic Algorithms and Grouping Problems", John Wiley & Sons, 1998.
[6] Pei, M., Goodman, E.D., and Punch, W.F., "Pattern Discovery from Data Using Genetic Algorithms", Proceeding of 1st Pacific-Asia Conference Knowledge Discovery & Data Mining (PAKDD-97).
[7] Park, Y. and Song, M., "A genetic algorithm for clustering problems", Genetic Programming 1998: Proceeding of 3rd Annual Conference, 568-575, Morgan Kaufmann, 1998.
[8] De Jong, K.A., Spears, W.M. and Gordon, D.F., "Using genetic algorithms for concept learning", Machine Learning 13, 161-188, 1993.
[9] M. Eirinaki and M. Vazirgiannis, "Web mining for web personalization", ACM Transactions on Internet Technology (TOIT), 3(1):1-27, 2003.
[10] T. Kamdar, "Creating Adaptive Web Servers Using Incremental Weblog Mining", masters thesis, Computer Science Dept., Univ. of Maryland, Baltimore, C0–1, 2001.
[11] Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 1999.
[12] M. S. Chen, J. S. Park and P. S. Yu, "Efficient Data Mining for Path Traversal Patterns in a Web Environment", IEEE Transaction on Knowledge and Data Engineering, 1998.
[13] Wang Shachi, Zhao Chengmou, "Study on Enterprise Competitive Intelligence System" [J], Science Technology and Industrial, 2005.
[14] M. Craven, S. Slattery and K. Nigam, "First-Order Learning for Web Mining", In Proceedings of the 10th European Conference on Machine Learning, Chemnitz, 1998.
[15] Tsuyoshi, M. and Saito, K., "Extracting User's Interest for Web Log Data", Proceeding of IEEE/ACM/WIC International Conference on Web Intelligence (WI'06), 2006.
