USING FEEDBACK AND K-MEANS CLUSTERING TO REFINE WEB DATA

Abstract

Nowadays more and more web sites are being developed, and users searching the web often cannot find the accurate data they require. Web mining is commonly carried out with page ranking algorithms, among many others. In this paper, users refine web pages by giving feedback or a rating, either manually or automatically. K-means clustering is a basic and widely used algorithm. We propose a genetic algorithm to improve cluster quality and obtain more accurate clusters. We also apply the approach to web logs for further refinement. Web mining using feedback eliminates unwanted sites from the web and also helps developers improve their sites using the user data.

KEY WORDS: Web mining, clustering, k-means, web logs.

1. Introduction

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.

There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall into this category. Web structure mining is the process of inferring knowledge from the World Wide Web organization and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.

We can broadly categorize Web data clustering into (i) users' sessions-based and (ii) link-based. The former uses the Web log data and tries to group together a set of users' navigation sessions having similar characteristics. In this framework, Web-log data provide information about the activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it. The records of users' actions within a Web site are stored in a log file. Each record in the log file contains the client's IP address, the date and time the request is received, the requested object and some additional information, such as the protocol of the request and the size of the object. Figure 1 presents a sample of a Web access log file from a Web server.

Figure 1: A sample of Web Server Log File

    220.127.116.11 [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497
    query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325
    tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014
    wpbfl2-45.gate.net [29:23:54:15] "GET / HTTP/1.0" 200 4889
    wpbfl2-45.gate.net [29:23:54:16] "GET /icons/circle_logo_small.gif HTTP/1.0" 200 2624
    wpbfl2-45.gate.net [29:23:54:18] "GET

The standard K-Means algorithm was used to cluster users' traversal paths. However, it is not clear how the similarity measure was devised and whether the clusters are meaningful. Associations and sequential patterns between web transactions are discovered based on the Apriori algorithm (Agrawal and Srikant). A good survey on clustering algorithms can be found in Xu and Wunsch. The k-means algorithm is one of the most widely used clustering algorithms. The algorithm partitions the data points (objects) into k groups (clusters), so as to minimize the sum of the squared distances between the data points and the center (mean) of their clusters.

To apply the k-means algorithm:

 • Choose k data points to initialize the clusters.
 • For each data point, find the nearest cluster center and assign that data point to the corresponding cluster.
 • Update the cluster center of each cluster using the mean of the data points which are assigned to that cluster.
 • Repeat steps 2 and 3 until there are no more changes in the values of the means.

In spite of its simplicity, the k-means algorithm involves a very large number of nearest neighbor queries. The high time complexity of the k-means algorithm makes it impractical for use when the data set contains a large number of points. Reducing the large number of nearest neighbor queries in the algorithm can accelerate it. In addition, the number of distance calculations increases exponentially with the increase of the dimensionality of the data.

Many algorithms have been proposed to accelerate the k-means. The use of kd-trees is suggested to accelerate the k-means (Kanungo et al.). However, backtracking is required, a case in which the computational complexity is increased. Kd-trees are not efficient for higher dimensions. Furthermore, it is not guaranteed that an exact match of the nearest neighbor can be found unless some extra search is done, as discussed in the literature. Elkan suggests the use of the triangle inequality to accelerate the k-means. The use of R-Trees has also been suggested. Nevertheless, R-Trees may not be appropriate for higher dimensional problems. The Partial Distance (PD) algorithm has been proposed (Cheng et al.). The algorithm allows early termination of the distance calculation by introducing a premature exit condition in the search process.

As seen in the literature, researchers have contributed only to accelerating the algorithm; there is no contribution to cluster refinement. In this study, we propose a new algorithm to improve the k-means clustering in web usage data mining.
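The four k-means steps listed in the introduction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `squared_distance` and `kmeans`, the `max_iter` and `seed` parameters, and the use of squared Euclidean distance on plain numeric tuples are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means (hypothetical helper); returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: choose k data points to initialize the cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once the means no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, 2))
```

Every pass over the data recomputes the distance from each point to each of the k centers, which is the large number of nearest neighbor queries that the acceleration methods discussed above try to reduce.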
The proposed algorithm consists of two steps. In the first step, to avoid local minima, we present a simple and efficient method to select the initial centroids based on the mode value of the data vector, and the k-means algorithm is then applied to cluster the data vectors. In the second step, a Genetic Algorithm (GA) is applied to refine the clusters and improve the quality of the clusters of users' sessions.

The paper is organized as follows: the following section defines the web access logs. Section 3 presents the standard k-means algorithm. Section 4 presents the proposed cluster refinement algorithm, which uses a Genetic Algorithm (GA) to improve the users' session clusters. Section 5 presents the experiments and results, and Section 6 concludes the work.

2. Web Access Logs:

Basic access logs

In general, web server log records consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes.

Modified access logs

The modified web server log records consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes, (vii) Rating or feedback.

The last field holds the rating of the site, which indicates whether or not the site is useful for the user's requirements. This makes it helpful for the refinement of web data. The field can also be filled in automatically by a program.

Rating sites typically show a series of images (or other content) in random fashion, or chosen by a computer algorithm, rather than allowing users to choose. They then ask users for a rating or assessment, which is generally done quickly and without great deliberation. Users score items on a scale of 1 to 10, or with yes or no. Others, such as BabeVsBabe.com, ask users to choose between two pictures. Typically, the site gives instant feedback in terms of the item's running score, or the percentage of other users who agree with the assessment. They sometimes offer aggregate statistics or "best" and "worst" lists. Most allow users to submit their own image, sample, or other relevant content for others to rate. Some require the submission as a condition of membership.

3. Standard K-Means Algorithm

One of the most popular clustering techniques is the k-means clustering algorithm. Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e., the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose centre is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strength of k-means is its runtime, which is linear in the number of data elements, and its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, we run k-means using the correct cluster number.

1. Choose a number of clusters K.
2. Initialize cluster centers n1, ..., nk.
   a. Could pick k data points and set the cluster centers to these points.
   b. Or could randomly assign points to clusters and take the means of the clusters.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Stop when there are no new re-assignments.

4. Genetic Algorithm

The initial cluster centers are normally chosen either sequentially or randomly, as given in the standard algorithm. The quality of the final clusters depends on these initial seeds, and a poor choice may lead to a local minimum; this is one of the disadvantages of k-means clustering. To avoid this, in our method we select the modes of the data vector as the initial cluster centers. Based on the number of clusters, the modes are selected one after another. Initially the first mode value is selected as the center for the first cluster, and the next most frequently occurring value (the next mode value) is assigned as the center for the next cluster.

Genetic algorithms (GA) are randomized search and optimization techniques guided by the principles of evolution and natural genetics, having a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.

In this algorithm, the search space is encoded in the form of strings (called chromosomes). The basic reason for our refinement is that in any clustering algorithm the obtained clusters will never give us 100% quality. There will be some errors, known as misclustering. That is, a data item can be wrongly clustered. These kinds of errors can be reduced by using our refinement algorithm. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job shop scheduling, etc.

The clusters obtained from the improved k-means clustering are taken as input to our refinement algorithm. Initially a random point is selected from each cluster; with these points a chromosome is built. In this way an initial population of 10 chromosomes is built. For each chromosome the entropy is calculated as the fitness value, and the global minimum is extracted. With this initial population, the genetic operators reproduction, crossover and mutation are applied to produce a new population. When the crossover operator is applied, the cluster points get shuffled, which means that a point can move from one cluster to another. From this new population, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum and the next iteration continues with the new population. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations.

The following section shows that our refinement algorithm improves the cluster quality. The algorithm is given as:

1. Choose a number of clusters k.
2. Initialize cluster centers n1, ..., nk based on the mode.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Stop when there are no new re-assignments.
6. GA based refinement
   a. Construct the initial population (p1).
   b. Calculate the global minimum (Gmin).
   c. For i = 1 to N do
      i. Perform reproduction.
      ii. Apply the crossover operator between each pair of parents.
      iii. Perform mutation and get the new population (p2).
      iv. Calculate the local minimum (Lmin).
      v. If Lmin < Gmin then
         a. Gmin = Lmin;
         b. p1 = p2;
   d. Repeat step c until N iterations are completed.

5. Experiments

We have generated clusters using both algorithms for several different logs obtained from the Internet Traffic Archive (http://ita.ee.lbl.gov/). The following six web access log data sets, which were collected from various web servers, are used to test our proposed method:

 • EPA-HTTP - a day of HTTP logs from a busy WWW server.
 • SDSC-HTTP - a day of HTTP logs from a busy WWW server.
 • Calgary-HTTP - a year of HTTP logs from a CS departmental WWW server.
 • ClarkNet-HTTP - two weeks of HTTP logs from a busy Internet service provider WWW server.
 • NASA-HTTP - two months of HTTP logs from a busy WWW server.
 • Saskatchewan-HTTP - seven months of HTTP logs from a University WWW server.

The following table gives a brief description of the web access log sets.

Table 1: Internet Traffic Archive (Web Usage Data)

    Server         Location          No. of Requests   Time From
    Saskatchewan   Canada            2,408,625         00:00:00 June
    NASA           Florida           3,461,612         00:00:00 July
    Calgary        Alberta, Canada   726,739           October 24
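The mode-based initialization (step 2) and the GA-based refinement loop (step 6) of Section 4 could be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation. The names `mode_initial_centers`, `sse` and `ga_refine` are hypothetical. A chromosome is assumed to be a vector holding one cluster label per data point, so that crossover and mutation can move a point from one cluster to another. The within-cluster sum of squared distances is swapped in as the fitness value in place of the entropy fitness described in the text. Keeping the fitter half as parents, one-point crossover, and every parameter value other than the population size of 10 are also assumptions.

```python
import random
from collections import Counter


def mode_initial_centers(points, k):
    """Step 2: use the k most frequent data vectors (successive modes) as centers."""
    counts = Counter(tuple(p) for p in points)
    return [list(p) for p, _ in counts.most_common(k)]


def sse(points, labels, k):
    """Assumed fitness: within-cluster sum of squared distances (lower is better)."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, mean)) for p in members
        )
    return total


def ga_refine(points, labels, k, pop_size=10, n_iter=50,
              mutation_rate=0.05, seed=0):
    """Step 6: refine a clustering; returns (best_labels, best_fitness)."""
    rng = random.Random(seed)

    def fitness(chrom):
        return sse(points, chrom, k)

    def mutate(chrom):
        # Mutation: occasionally move a point to a randomly chosen cluster.
        return [
            rng.randrange(k) if rng.random() < mutation_rate else gene
            for gene in chrom
        ]

    # 6a. Initial population p1: the k-means labelling plus mutated copies.
    p1 = [list(labels)] + [mutate(labels) for _ in range(pop_size - 1)]
    # 6b. Global minimum Gmin.
    best = min(p1, key=fitness)
    g_min = fitness(best)
    for _ in range(n_iter):  # 6c. For i = 1 to N
        # i. Reproduction: the fitter half of p1 become parents.
        parents = sorted(p1, key=fitness)[: pop_size // 2]
        p2 = []
        while len(p2) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))  # ii. one-point crossover
            p2.append(mutate(a[:cut] + b[cut:]))  # iii. mutation gives p2
        local_best = min(p2, key=fitness)  # iv. local minimum Lmin
        l_min = fitness(local_best)
        if l_min < g_min:  # v. accept only an improvement
            g_min, best, p1 = l_min, local_best, p2
    return best, g_min
```

Because the starting labelling is part of the initial population and a new population is accepted only when its best fitness is lower, the returned fitness in this sketch can never be worse than that of the input clustering.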
In all the above logs the timestamps have 1 second resolution. The logs fully preserve the originating host and HTTP request, and the traces can be freely distributed. Each log is an ASCII file with one line per request, with the following columns:

1. host making the request. A hostname or the Internet address.
2. timestamp in the format "DAY MON DD HH:MM:SS YYYY".
3. request given in quotes.
4. HTTP reply code.
5. bytes in the reply.

Since different clustering algorithms result in different clusters, it is important to evaluate the results to assess their quality. In clustering, the procedure of evaluating the results is known as cluster validation and can be based on various measures called validity measures. The validity measures are divided into two categories depending on whether they have any reference to external knowledge. By external knowledge we refer to a pre-specified structure which reflects our intuition about the clustering structure of a data set. The measures that have no reference to external knowledge are called internal quality measures, and they are estimated in terms of quantities that involve the data set itself. Dunn's index and the DB index are two internal quality measures that have a close relationship, in that they both try to minimize the within-cluster scatter while maximizing the between-cluster separation in order to find compact and well separated clusters.

5.1. The Dunn Index

The index is defined by the following equation for a specific number of clusters n_c:

$$D_{n_c} = \min_{i=1,\ldots,n_c} \left\{ \min_{j=i+1,\ldots,n_c} \left( \frac{d(c_i, c_j)}{\max_{k=1,\ldots,n_c} \operatorname{diam}(c_k)} \right) \right\}$$

where d(c_i, c_j) is the dissimilarity function between two clusters c_i and c_j, defined as

$$d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} d(x, y)$$

and diam(c) is the diameter of a cluster, which may be considered as a measure of the dispersion of the cluster. The diameter of a cluster C can be defined as follows:

$$\operatorname{diam}(C) = \max_{x, y \in C} d(x, y)$$

It is clear that if the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

5.2. DB Index

Given that K is the number of clusters, C_i and C_j are the closest clusters according to the average distance d, and diam is the diameter of a cluster, the DB index is defined as follows:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left\{ \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)} \right\}$$

It is clear from the above definition that DB is the average similarity between each cluster and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek a clustering that minimizes DB.

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format. The fields of the common log format are:

    S.NO   IP address        Access time             Request method   URL of the page accessed
    1      18.104.22.168     Apr 08, 2002 08:46 PM   GET              http://www.yaledailynews.com
    2      22.214.171.124    Apr 08, 2002 08:43 PM   POST             http://www.waterski.com
    3      126.96.36.199     Apr 08, 2002 08:40 PM   GET              http://www.sony.com

By adding the rating to the log file format, we can find out the worth of a site. Using this, the site developer can also put more effort into development. By periodically mining the web data, low-rated sites can be kept separate; this is also a type of page ranking algorithm.

6. Conclusions And Future Work:

Web usage mining applies data mining techniques to discover usage patterns from Web data. In this paper we have proposed a new method for data logs that adds a rating field, which is helpful for web mining and also for users. In the first step, the initial cluster centers are selected using a statistical mode-based calculation, to allow the iterative algorithm to converge to a "better" local minimum. In the second step, we have proposed a novel method to improve the cluster quality using a Genetic Algorithm (GA) based refinement algorithm. The main proposal is to add the feedback field to the log format.

With this feedback we can separate out the unwanted sites, for which an effective algorithm can be developed. In addition, when a user searches the data in a single site for a long period of time, an algorithm could automatically generate a rating for that site. Future work is to develop an efficient algorithm for this.

7. References:

R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. of the 20th VLDB Conference, pp. 487-499, Santiago, Chile, 1994.

I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, Model-based clustering and visualization of navigation patterns on a Web site. Data Mining and Knowledge Discovery, 7(4):399-424, 2003.

S. Chakrabarti, Mining the Web. Morgan Kaufmann, 2003.

Z. Chen, A. Wai-Chee Fu, and F. Chi-Hung Tong, Optimal algorithms for finding user access sessions from very large Web logs. World Wide Web: Internet and Information Systems, 6:259-279, 2003.

D. Cheng, B. Gersho, Y. Ramamurthi, and Y. Shoham, Fast Search Algorithms for Vector Quantization and Pattern Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1:1-9, 1984.

N. Eiron and K. S. McCurley, Untangling compound documents on the Web. In Proceedings of ACM Hypertext, pages 85-94, 2003.

J.L.R. Filho, P.C. Treleaven, C. Alippi, Genetic algorithm programming environments, IEEE Comput. 27:28-43, 1994.

Y. Fu, K. Sandhu, and M-Y Shih, Clustering of Web users based on access patterns. In Proceedings of WEBKDD, 1999.

B. Hay, K. Vanhoof, and G. Wets, Clustering navigation patterns on a Website using a sequence alignment method. In Proceedings of 17th International Joint Conference on Artificial Intelligence, Seattle, Washington, USA, August, 2001.

T. Kanungo, D.M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A.Y. Wu, An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.

Z. Michalewicz, "Genetic Algorithms + Data Structures = Evolution Programs," Springer, New York, 1992.

O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram, "Mining Web Access Logs Using Relational Competitive Fuzzy Clustering", to be presented at the Eighth International Fuzzy Systems Association World Congress - IFSA 99, Taipei, August 99.

S. Oyanagi, K. Kubota, A. Nakase, Application of matrix clustering to web log analysis and access prediction, in: WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, 2001.

C. Shahabi, A. M. Zarkesh, J. Adibi and V. Shah, "Knowledge discovery from user's web-page navigation," Proc. Seventh IEEE Intl. Workshop on Research Issues in Data Engineering (RIDE), 20-29, 1997.

WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised papers, vol. 2356 of Lecture Notes in Comp Sc, Springer, 113-144, 2002.

J. Srivastava, R. Cooley, M. Deshpande, and P. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, in SIGKDD Explorations, 1(2):1-12, 2000.

Xu R., and Wunsch D., Survey of clustering algorithms. IEEE Trans. Neural Networks, 16(3):645-678, 2005.