By

USING FEEDBACK AND K-MEANS CLUSTERING TO
REFINE WEB DATA

Abstract

Nowadays a great many web sites are being developed, and users searching among them often cannot find the accurate data they need. Web mining can be carried out with page ranking algorithms, among many others. In this paper, users refine web pages by giving feedback or a rating, either manually or automatically. K-means clustering is a basic algorithm in everyday use. We propose a genetic algorithm to improve cluster quality and produce more accurate clusters, and we apply it to web logs for further refinement. Web mining with feedback eliminates unwanted sites from the web, and it also helps improve the user data used in developing sites.

KEY WORDS: Web mining, clustering, k-means, web logs.

1. Introduction

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.

There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource
discovery based on concept indexing, or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.

We can broadly categorize Web data clustering into (i) users' sessions-based and (ii) link-based. The former uses the Web log data and tries to group together a set of users' navigation sessions having similar characteristics. In this framework, Web-log data provide information about activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it. The records of users' actions within a Web site are stored in a log file. Each record in the log file contains the client's IP address, the date and time the request is received, the requested object, and some additional information such as the protocol of the request, the size of the object, etc. Figure 1 presents a sample of a Web access log file from a Web server.

Figure 1: A sample of Web Server Log File

141.243.1.172 [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497
query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325
tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014
wpbfl2-45.gate.net [29:23:54:15] "GET / HTTP/1.0" 200 4889
wpbfl2-45.gate.net [29:23:54:16] "GET /icons/circle_logo_small.gif HTTP/1.0" 200 2624
wpbfl2-45.gate.net [29:23:54:18] "GET

The standard K-Means algorithm was used to cluster users' traversal paths. However, it is not clear how the similarity measure was devised and whether the clusters are meaningful. Associations and
sequential patterns between web transactions are discovered based on the Apriori algorithm. Good surveys of clustering algorithms are available in the literature. The k-means algorithm is one of the most widely used clustering algorithms. The algorithm partitions the data points (objects) into k groups (clusters), so as to minimize the sum of the squared distances between the data points and the center (mean) of the clusters. To apply the k-means algorithm:

1. Choose k data points to initialize the clusters.
2. For each data point, find the nearest cluster center and assign that data point to the corresponding cluster.
3. Update the cluster center of each cluster using the mean of the data points assigned to that cluster.
4. Repeat steps 2 and 3 until there are no more changes in the values of the means.

In spite of its simplicity, the k-means algorithm involves a very large number of nearest neighbor queries. The high time complexity of the k-means algorithm makes it impractical for data sets with a large number of points. Reducing the large number of nearest neighbor queries in the algorithm can accelerate it. In addition, the number of distance calculations increases exponentially with the dimensionality of the data.

Many algorithms have been proposed to accelerate k-means. The use of kd-trees has been suggested; however, backtracking is required, in which case the computational complexity increases. Kd-trees are not efficient for higher dimensions. Furthermore, it is not guaranteed that an exact match of the nearest neighbor can be found unless some extra search is done. Elkan suggests the use of the triangle inequality to accelerate k-means. The use of R-Trees has also been suggested; nevertheless, R-Trees may not be appropriate for higher dimensional problems. The Partial Distance (PD) algorithm has also been proposed. It allows early termination of the distance calculation by introducing a premature exit condition in the search process.

As seen in the literature, researchers have contributed only to accelerating the algorithm; there is no contribution to cluster refinement. In this study, we propose a new algorithm to improve k-means clustering in web usage data mining.
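
As a rough illustration of the four k-means steps listed earlier in this section, the following Python sketch clusters points given as numeric tuples. It is a minimal sketch, not the implementation used in this paper: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of squared Euclidean distance, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: choose k data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = [None] * len(points)

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        # (squared Euclidean distance is an assumption of this sketch).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment

        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if the cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))

    return centers, assignment
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` returns the two centers together with a list giving the cluster index of each point.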

The proposed algorithm consists of two steps. In the first step, to avoid local minima, we present a simple and efficient method to select the initial centroids based on the mode value of the data vector, and the k-means algorithm is applied to cluster the data vectors. In the second step, a Genetic Algorithm (GA) is applied to refine the clusters and so improve the quality of the clusters of users' sessions.

The paper is organized as follows: the following section defines the web access logs. Section 3 presents the standard k-means algorithm. Section 4 presents the proposed cluster refinement algorithm with a Genetic Algorithm (GA) to improve the users' session clusters. The experiments and results are then presented, and the work is concluded.

2. Web Access Logs:

Basic access logs

In general, a web server log record consists of these fields: (i) user's IP address, (ii) access time, (iii) request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) protocol (typically HTTP/1.0), (vi) number of bytes.

These fields are filled in automatically by the system's programs.

Modified access logs

A modified web server log record consists of these fields: (i) user's IP address, (ii) access time, (iii) request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) protocol (typically HTTP/1.0), (vi) number of bytes, (vii) rating or feedback.

The last field holds the rating of the site, indicating whether or not the site is useful for the user's requirements. This makes it helpful for the refinement of web data.

Rating sites typically show a series of images (or other content) in random fashion, or chosen by computer algorithm, rather than allowing users to choose. They then ask users for a rating or assessment, which is generally done quickly and without great deliberation. Users score items on a scale of 1 to 10, or yes or no. Others, such as BabeVsBabe.com, ask users to choose between two pictures. Typically, the site gives instant feedback in terms of the item's running score, or the percentage of other users who agree with the assessment. They sometimes offer
aggregate statistics or "best" and "worst" lists. Most allow users to submit their own image, sample, or other relevant content for others to rate. Some require the submission as a condition of membership.

3. Standard K-Means Algorithm

One of the most popular clustering techniques is the k-means clustering algorithm. Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e. the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose centre is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strength of k-means is its runtime, which is linear in the number of data elements, and its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, we run k-means using the correct cluster number.

1. Choose a number of clusters K.
2. Initialize cluster centers n1, …, nk.
   a. Could pick k data points and set cluster centers to these points.
   b. Or could randomly assign points to clusters and take the means of the clusters.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster).
5. Stop when there are no new re-assignments.

4. Genetic Algorithm

The initial cluster centers are normally chosen either sequentially or randomly, as given in the standard algorithm. The quality of the final clusters depends on these initial seeds. This may lead to a local minimum, which is one of the disadvantages of k-means clustering. To avoid this, in our method we select the modes of the data vector as initial cluster centers. Based on the number of clusters, the modes are selected one
after another. Initially, the first mode value is selected as the center for the first cluster, and the next most frequently occurring value (the next mode value) is assigned as the center for the next cluster.

Genetic algorithms (GAs) are randomized search and optimization techniques guided by the principles of evolution and natural genetics, having a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.

In this algorithm, the search space is encoded in the form of strings (called chromosomes). The basic reason for our refinement is that, in any clustering algorithm, the obtained clusters will never give 100% quality. There will be some errors, known as misclustering; that is, a data item can be wrongly clustered. These kinds of errors can be avoided by using our refinement algorithm. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job shop scheduling, etc.

The clusters obtained from the improved k-means clustering are considered as input to our refinement algorithm. Initially a random point is selected from each cluster; with these a chromosome is built. In this way an initial population of 10 chromosomes is built. For each chromosome the entropy is calculated as the fitness value, and the global minimum is extracted. With this initial population, the genetic operators such as reproduction, crossover and mutation are applied to produce a new population. While applying the crossover operator, the cluster points get shuffled, meaning that a point can move from one cluster to another. From this new population, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum, and the next iteration continues with the new population. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations.

In the following section, it is shown that our refinement algorithm improves the cluster quality. The algorithm is given as:

1. Choose a number of clusters k
2. Initialize cluster centers n1, …, nk based on mode
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster)
5. Stop when there are no new re-assignments.
6. GA based refinement
   a. Construct the initial population (p1)
   b. Calculate the global minimum (Gmin)
   c. For i = 1 to N do
      i. Perform reproduction
      ii. Apply the crossover operator between each parent.
      iii. Perform mutation and get the new population (p2)
      iv. Calculate the local minimum (Lmin).
      v. If Lmin < Gmin then
         a. Gmin = Lmin;
         b. p1 = p2;
   d. Repeat

5. Experiments

We have generated clusters using both algorithms for several different logs obtained from the Internet Traffic Archive (http://ita.ee.lbl.gov/). The following six web access log data sets, which are collected from various web servers, were used to test our proposed method.

• EPA-HTTP - a day of HTTP logs from a busy WWW server.
• SDSC-HTTP - a day of HTTP logs from a busy WWW server.
• Calgary-HTTP - a year of HTTP logs from a CS departmental WWW server.
• ClarkNet-HTTP - two weeks of HTTP logs from a busy Internet service provider WWW server.
• NASA-HTTP - two months of HTTP logs from a busy WWW server.
• Saskatchewan-HTTP - seven months of HTTP logs from a University WWW server.

The following table gives a brief description of each web access log set.

Table 1: Internet Traffic Archive (Web Usage Data)

Server         Location          No. of Requests   Time From
Saskatchewan   Canada            2,408,625         00:00:00 June
NASA           Florida           3,461,612         00:00:00 July
Calgary        Alberta, Canada   726,739           October 24
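
To make the record layout concrete, the following Python sketch splits one line of the kind shown in Figure 1 into named fields. It is a minimal sketch under stated assumptions: the regular expression, the function name `parse_log_line`, and the field names are hypothetical, and the bracketed timestamp is assumed to be day:hour:minute:second, as the Figure 1 sample suggests. Other logs in the archive may use a different timestamp layout and would need a different pattern.

```python
import re

# Pattern for lines shaped like the Figure 1 sample (an assumption of this sketch):
#   host [DD:HH:MM:SS] "METHOD URL PROTOCOL" STATUS BYTES
LOG_LINE = re.compile(
    r'^(?P<host>\S+) '
    r'\[(?P<day>\d+):(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("day", "hour", "minute", "second", "status"):
        record[key] = int(record[key])
    # A "-" byte count is treated as zero bytes in this sketch.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Lines that do not match, such as a truncated record, yield `None`, so a caller can count or skip them before building user sessions.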

All of the above logs have timestamps with 1 second resolution. The logs fully preserve the originating host and HTTP request, and these traces can be freely distributed. Each log is an ASCII file with one line per request, with the following columns:

1. host making the request. A hostname or the Internet address.
2. timestamp in the format "DAY MON DD HH:MM:SS YYYY".
3. request given in quotes.
4. HTTP reply code.
5. bytes in the reply.

Since various clustering algorithms result in different clusters, it is important to perform an evaluation of the results to assess their quality. In clustering, the procedure of evaluating the results is known as cluster validation and can be based on various measures called validity measures. The validity measures are divided into two categories depending on whether they have any reference to external knowledge. By external knowledge we refer to a pre-specified structure which reflects our intuition about the clustering structure of a data set. The measures that have no reference to external knowledge are called internal quality measures, and they are estimated in terms of quantities that involve the data set. Dunn's index and the DB index are two internal quality measures that have a close relationship in that they both try to minimize the within-cluster scatter while maximizing the between-cluster separation in order to find compact and well separated clusters.

5.1. Dunn Index

The index is defined by the following equation for a specific number of clusters nc:

\[
D_{nc} = \min_{i=1,\dots,nc} \left\{ \min_{j=i+1,\dots,nc} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,nc} \operatorname{diam}(c_k)} \right) \right\}
\]

where d(ci, cj) is the dissimilarity function between two clusters ci and cj, defined as

\[
d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} d(x, y)
\]

and diam(c) is the diameter of a cluster, which may be considered as a measure of dispersion of the clusters. The diameter of a cluster C can be defined as follows:

\[
\operatorname{diam}(C) = \max_{x, y \in C} d(x, y)
\]

It is clear that if the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

5.2. DB Index

Given that K is the number of clusters, Ci and Cj are the closest clusters according to average
distance d and diam is the diameter of a cluster, the DB index is defined as follows:

\[
DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left\{ \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)} \right\}
\]

It is clear from the above definition that DB is the average similarity between each cluster and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek a clustering that minimizes DB.

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format. The fields of the common log format are:

S. No.   IP address        Access time             Request method   URL
1        115.242.159.123   Apr 08, 2002 08:46 PM   GET              http://www.yaledailynews.com
2        125.242.149.122   Apr 08, 2002 08:43 PM   POST             http://www.waterski.com
3        234.222.111.152   Apr 08, 2002 08:40 PM   GET              http://www.sony.com

By adding the rating to the log file format, we can find out the worth of a site. Using this, site developers can also put more effort into development. When web mining is performed periodically on the web data, low-rated sites are kept separately; this is also a type of page ranking algorithm.

6. Conclusions and Future Work:

Web usage mining applies data mining techniques to discover usage patterns from Web data. In this paper we have proposed a new method for data logs that adds a rating field, which will be helpful for web mining and also for users. In the first step, the initial cluster centers are selected by a statistical mode-based calculation to allow the iterative algorithm to converge to a "better" local minimum. In the second step, we have proposed a novel method to improve cluster quality using a Genetic Algorithm (GA) based refinement algorithm. The proposal is to add the feedback field to the log format.

With this feedback we can separate the unwanted sites, and an effective algorithm can be developed for that. Also, based on the time a user spends searching for data within a single site over a long period, an algorithm could automatically generate a rating for that site. Future work is to develop an efficient algorithm for this.
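
The idea of separating unwanted sites by their feedback can be sketched in a few lines of Python. This is only an illustration under assumptions not stated in the paper: each log record is taken to be a dictionary with hypothetical `url` and `rating` keys, ratings are assumed to lie on a 1 to 10 scale, and the `threshold` and `min_votes` parameters are invented for the example.

```python
from collections import defaultdict


def low_rated_sites(records, threshold=4.0, min_votes=1):
    """Return the URLs whose average rating falls below `threshold`.

    `records` is an iterable of dicts with a "url" key and an optional
    numeric "rating" key (hypothetical field names for this sketch).
    """
    totals = defaultdict(lambda: [0.0, 0])  # url -> [sum of ratings, vote count]
    for record in records:
        rating = record.get("rating")
        if rating is None:  # records without feedback are ignored
            continue
        entry = totals[record["url"]]
        entry[0] += rating
        entry[1] += 1

    # Average the ratings of every URL that has enough votes.
    averages = {
        url: total / count
        for url, (total, count) in totals.items()
        if count >= min_votes
    }
    return sorted(url for url, avg in averages.items() if avg < threshold)
```

Raising `min_votes` keeps a single poor rating from flagging a site, which is one possible safeguard when feedback is sparse.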
