Advance Clustering Technique Based on Markov Chain for Predicting Next User Movement

Tutorial Paper
Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013

Harish Kumar (1), Dr. Anil Kumar Solanki (2)
(1) PhD Scholar, Mewar University; (2) Professor, BIT Jhansi
Email: harishtaluja@gmail.com
natural step, and it is now the focus of an increasing number of researchers. Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. After these three phases are complete, the analyst can extract the required usage patterns and apply them to specific needs. The reliability of the previously developed methods for finding similar patterns is only up to 50%. Zidrina's research introduced a combined approach that uses both the users' browsing history and the text of the links to analyse users' behavior. Tanasa's research proposed several approaches for extracting sequential patterns with low support from Web usage data. These approaches were also instantiated in concrete methods such as "Cluster & Discover" and "Divide & Discover". The aim of all this previous research is to discover similar patterns in Web log data in order to obtain information about the navigational behavior of users.
Web usage mining, from the data mining perspective, is the task of applying data mining techniques to discover usage patterns from Web data in order to understand and better serve the needs of users navigating the Web. In this work, the aim of Web usage mining is to extract useful information from educational web logs. These patterns are then used to analyze user behavior. The objective of this work is to generate similar patterns with the help of a Markov chain, using web log data preparation methods, data mining algorithms for prediction and classification tasks, and web text mining. The key target of the paper is to develop methods that improve the knowledge discovery steps applied to web log data and that reveal new prospects to the data analyst. To forecast the next user movement effectively, this study outlines a web-based recommendation system, named WebAstro, that predicts the next user movement.
According to our findings, WebAstro helps in web site reorganization. While performing web log analysis, we found that insufficient attention has been paid to the web log data cleaning process. Reducing the number of redundant records makes the data mining process more effective and faster. Therefore a new cleaning framework was introduced that retains only the records corresponding to real user clicks. This cleaning method, named Duster, performs "query based" cleaning. The cleaned data is used to design the Web graph. The Web graph is modeled as a Markov chain, from which a new friend function is generated to calculate probabilities for next-page prediction and behavior analysis [8][9].
K mean clustering algorithm is used for predicting user be

Abstract - Aim: According to a survey, India is one of the leading countries in the world for technical and management education. The number of students is increasing at a growth rate of 45% per annum. Advances in technology have a marked effect on the education system and help in upgrading higher education. Some universities and colleges already use these technologies, and web logs are one of them. The main aim of this paper is to represent web logs using a clustering technique for predicting the next user movement and analyzing user behavior. The paper centers on a web log clustering technique based on Markov chain results. We present an approach to web clustering (clustering web site users) and to predicting their behavior on the next visit. Methodology: To generate effective results, web usage data from approximately 14 engineering colleges is used, and an advanced clustering approach is presented after optimizing the other clustering approaches. Results: User behavior is predicted with the help of the advanced clustering approach based on FPCM and k-means. The proposed algorithm is used to mine and predict users' preferred paths. Existing approaches have been used to predict user behavior, but they are not sufficient because of their sensitivity to noise. With the help of ACM, noise is reduced, which provides more accurate results for predicting user behavior. Approach Implementation: The algorithm was implemented in MATLAB, DTRG and Java. The experimental results show that this method is effective in predicting user behavior and validate its effectiveness in comparison with some previous studies.
Keywords - Markov chain, Web logs, clustering, FPCM (Fuzzy Possibilistic C-Means algorithm), K-means algorithm.

I. INTRODUCTION
A recent study by Google found that Indians are just behind Americans when it comes to searching online about educational institutions and courses. According to the survey, the details of which were released by the online search giant, over 45% of Indian students use the internet to research education [10]. This generates massive data about students' interactions with educational web sites, in the form of web logs or server log files. This research focuses on web log analysis and on methods for processing this web data. Finding hidden information in Web log data is called Web usage mining, which is a part of data mining. Data Mining and Knowledge Discovery is a research discipline involving the study of techniques to search for patterns in large collections of data. The application of data
mining techniques to the web, called web data mining, was a
© 2013 ACEEE
DOI: 03.LSCS.2013.2. 563


havior its advance clustering algorithm Fuzzy C-means (FCM)
is a well-known soft clustering algorithm that allows for overlapping clusters [1]. Overlapping clusters can be useful in applications where the restrictions imposed by crisp clustering, which forces every object into a unique cluster, may not be practical. This paper emphasizes the K-means and FCM algorithms for clustering web navigation patterns on the educational sites of NCR colleges.

useful knowledge, user information and server access patterns allows Web-based organizations to mine user access patterns and helps in future development, maintenance planning, and targeting more rigorous advertising campaigns at groups of users. According to the authors, as the popularity of the web continues to increase, there is a growing need to develop tools and techniques that improve its overall usefulness. They proposed using the k-means algorithm to reduce the computational intensity of the neural network by reducing the input set of samples. This is achieved by clustering the input dataset with the k-means algorithm and then taking only discriminative samples from the resulting clustering schema to perform the learning process.
Chu-Hui Lee et al. [5] proposed a two-way prediction model based on Markov models and Bayesian theorem. The prediction results can be used for personalization, building proper websites, promotion, obtaining marketing information, and forecasting market trends. The Markov model is taken as a probability model with which users' browsing behavior can be predicted at the category level. Bayesian theorem can then be applied to represent and infer users' browsing behavior at the webpage level. With the Markov model, the system can effectively filter the possible categories of websites, and Bayesian theorem helps to predict websites accurately. R. Khanchana et al. [6] proposed a modified version of Lee's prediction model based on Markov models and Bayesian theorem. The authors focus on the preprocessing step and make a few changes to the prediction step. They use a hierarchical agglomerative clustering algorithm on browsing patterns to obtain several user clusters. The cluster data can be projected as a cluster view that replaces the global view. As a result, the authors present an altered prediction model in which view selection is used to match the user's browsing patterns, and the matched view is used for forecasting to improve accuracy.

II. RELATED WORK
G. Sudhamathy et al. [1] presented a survey of various web log clustering algorithms. They provide a brief overview of the fuzzy clustering algorithm, the Temporal Cluster Migration Matrices algorithm and a PSO-based clustering algorithm. They find that the temporal cluster migration matrices approach serves to categorize web users into different clusters and to study their cluster migration behavior over a period of time. The fuzzy clustering approach can be applied to study aspects of e-commerce web sites, starting from ranking users based on their visit time and visit frequency. The PSO optimization technique, applied to web session clustering, is used to identify more accurate session clusters. After their analysis, they conclude that the fuzzy clustering algorithm is simple, effective and practical to apply. J. Vellingiri et al. [2] proposed a fuzzy possibilistic C-means algorithm for clustering in web usage mining to predict user behavior. In recent times, C-means has been found to be superior because of its embedded fuzzy logic. In noisy environments, however, the memberships of FCM do not always correspond well to the degree of belonging of the data and may be inexact. Their paper uses a clustering algorithm called fuzzy-possibilistic C-Means (FPCM), which integrates extended partition entropy and inter-class resemblance computed from the fuzzy set point of view. The proposed approach uses FPCM to find user behavior since it needs only the membership matrix and the possibilistic matrix, and is free from heavy distance computation.
Tasawar et al. [3] proposed a connectivity-based clustering approach for web usage mining (WUM), using agglomerative and divisive approaches for clustering. Swarm-based web session clustering helps in many ways to manage web resources effectively, for example in web personalization, schema modification, website modification and web server performance. They propose web session clustering at the second level of web usage mining (the preprocessing level). The framework covers the data preprocessing steps that prepare the web log data and convert categorical web log data into numerical data. A session vector is obtained from the web data, and swarm optimization can be applied to cluster the web log data. The hierarchical cluster-based approach enhances existing web session techniques by providing more structured information about user sessions. Vinita et al. [4] proposed using the learning capabilities of neural networks to classify web traffic data. The discovery of

III. METHODOLOGY
A. Web Log File
Web Mining: Web mining may be classified into three
categories, namely weblog mining, web content mining, and
web structure mining.

Fig. 1. Categorization of Web Data mining

Web content mining (WCM) finds useful information in the content of web pages [4], e.g. free or semi-structured data such as HTML code, pictures, and various uploaded files.
Web structure mining (WSM) generates a structural summary of the web site and its web pages [7][11]. It tries to discover the link structure of hyperlinks at the inter-document level, whereas web content mining mainly focuses on the structure within a document.
Web usage mining (WUM) is applied to the data generated by visits to a web site, especially the data contained in web log files. This paper highlights and discusses only the research issues involved in web usage mining. In web usage mining, also called web log mining, users' behavior and interests are revealed by applying data mining techniques to web logs. Web log files are of different types:
1. Access Log File
2. Agent Log File
3. Referer Log File
4. Error Log File
Access Log File: Records information about which files are requested from the web server. It is located in the www/logs/ directory.
Agent Log File: Records information about the web clients that make requests to the server.
Referer Log File: Records the URL that the web browser was viewing immediately before making the request to the server. This is particularly useful for determining where requests to the web server come from and which websites refer traffic to it. It is located in the www/logs/ directory.
Error Log File: Records information about failed requests to the server. If someone tries to access a file that does not exist on the server, the server automatically generates an error message. Each of these error messages is recorded in the error log, which is located in the www/logs/ directory.
The three main sources of web log files are:
1. Client Log File
2. Proxy Log File
3. Server Log File
A log file contains the following fields:
• The client's host name or its IP address,
• The client id (generally empty and represented by a "-"),
• The user login (if applicable),
• The date and time of the request,
• The operation type (GET, POST, HEAD, etc.),
• The requested resource name,
• The request status,
• The requested page size,
• The user agent (a string identifying the browser and the operating system used), and
• The referrer of the request, which is the URL of the Web page containing the link that the user followed to get to the current page.
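As an illustration of the field list above, the following is a minimal sketch of how one log entry might be split into those fields. It assumes the Apache-style "combined" layout; the paper does not state which server produced its logs, and the pattern name, function name and sample line are hypothetical.

```python
import re
from datetime import datetime

# Assumption: Apache-style "combined" log layout (not confirmed by the paper).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<client_id>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of the fields listed above, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A dash in the size column means no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical sample entry, for illustration only.
sample = ('192.168.1.10 - - [12/Mar/2013:10:15:32 +0530] '
          '"GET /courses/index.html HTTP/1.1" 200 5120 '
          '"http://www.example.edu/" "Mozilla/5.0 (Windows NT 6.1)"')
entry = parse_log_line(sample)
print(entry["host"], entry["method"], entry["resource"], entry["status"])
```

Lines that do not follow the assumed layout yield None, so a caller can count or discard them during cleaning rather than failing.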
User behavior can be analyzed best from client log files, because log files collected on the client side are more reliable and accurate than server and proxy log files. An extended log file contains a sequence of lines of ASCII characters terminated by either LF or CRLF. Log file generators should follow the line termination convention of the platform on which they are executed, and analyzers should accept either form. Each line may contain either a directive or an entry. Entries consist of a sequence of fields relating to a single HTTP transaction [8]. Fields are separated by whitespace; the use of tab characters for this purpose is encouraged. If a field is unused in a particular entry, a dash "-" marks the omitted field. Directives record information about the logging process itself. Lines beginning with the # character contain directives. The following directives are defined:
Version: <integer>.<integer>
The version of the extended log file format used [7][8].
This draft defines version 1.0.
Fields: [<specifier>...]
Specifies the fields recorded in the log.
Software: string
Identifies the software which generated the log.
Start-Date: <date> <time>
The date and time at which the log was started.
End-Date: <date> <time>
The date and time at which the log was finished.
Date: <date> <time>
The date and time at which the entry was added.
Remark: <text>
Comment information. Data recorded in this field should be
ignored by analysis tools.
Sample web log format is as in Figure 2.
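The rules above (directive lines starting with #, whitespace-separated fields, a dash for an omitted field, LF or CRLF line endings) can be sketched as a small reader. This is a minimal illustration rather than the authors' implementation; the function name and the field names in the sample (time, cs-method, cs-uri) are assumptions chosen for the example.

```python
def parse_extended_log(text):
    """Split an extended-format log into a directive dict and a list of entries."""
    directives = {}
    fields = []
    entries = []
    for line in text.splitlines():  # splitlines() accepts both LF and CRLF
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Version: 1.0"
            name, _, value = line[1:].partition(":")
            name, value = name.strip(), value.strip()
            directives[name] = value
            if name == "Fields":
                fields = value.split()
            continue
        values = line.split()  # fields are separated by whitespace
        # A dash marks an omitted field; it is mapped to None here.
        entries.append({f: (None if v == "-" else v) for f, v in zip(fields, values)})
    return directives, entries

# Hypothetical sample log with mixed line endings.
sample = (
    "#Version: 1.0\r\n"
    "#Date: 12-Jan-1996 00:00:00\r\n"
    "#Fields: time cs-method cs-uri\r\n"
    "00:34:23 GET /foo/bar.html\n"
    "12:21:16 GET -\n"
)
directives, entries = parse_extended_log(sample)
print(directives["Version"], len(entries))
```

Entries are keyed by whatever the Fields directive declares, so the same reader works for logs that record different columns.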
B. Markov’s Model
The pages and hyperlinks of the World-Wide Web may be viewed as nodes and arcs in a directed graph. The relationship between sites and pages indicated by these hyperlinks gives rise to what is called a Web graph. Viewed as a purely mathematical object, each page forms a node in this graph and each hyperlink forms a directed edge from one node to another. The trails that users leave on this graph are called navigation patterns, and they can be used to decide the next likely web page request based on statistically significant correlations. If a sequence occurs very frequently, it indicates the most likely traversal pattern. Because these patterns occur sequentially, Markov chains have been used to represent the navigation patterns of a web site [8][9].
Important properties of a Markov chain:
1. A Markov chain is successful in sequence matching generation.
2. A Markov model depends on the previous state.
3. A Markov chain model is generative.
4. A Markov chain is a discrete-time stochastic process.
The Markov chain model is taken as a probability model and is used to provide the probability of the next link chosen when viewing a Web page, while taking into account the trail followed to reach that page. Our measure of the summarization ability of the model answers a question we

have often been asked about the adequacy of Markov models in representing user Web trails. We use three types of Markov model.
1. First Order Markov Model:
Suppose we have a state space $S = \{S_1, S_2, \ldots, S_n\}$. The state at time $t$ is denoted $S_t$ and the transition probability from state $i$ to state $j$ is denoted $P_{i,j}$. In a first-order Markov chain model, the probability of a state depends only on the previous state; for example, the probability of state $j$ depends on the previous state $i$. The transition probabilities are therefore given by
$P_{i,j} = \Pr(S_t = j \mid S_{t-1} = i)$    (1)
Alternatively, if we consider states at different instants of time $t$, the state can be written as $S(t)$. If $T$ is the number of states in a sequence, then for $T = 4$ a sequence is, for example, $S^T = \{S_1, S_3, S_5, S_1\}$. This model uses the transition probability given by
$P(S_j(t+1) \mid S_i(t)) = P_{ij}$    (2)

Fig. 2. Web logs

TABLE I. USER NAVIGATION PATTERNS AND THEIR FREQUENCIES

Navigation Pattern    Occurrence
S A B C D T           4
S E F G T             8
S B C E F T           4
S A C D T             4
S B C D T             6
S A C E T             14
S B C T               4
S D F G T             2
S D F T               10
S D T                 12
S B C D F T           6
S E F T               2
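The transition probabilities of equation (1) can be estimated from the navigation patterns of Table I by counting how often each page follows a given history. The sketch below assumes a simple relative-frequency estimate, which the paper does not spell out, and reads S and T as the start and end of a session; the function names and the `order` parameter are hypothetical. Setting `order` above 1 conditions on a longer history in the same way.

```python
from collections import Counter, defaultdict

# Navigation patterns and occurrence counts as listed in Table I
# (assumption: S = start of session, T = end of session).
PATTERNS = {
    "S A B C D T": 4, "S E F G T": 8, "S B C E F T": 4, "S A C D T": 4,
    "S B C D T": 6, "S A C E T": 14, "S B C T": 4, "S D F G T": 2,
    "S D F T": 10, "S D T": 12, "S B C D F T": 6, "S E F T": 2,
}

def transition_probabilities(patterns, order=1):
    """Estimate P(next page | previous `order` pages) by relative frequency."""
    counts = defaultdict(Counter)
    for pattern, occurrences in patterns.items():
        pages = pattern.split()
        for i in range(order, len(pages)):
            history = tuple(pages[i - order:i])
            counts[history][pages[i]] += occurrences
    return {
        history: {page: n / sum(nexts.values()) for page, n in nexts.items()}
        for history, nexts in counts.items()
    }

def predict_next(probabilities, history):
    """Return the most probable next page after `history`, or None if unseen."""
    nexts = probabilities.get(tuple(history))
    return max(nexts, key=nexts.get) if nexts else None

first = transition_probabilities(PATTERNS, order=1)
second = transition_probabilities(PATTERNS, order=2)
print(predict_next(first, ["C"]))
print(predict_next(second, ["A", "C"]), predict_next(second, ["B", "C"]))
```

With the Table I counts, the first-order estimate out of page C is 20/42 for D and 18/42 for E, while conditioning on the page visited before C gives 14/18 for E after A and 16/24 for D after B. A single first-order probability for C therefore hides the fact that users arriving from different pages behave differently.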
2. Second Order Transition Probabilistic Model
We let $P_{i,kj}$ be the second-order transition probability, that is, the probability of the transition $(A_k, A_j)$ given that the previous transition that occurred was $(A_i, A_k)$.
The second-order probabilities are estimated as follows:
(5)
We consider the same navigation patterns used in the previous paper.
With this model we found some problems; for example, state C does not accurately show its actual probability. The accuracy of the transition probability out of a state can be increased by separating its in-paths.

Fig. 3. Second Order Markov Model

3. Nth Order Markov Model
The Nth-order Markov model solves the above problems. $P^{n}_{i,j}$ is the probability that state $j$ at time $t$ depends on the previous state $i$ at time $t-n$. The n-order transition probability of the Markov model is denoted by
$P^{n}_{i,j} = \Pr\{S_t = j \mid S_{t-n} = i\}$    (6)

C. Bayesian Theorem
Bayes' theorem is a theorem of probability. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. Bayesian networks (BNs), also known as belief networks, belong to the family of probabilistic graphical models (GMs) [5]. These graphical structures represent knowledge about an uncertain domain. Each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated using known statistical and computational methods. Bayesian methods have been used in a wide variety of contexts; here, Bayes' theorem is used to predict the user's most probable next request. It is assumed that X and Y are two events in a sample space S.

$P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y)} = \dfrac{P(Y \mid X)\,P(X)}{P(Y \mid X)\,P(X) + P(Y \mid \lnot X)\,P(\lnot X)}$    (7)
In equation (7), X stands for a theory or hypothesis that we are interested in testing, and Y represents a new piece of evidence that seems to confirm or disconfirm the theory. In particular, P(X) represents our best estimate of the probability of the next user page request. It is known as the prior probability of X. What we want to discover is the probability that X is true supposing that our new piece of evidence is true. This is a conditional probability, the probability that one proposition is true provided that another proposition is true. Using this idea of conditional probability to express what we want Bayes' theorem to discover, we say that P(X|Y), the probability that X is true given that Y is true, is the posterior probability of X. The idea is that P(X|Y) represents the probability assigned to X after taking into account the new piece of evidence, Y.
To calculate this we need, in addition to the prior probability P(X), two further conditional probabilities indicating how probable our piece of evidence is depending on whether our theory is or is not true. We can represent these as P(Y|X) and P(Y|~X), where ~X is the negation of X, i.e. the proposition that X is false. The following procedure is used for predicting user behavior and for website organization.

Fig. 4. WebAstro Block Diagram

Experimental Methodology
The WebAstro procedure for cleaning and analysis is as follows:
Step 1: Read the web log from the web log database (web server log file).
Step 2: Apply the DUSTER algorithm to refine the web logs:
• Clean HTML, XML, CSS and other tags from the web logs.
• Remove all jpeg, jpg and gif entries.
• Delete words such as "and", "an", "is", etc.
• Keep the reduced log file in a separate folder named WEBASTRO.
Step 3: Sort the cleaned and refined web logs by date and time of visit.
Step 4: Prepare separate tables based on the following fields:
1. User IP Table (User Identification Table)
2. Pages Navigation Table (Transaction Identification Table)
3. Duration Table (Session Identification Table)
Step 5: Normalize the data tables.
Step 6: Initialize the IPADDRESS counter to zero (0).
Check whether the IP address is in the IP Table.
If yes, increment the IPADDRESS counter by one.
Else, insert the IP address into the IP Table.
Step 7: Initialize the PAGEVISIT counter to zero (0).
Check whether the page address is in the PAGENAVIGATION table.
If yes, increment the PAGEVISIT counter by one.
Else, treat the page as invalid and repeat Step 7.
Step 8: Prepare the Transaction Matrix, Similarity Matrix and Relevance Matrix from Steps 4, 5, 6 and 7 until all data sets are in matrix form.
Step 9: Apply the K-means clustering algorithm to the refined test data set and generate the proper clusters.
Let $X = (X_1, X_2, X_3, \ldots, X_n)$ be the set of $n$ distinct users who visit $P$ distinct pages in session $S_i$.
Specific user $= X_i$, where $X_i \in X$.
$K$ = number of web pages visited by user $X_i$ in the session.
Select another user $X_j$ from the set, where $X_j \in X$ and $X_i \in S_i$, $X_j \in S_i$.
If $X_i$ and $X_j$ belong to the same session, they have a common interest in that web session; then
Session_count = Session_count + 1 (increment the session counter by 1)
and generate the matrix VISITij of the number of times each web page was visited:
VISITij = [Matrix] {page i visited by web user j}
Similarly, generate the matrices for the following:
• Page_count = Page_count + 1 (increment the page counter by 1). Generate the matrix for the ith page visited by the jth user.
• Time_count = Time_count + 1 (increment the time counter by 1). Generate the matrix for the time spent by a user on a web page.
Assign the initial mean values for the K clusters.
Plot the clusters using the specified matrices on the basis of session membership, page visits and time spent on the page.
Set the threshold value $\delta$ for the centroid and calculate the distance between the different clusters.
Step 10: Apply fuzzy c-means clustering to the refined test data set and generate the proper clusters.
Consider an unlabelled pattern set $X = (X_1, X_2, X_3, \ldots, X_n)$.
The objective function is used to calculate the within-group sum of squares (WGSS):
$\min J_m(U, W) = \sum_{i=1}^{N} \sum_{j=1}^{C} \mu_{ij}^{m}\, d_{ij}^{2}$
N = number of patterns in X
C = number of clusters
W = cluster center vector
U = membership function matrix; the elements of U are $\mu_{ij}$
$\mu_{ij}$ = degree of membership of $X_i$ in cluster $j$
$d_{ij}^{2} = \lVert X_i - C_j \rVert^{2}$, where $1 \le m < \infty$
m is any real number greater than 1
$C_j$ is the d-dimensional center of cluster $j$.
Step 11: Find the optimized solution and predict user behavior on the basis of the cluster results, cluster density and cluster distance, and compare with the Markov prediction model and the Bayesian model (two-way model).
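The fuzzy c-means step of the procedure above can be sketched in a few lines. This is a minimal illustration of the standard alternating update of cluster centers and memberships, not the authors' MATLAB implementation; the function name, the default fuzzifier m = 2.0, the random initialisation and the six sample rows are all assumptions made for the example.

```python
import math
import random

def fuzzy_c_means(points, n_clusters, m=2.0, iterations=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means: returns (centers, memberships).

    memberships[i][j] is the degree of membership of points[i] in cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iterations):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Membership update from the distances to every center.
        shift = 0.0
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            for j in range(n_clusters):
                new = 1.0 / sum((dists[j] / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                shift = max(shift, abs(new - u[i][j]))
                u[i][j] = new
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u

# Hypothetical rows: (pages visited, minutes on site) for six users.
visits = [(2, 1.0), (3, 1.5), (2, 2.0), (12, 20.0), (14, 18.0), (13, 22.0)]
centers, memberships = fuzzy_c_means(visits, n_clusters=2)
labels = [max(range(2), key=lambda j: row[j]) for row in memberships]
print(labels)
```

Each row of the membership matrix sums to 1, so a user who sits between two groups keeps a partial membership in both. Taking the largest membership per row, as done for `labels`, recovers a crisp assignment comparable to the K-means result of Step 9.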
D. EXPERIMENTAL RESULT
For evaluating the proposed technique, the database is selected from 14 colleges of Northern India universities and engineering colleges, in the form of web logs. The program is implemented in MATLAB and in Java. Only one week of log data is used here for the experimental results. We also check the complexity of the algorithm to show that the output of our approach is up to the mark and more efficient than the other approaches. The data contains a total of 256789 results per web log file and approximately 4503 visits per file. Before cleaning, the size of a single file is approximately 1.288KB; after cleaning all fields, its size is reduced to 498 KB. The proposed approach is developed in Java, and the clustering technique is applied to the test data set in MATLAB. After final optimization we find that our approach is simpler and more refined than the other approaches and gives more effective results for user behavior analysis.

Fig. 5. User Visit per hour Graph

Fig. 6. Page view Graph

Fig. 7. Page visit Graph

Fig. 8. Cluster Generation based on user identification

CONCLUSION AND FUTURE WORKS
The Web is one of the main sources of information. The results are based on the evaluation of web log files from 14 colleges on busy and normal working days. After evaluation we find that the fuzzy logic approach defines the clusters more accurately and provides more accurate results, and that the prediction model based on the Markov chain and Bayesian theorem is more accurate when compared with fuzzy clustering than with K-means clustering. For future work we should explore the use of these techniques in automated software for predicting users' next visit. This helps in analyzing user behavior and understanding the nature of user navigation. The proposed approach also helps in web site modification on the basis of user interest.

REFERENCES
[1] G. Sudhamathy, C. J. Venkateswaran, "Web log clustering approaches - a survey", IJCSE, ISSN 0975-3397, Vol. 3, No. 7, July 2011.
[2] J. Vellingiri, S. Chenthur Pandian, "Fuzzy Possibilistic C-Means Algorithm for Clustering on Web Usage Mining to Predict the User Behavior", European Journal of Scientific Research, ISSN 1450-216X, Vol. 58, No. 2 (2011), pp. 222-230.
[3] Hussain Tasawar, Asghar Sohail and Fong Simon, "A hierarchical cluster based preprocessing methodology for Web Usage Mining", 6th International Conference on Advanced Information Management and Service (IMS), pp. 472-477, 2010.
[4] Vinita Shrivastava, Neetesh Gupta, "Performance Improvement Of Web Usage Mining By Using Learning Based K-Mean Clustering", International Journal of Computer Science and its Applications, ISSN 2250-3765.
[5] Chu-Hui Lee, Yu-Hsiang Fu, "Two level prediction model for user's browsing behavior", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, Vol. I, IMECS 2008, 19-21 March 2008, Hong Kong.
[6] R. Khanchana and M. Punithavalli, "Web Usage Mining for Predicting Users' Browsing Behaviors by using FPCM Clustering", IACSIT International Journal of Engineering and Technology, Vol. 3, No. 5, October 2011.
[7] Harish, Anil Kumar, "Effective Cleaning of Educational Web Site Usage Patterns and Predicting their Next Visit", International Journal of Computer Applications (0975-8887), Volume 53, No. 4, September 2012.
[8] Harish, Anil Kumar, "Analysis of Educational Web Pattern Using Adaptive Markov Chain For Next Page Access Prediction", International Journal of Computer Science and Information Security, July 2011, Volume 9, No. 7.
[9] Bindu Madhuri, Dr. Anand Chandulal J., Ramya K., Phanidra M., "Analysis of Users' Web Navigation Behavior using GRPA with Variable Length Markov Chains", IJDKP.2011.1201.
[10] B. Ramesh Babu, R. Jeyshankar, "Websites of central university in India: A webometric Analysis", DESIDOC Journal of Library and Information Technology, Vol. 30, No. 4, July 2010.
[11] Harish, Anil Kumar, "Clustering algorithm employed in web usage mining: An overview", INDIACOMM-2011, ISSN 0973-7529, ISBN 978-93-80544-00-7.

AUTHOR PROFILE:
Harish Kumar has completed his M.Tech (IT) from Guru Gobind Singh Indraprastha University, Delhi. He is currently pursuing his Ph.D. at Mewar University, Chittorgarh.
Prof. (Dr.) Anil Kumar Solanki received his PhD in CSE from Bundelkhand University. He has published a good number of papers in national and international journals.
