2. Content
• Objective
• Description of the data
• Exploratory data analysis
• Model building
– Sequence rules
– Link analysis
– Probabilistic expert systems
– Markov chains
• Model comparison
• Summary report
3. Introduction
• Visitor behaviour on a website can be
predicted by analysing existing data on the
order in which the site’s webpages are visited
• Click flow is captured
• Every click of the mouse corresponds to the
viewing of a webpage.
• A clickstream is the sequence of webpages
requested
4. Objective
• To understand the most likely navigation paths in a website, with
the aim of predicting, possibly online, which pages a visitor will view,
given the path they have taken so far.
• This can be very useful in finding the probability that a visitor will view
a certain page, perhaps a buying page in an e-commerce site.
• It can also find the probability of entering (or exiting) the website
from any particular page.
• Note that since most pages are now dynamically generated, the
idea of viewing a particular page may need to be replaced with the
idea of viewing a particular class of page, or type of page; a class
could be defined by meta information in the header
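
The probabilities named in this objective can be estimated from observed sessions as simple relative frequencies. The following is a minimal sketch, not the method used in the study: the sessions, the page names and the function names are all made up for illustration, and the next-page estimate assumes that only the current page matters (a first-order simplification).

```python
from collections import Counter

# Hypothetical sessions: each list is one visitor's ordered page views.
sessions = [
    ["home", "catalog", "product", "buy"],
    ["home", "catalog"],
    ["catalog", "product"],
    ["home", "product", "buy"],
]

def entry_exit_probabilities(sessions):
    """Share of sessions that start (entry) or end (exit) on each page."""
    n = len(sessions)
    entry = Counter(s[0] for s in sessions)
    exit_ = Counter(s[-1] for s in sessions)
    return ({p: c / n for p, c in entry.items()},
            {p: c / n for p, c in exit_.items()})

def view_probability(sessions, page):
    """Share of sessions in which the page is viewed at least once."""
    return sum(page in s for s in sessions) / len(sessions)

def next_page_distribution(sessions, current):
    """Relative frequency of the page viewed right after `current`.

    Assumption: only the current page is used, not the whole path.
    """
    followers = Counter(
        nxt for s in sessions for cur, nxt in zip(s, s[1:]) if cur == current
    )
    total = sum(followers.values())
    return {p: c / total for p, c in followers.items()}

entry, exit_ = entry_exit_probabilities(sessions)
p_buy = view_probability(sessions, "buy")
after_home = next_page_distribution(sessions, "home")
```

With real data, page names would be replaced by page classes where pages are dynamically generated.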
6. • A log file for a period of about two years, 30
September 1997 to 30 June 1999.
• This data set contains the user id (c_value), a
variable with the date and time at which the
visitor linked to a specific page (c_time),
and the webpage seen (c_caller)
7. • The data set contains 250 711 observations, each
corresponding to a click, that describe the
navigation paths of 22 527 visitors among the 36
pages which compose the site of the webshop.
• The visitors are taken as unique; that is, no
visitor appears with more than one session. But
a page can occur more than once in the same
session. This data set is an example of a
transaction dataset.
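
To make the structure of such a transaction dataset concrete, the sketch below turns click-level rows (visitor id, timestamp, page) into one ordered clickstream per visitor. The rows, page names and the function name `build_clickstreams` are invented for illustration and are not taken from the actual log file; timestamps are assumed to be strings that sort chronologically.

```python
from collections import defaultdict

# Hypothetical log rows: (visitor id, timestamp, page). Values are made up.
log_rows = [
    ("u1", "1997-09-30 10:00:05", "home"),
    ("u2", "1997-09-30 10:01:00", "home"),
    ("u1", "1997-09-30 10:00:40", "catalog"),
    ("u1", "1997-09-30 10:02:10", "home"),
    ("u2", "1997-09-30 10:03:30", "product"),
]

def build_clickstreams(rows):
    """Group clicks by visitor and order each visitor's clicks by time."""
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in rows:
        by_visitor[visitor].append((timestamp, page))
    # Sorting the (timestamp, page) pairs orders the clicks chronologically.
    return {
        visitor: [page for _, page in sorted(clicks)]
        for visitor, clicks in by_visitor.items()
    }

clickstreams = build_clickstreams(log_rows)
```

In this toy example the visitor `u1` views `home` twice, which mirrors the remark that a page can occur more than once in the same session.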
15. A further reduction in
the number of clusters leads to a
noticeable decrease in R² and an
increase in SPRSQ. This can be seen
in Figure 8.3, which plots R² and
SPRSQ versus the number of groups
in the hierarchical agglomerative
algorithm.
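
The two measures can be illustrated on a toy example. The sketch below assumes the usual definitions: R-squared is one minus the ratio of within-cluster to total sum of squares, and SPRSQ (semipartial R-squared) is the drop in R-squared caused by merging two clusters. The one-dimensional data and the cluster assignments are invented for illustration only.

```python
def sum_sq(points):
    """Sum of squared deviations from the mean of the points."""
    mean = sum(points) / len(points)
    return sum((x - mean) ** 2 for x in points)

def r_squared(clusters):
    """Share of total variability explained by the partition."""
    all_points = [x for cluster in clusters for x in cluster]
    total = sum_sq(all_points)
    within = sum(sum_sq(cluster) for cluster in clusters)
    return 1 - within / total

# Hypothetical partitions before and after merging the first two clusters.
three_clusters = [[1.0, 2.0], [4.0, 5.0], [10.0, 11.0]]
two_clusters = [[1.0, 2.0, 4.0, 5.0], [10.0, 11.0]]

r2_three = r_squared(three_clusters)
r2_two = r_squared(two_clusters)

# SPRSQ for this merge: the loss in R-squared from going to fewer clusters.
sprsq = r2_three - r2_two
```

A large SPRSQ for a merge signals that the two clusters being joined are quite different, which is why a jump in SPRSQ suggests stopping before that merge.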
18. Link analysis
• Take the results from the
sequence rules and use link
analysis to build up a global
model.
• Consider all indirect sequences of
any order up to a maximum of 10.
• Link analysis considers each of
the obtained sequences as a row
observation in a data set called
link. It then counts how many of
the observations include a certain
sequence. This is called the count
of a sequence and is the
fundamental measure for link
analysis.
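
The count of a sequence can be sketched as follows. This is an illustration under stated assumptions, not the software's actual procedure: the rows of the `link` data set are invented, and a row is taken to "include" a sequence when the pages appear in the same order, with other pages allowed in between (in line with indirect sequences).

```python
def contains_in_order(row, pattern):
    """True if the pages of `pattern` occur in `row` in order, gaps allowed."""
    remaining = iter(row)
    # Each membership test consumes the iterator up to the matched page,
    # so later pages of the pattern must appear after earlier ones.
    return all(page in remaining for page in pattern)

def sequence_count(link, pattern):
    """Number of rows (observations) of `link` that include the sequence."""
    return sum(contains_in_order(row, pattern) for row in link)

# Hypothetical link data set: one obtained sequence per row.
link = [
    ("home", "catalog", "product"),
    ("home", "product"),
    ("catalog", "product", "buy"),
    ("home", "catalog"),
]

count_home_product = sequence_count(link, ("home", "product"))
count_catalog = sequence_count(link, ("catalog",))
count_reversed = sequence_count(link, ("product", "home"))
```

Order matters in this sketch: the reversed pattern `("product", "home")` is not counted in any row, even though both pages appear together in some of them.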