Web usage mining

Web Usage Mining
• Mining the behavior of human users
• Understand the customers
• Track the behavior and make
recommendations
• Customize the appearance
• Based on Click stream analysis
– This is the lowest level of data
– Needs to be aggregated to Session level data
12/3/2018 Professor V. Nagadevara

Web Usage Mining
• Analyze Click-stream Data
– From client or server point of view
• Used for
– Personalization
– Determine frequent access usage
– For caching
– Improve sales and advertisement

Sources of Data
• Web server log files
• Page tags
• Cookies

Types of click-stream data
• Site centric
– Server log files of a website
– Information on behavior within the website
– Information of cookie ID and IP address
– Lack information regarding activity on other sites
(competing sites?)

Web Server Log Files
• Also called click stream data
• The log files are customized by the server.
There are four general formats:
– NCSA Common Log (Access Log format),
– NCSA Combined Log,
– NCSA Separate Log, and
– W3C Extended Log

NCSA Common Log
• Includes the client IP address, client identifier,
visitor username, date and time, HTTP
request, status code for the request, and the
number of bytes transferred
• 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043
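A minimal sketch of how one Common Log entry could be split into the fields listed above. It assumes plain ASCII quotes and a plain hyphen for the missing client identifier; the pattern and the name `parse_common_log` are illustrative and not taken from any particular tool.

```python
import re
from datetime import datetime

# host ident authuser [timestamp] "request" status bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common_log(line):
    """Return a dict of fields for one Common Log line, or None if it does not match."""
    m = COMMON_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the text fields that have a natural non-string type.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

line = ('172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] '
        '"GET /index.html HTTP/1.0" 200 1043')
record = parse_common_log(line)
print(record["host"], record["user"], record["request"], record["status"], record["bytes"])
```

The month abbreviation is parsed with `%b`, so the sketch assumes an English locale.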

NCSA Combined Log
• Common Log plus
– the referring URL, the visitor’s Web browser and operating system information, and the cookie
• 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043 "http://www.dataminingresources.blogspot.com" "Mozilla/4.05 [en] (WinNT; I)" "USERID=CustomerA; IMPID=01234"

NCSA Separate Log
• Same information as the combined log, but in three separate files: the access log, the referral log, and the agent log
Common Log: 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043
Referral Log: [18/Dec/2013:11:25:15 +0530] "http://www.dataminingresources.blogspot.com/"
Agent Log: [18/Dec/2013:11:25:15 +0530] "Microsoft Internet Explorer - 7.0"

W3C Extended Log
• Provides better control and manipulation of data while producing a log file readable by most Web analytics tools
• #Software: Microsoft Internet Information Services 6.0
• #Version: 1.0
• #Date: 2009-05-24 20:18:01
• #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
• 2009-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+7.01;+Windows+2000+Server) http://54.114.24.224/
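Because the #Fields directive names the columns, an Extended Log can be read without hard-coding the layout. The sketch below assumes fields are separated by whitespace and that spaces inside a value are written as "+", as in the example entry; `parse_w3c_extended` is a hypothetical helper name.

```python
def parse_w3c_extended(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives start with '#'; only #Fields changes how entries are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        yield dict(zip(fields, line.split()))

log = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "#Date: 2009-05-24 20:18:01",
    "#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)",
    "2009-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - "
    "200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+7.01;+Windows+2000+Server) "
    "http://54.114.24.224/",
]
entries = list(parse_w3c_extended(log))
print(entries[0]["c-ip"], entries[0]["cs-uri-stem"], entries[0]["sc-status"])
```

If a later #Fields directive adds customized fields, the same loop picks up the new column names for the entries that follow it.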

W3C Extended log
• Can be extended with customized fields
• #Software: Microsoft Internet Information Services 6.0
• #Version: 1.0
• #Date: 2002-05-24 20:18:01
• #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
• 2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) http://64.224.24.114/

Page Tags
• This is client-side data collection
• Tags (JavaScript snippets) are added to web pages
• When web pages are downloaded, the “tags” are also downloaded
• These tags are then “executed” and information is sent to a data center by requesting a small file, with a long query string appended to the request; the small file is called a “Web Bug”
• The data center parses the query and sends the file, completing the transaction
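The request for the small file can be illustrated from both ends. This is only a sketch: the collector URL and the parameter names `page`, `ref` and `vid` are made up for the example, and real page tags run as JavaScript in the browser rather than in Python.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical data-center endpoint that serves the small file.
COLLECTOR = "https://collector.example.com/pixel.gif"

def build_beacon_url(page, referrer, visitor_id):
    """Tag side: append the collected values to the request as a query string."""
    query = urlencode({"page": page, "ref": referrer, "vid": visitor_id})
    return f"{COLLECTOR}?{query}"

def parse_beacon_url(url):
    """Data-center side: recover the values from the query string."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}

url = build_beacon_url("/products/list", "https://www.example.com/", "abc123")
hit = parse_beacon_url(url)
print(url)
print(hit)
```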

Page Tags
• Tags can be customized
• Variables can be pre-determined and pre-
formatted
• Cookies can be dropped for unique identification
• Data can be parsed automatically
• More accurate because client-side. Crawlers
don’t really render pages!
• Data can be reported/analyzed in real time

Page Tags
• Issues
– Dependence on JavaScript
– Adding tags to each page (Manual is very difficult)
– Adds “weight” to pages.
– Errors on pages or failed downloads
– Vendors do not like individual customization
– Ownership of data is an issue
– Privacy issues

Cookies
• Used for identifying the uniqueness of the
user
• Can be deleted or prevented
• First party cookie is dropped (served) directly
from the website
• Third party cookies are served from another
domain. These can “observe” the user’s
behavior across multiple domains

Primary Groups of Data
• Usage data
• Content data
• Structure data
• User Data

Usage Data
• “Page View” is the most basic level
– “Aggregate representation of a collection of web
objects contributing to the display on a user’s
browser resulting from a single user action (click)”
– It is a collection of web objects or resources
representing a specific user event
– Eg. Reading an article, viewing a product list,
viewing a detailed list, adding an item to the cart

Usage Data
• Session
– “A session is a sequence of page views by a single
user during a single visit”
– We normally select a subset of page views that are
significant or relevant for the analysis

Content Data
• “Collection of objects and relationships that is
conveyed to the user”
• Consist of static pages, multimedia files,
dynamic page segments, records from
operational databases etc.
• Also include conceptual hierarchies such as
product categories

Structure Data
• “Represents designers view of the content
organization”
• Captured by the inter-page linkage structure
between pages
• These are reflected by hyper links

User Data
• Information regarding user profile
• Demographic information on registered users
• Past purchases
• Reviews and ratings
• Visit histories
• Anonymous information collected by cookies

Data Pre-processing
• Data Fusion and Cleaning
• Page View identification
• User identification
• Sessionization
• Path Completion
• Data Integration

Data Fusion and Cleaning
• Data is drawn from multiple web or application servers
• Data fusion is merging log files from different servers
• Cleaning involves removal of unnecessary data from log files
• Removal of crawler navigation, by crawler name or by heuristics
• “Keynote”, a performance monitoring system accessed the source site
for KDD Cup 2000, three times per minute all day, every day!
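A rough sketch of fusion and cleaning on already-parsed records. The record keys (`time`, `url`, `agent`), the crawler markers and the list of static file suffixes are assumptions for illustration; a real list would come from knowledge of the site and its logs.

```python
from urllib.parse import urlsplit

# Assumed examples only.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "keynote")
STATIC_SUFFIXES = (".css", ".js", ".gif", ".jpg", ".png", ".ico")

def is_crawler(user_agent):
    """Identify crawlers by name fragments in the user-agent string."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

def is_static_resource(url):
    """Treat style sheets, scripts and images as unnecessary for usage analysis."""
    return urlsplit(url).path.lower().endswith(STATIC_SUFFIXES)

def fuse(*server_logs):
    """Merge logs from several servers into one list ordered by timestamp."""
    merged = [record for log in server_logs for record in log]
    return sorted(merged, key=lambda record: record["time"])

def clean(records):
    return [r for r in records
            if not is_crawler(r["agent"]) and not is_static_resource(r["url"])]

server_a = [
    {"time": 3, "url": "/profile/mainmenu.php?checksum=", "agent": "Mozilla/4.0"},
    {"time": 1, "url": "/P/css/homestyle.css", "agent": "Mozilla/4.0"},
]
server_b = [
    {"time": 2, "url": "/index.html", "agent": "ExampleBot/1.0"},
    {"time": 0, "url": "/P/search.php?j=300", "agent": "Mozilla/5.0"},
]
cleaned = clean(fuse(server_a, server_b))
print([r["url"] for r in cleaned])
```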

Page View Identification
• Requires understanding of the structure of the
site, page contents, site domain knowledge
• Can be a single file (one-to-one
correspondence with page view)
• Can be a collection of objects, or dynamically
constructed page
• Can be hierarchical list (eg. Information pages,
product views, registration, shopping cart
changes, payment etc.)

User Identification
• Easy if the user has to login
• IP addresses alone are not very accurate (problem with
proxy servers)
• Combination of IP address and browser
• More difficult across different sessions (multiple
machines and multiple users)
• Cookies are a possible option
– Different browsers
– Different computers
– Cookies are deleted!

Sessionization
• Process of identifying the page views
requested by a single user in a single session
• Find all page requests from the same user and
group them using heuristics
• Issue a “session id”
• Modify the URL in the log record to include
session id
• Decide when the session ended!

Sessionization
• Time oriented Heuristics
– Total session duration may not exceed Θ
– Total time on a page may not exceed δ
• Referrer oriented
– A request q is added to the session S if the referrer
for q is previously invoked in S
– Else q is the starting point for a new session
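The time-oriented heuristics can be sketched as follows for the requests of one user. The slide gives no values for Θ and δ, so the defaults below (1800 and 600 seconds) are placeholders only.

```python
def sessionize(timestamps, theta=1800, delta=600):
    """Split one user's sorted request times (in seconds) into sessions.

    theta: maximum total session duration.
    delta: maximum time spent on a single page (gap between two requests).
    """
    sessions = []
    current = []
    for t in timestamps:
        too_long = current and t - current[0] > theta
        too_idle = current and t - current[-1] > delta
        if too_long or too_idle:
            # Either rule is violated, so this request starts a new session.
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

sessions = sessionize([0, 500, 1000, 1500, 2000, 3000])
print(sessions)
```

In this run the request at 2000 s breaks the Θ rule (more than 1800 s after the session started) and the request at 3000 s breaks the δ rule (1000 s after the previous request).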

Sessionization
• Episode
– A subset of relevant page views in a session
– Comprising functionally or semantically related
page views
– Requires classification of page views into
functional or concept categories

Path Completion
• The paths are incomplete
– Caching leads to missing entries
– Caching by proxy servers
– Back button creates missing links
• Session log contains time stamps which can be
mined
– Missing pages do not have time stamps
– Dynamic pages are unique and not cached!
• Requires knowledge of the site structure and referrer
information
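One simple way to use referrer information is sketched below: when a request's referrer is not the page the user was last seen on, the user is assumed to have stepped back through cached pages until reaching that referrer. This is an illustrative heuristic that relies only on the pages already seen in the session, not on a full model of the site structure.

```python
def complete_path(requests):
    """requests: (page, referrer) pairs for one session, in time order.

    Returns the page sequence with inferred back-button steps inserted.
    """
    path = []
    for page, referrer in requests:
        if path and referrer is not None and path[-1] != referrer and referrer in path:
            # Index of the most recent visit to the referrer.
            i = len(path) - 1 - path[::-1].index(referrer)
            # Re-insert the pages walked back through; these revisits were
            # served from cache, so they have no entries (or time stamps) in the log.
            path.extend(reversed(path[i:-1]))
        path.append(page)
    return path

logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(logged)
print(completed)
```

Here the log shows A, B, C, D, but D was reached from A, so the missing back steps C to B and B to A are inferred.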

Data Integration
• Pre-processing results in a set of sessions or episodes
• Other data (demographics, ratings, past purchases
etc.) needs to be integrated to lead to WA/BI metrics
such as customer conversion ratios, lifetime value
• Additional data – shopping cart changes, shipping and
address info, click throughs, impressions
• The transactional database is extracted into data marts
or OLAP cubes after certain amount of aggregation

Modeling
• Statistical Analysis
– Aggregated by pre-determined units (days, sessions,
visitors etc.)
– Most frequent pages, average view time, length of
path, entry and exit etc.
– Referrers, user agents, requested resources
– Usually presented in bar charts, tables and
comparative tables
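A small sketch of these aggregate statistics computed from sessionized data. The three sessions are toy data invented for the example, and view time is approximated as the gap until the next click, so the last page of each session has no view time.

```python
from collections import Counter

# Each session is a list of (page, time_in_seconds) pairs in click order (toy data).
sessions = [
    [("index", 0), ("search", 30), ("viewprofile", 90)],
    [("index", 10), ("login", 25)],
    [("search", 100), ("viewprofile", 160), ("logout", 400)],
]

page_counts = Counter(page for s in sessions for page, _ in s)
entry_pages = Counter(s[0][0] for s in sessions)
exit_pages = Counter(s[-1][0] for s in sessions)
avg_path_length = sum(len(s) for s in sessions) / len(sessions)

# View time of a page = time until the next click in the same session.
view_times = {}
for s in sessions:
    for (page, t), (_, t_next) in zip(s, s[1:]):
        view_times.setdefault(page, []).append(t_next - t)
avg_view_time = {page: sum(v) / len(v) for page, v in view_times.items()}

print(page_counts.most_common(3))
print(entry_pages, exit_pages)
print(avg_path_length, avg_view_time)
```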

Modeling
• Segmentation – use cluster analysis
• Associations and correlation analysis
• Frequent item-set graph
• Sequential and navigational patterns
• Predictive analytics using classification
techniques

Prof. Vishnuprasad Nagadevara
Indian Institute of Management Bangalore

Information from Web Analytics
• How many visitors visit the page daily?
• Who are the regular visitors?
• What percentage of the visitors to the page are registered users?
• What are the top pages that are visited on the website?
• What is the average visit time on the website?
• How often does the visitor return to the site?
• What is the average page depth of a visitor?
• What is the geographic distribution of users of the website?
[Diagram: Web Analytics, with Personalization, System Improvement, Site Modification, Business Intelligence and Usage characteristics]

Objectives of the Study
• The objectives of this study are to
– Explore Web analytics and its usefulness to web
based business.
– Identify the techniques used in click stream
analysis.
– Identify the application of click stream analysis
through analyzing click stream data obtained from
a particular website using appropriate click stream
analysis techniques.

Methodology
• This study analyzes the click stream data obtained from a web site, which
specializes in an online information exchange service to facilitate
identification of suitable partners, in India and other countries.
• The site has a very different revenue model. The visitors are allowed to
browse through the site without any initial payment. The visitors are
allowed to look at the profiles of prospective partners free of charge. The
visitors will have to become members by making a one-time payment only
when they need to contact the prospective brides or grooms.
• Users can search for profiles through advanced search options on the site
on various preferences ranging from basic details of preferred partner to
lifestyle, career, education, profession etc.

Methodology
• Members can make initial contact with each other through services
available via Chat, SMS, and e-mail.
• Users can avail free registration on the website and are assured of
exclusive privacy and confidentiality. The website allows the users to
create their profiles, search for other profiles, and express interest in
other profiles and contact others. Registration and creating a profile is free
of cost.
• Registered users can become paid members that will allow them to
contact others, view contact details of other members, write personalized
messages, initiate chats and let other members view their contact details.
Paid memberships are provided for a specified duration.

Methodology
• The click stream data is analyzed to identify different
paths taken by the visitors and the sequence of
pages that lead to payment of membership fee.
Based on this analysis, specific strategies are
recommended to maximize the revenue for the
website.

DATA PREPARATION
Problem: Format of data
– Clickstream data files are neither delimited nor fixed length files
Solution:
– Used the date in the clickstream as the delimiter to import data to database
– Have to perform string handling in database to separate out the fields
10.208.65.96 172.16.8.37, 124.124.35.130 - - [23/May/2008:00:00:00 -0400] "GET /billing/billing.php?user=&cid=22401528da14a61c43512fa025b59578i353273 HTTP/1.0" 200 1832
10.208.65.96 68.126.193.219 - - [23/May/2008:00:00:00 -0400] "GET /profile/js/common.js HTTP/1.1" 200 12462
10.208.65.96 59.95.71.32 - - [23/May/2008:00:00:00 -0400] "GET /P/css/comm_style.css HTTP/1.1" 200 2640
10.208.65.96 122.163.70.145 - - [23/May/2008:00:00:00 -0400] "GET /P/search.php?checksum=&searchchecksum=16465054&j=300&newsearch=&inf_checksum=&castemapping=&crmback=&searchorder=T&label_select_no=&savesearch=&from_index=&viewall=&save_search_redirect=&hide_search_bar=y HTTP/1.1" 200 21561
10.208.65.96 61.1.81.153 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.1" 304 26
10.208.65.96 68.197.236.117 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum=3590208069017f9d75933dfa9ac9005d|i|537f26ca181f05c308393257397ab261i2810388 HTTP/1.1" 200 3333
10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26
10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
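The study separated the fields with string handling inside the database. As an alternative illustration only, the sketch below splits one record with a regular expression, anchoring on the bracketed date in the same spirit; the pattern and the name `parse_site_log` are assumptions, and the client field is kept as a list because some records carry several comma-separated addresses.

```python
import re

# server IP, one or more comma-separated client IPs, two "-" placeholders,
# bracketed timestamp, quoted request line, status code, bytes transferred
SITE_LOG = re.compile(
    r'(?P<server_ip>\S+) (?P<client_ips>.+?) - - '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_site_log(line):
    """Return the fields of one clickstream record, or None if it does not match."""
    m = SITE_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["client_ips"] = [ip.strip() for ip in rec["client_ips"].split(",")]
    # The request line has three subfields: method, resource and protocol.
    method, resource, protocol = rec["request"].split(" ", 2)
    rec.update(method=method, resource=resource, protocol=protocol)
    rec["status"] = int(rec["status"])
    rec["bytes"] = int(rec["bytes"])
    return rec

line = ('10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] '
        '"GET /P/css/homestyle.css HTTP/1.0" 304 26')
rec = parse_site_log(line)
print(rec["client_ips"], rec["resource"], rec["status"], rec["bytes"])
```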

Data
• Data is obtained from the site in the form of click stream
records. Each record consists of the details of clicks by the
visitors and each record contains the following details:
– Server IP
– Client IP
– Time stamp with Date
– Status: HTTP Status code
– URL requested: has three subfields, namely the request method,
the resource requested and the protocol used
– No. of bytes transferred
• The country of origin for a specific request is identified using
the IP address.

Data
• URL is used to identify the information/web page browsed by the
visitors.
• Time stamp of each click is used to sequence the movement of the
visitors across different pages in the website.
• Identifying a unique user session is an important step in the analysis
of click stream data. Inactivity for more than 30 minutes is
considered as a break of session.
• This is an approximation since there could be multiple users
accessing from the same IP, or the same user accessing from
different IPs.
• Since no additional data is available, we treat the hits from each
unique IP as belonging to one unique user, split into sessions by the
30-minute inactivity rule.

No of Sessions
Day      Number of sessions   Number of clicks
Day 1    23,440               460,211
Day 2    22,717               453,977
Day 3    24,694               461,518

DATA PREPARATION
Problem 3: Volume of data
– Volume of data is huge. Performing string handling on this
volume hits performance
Solution:
– Convert data fields into non-string fields: dates as dates,
numbers as numbers etc.
– Remove unnecessary data (server IP)
– Process data in batches of 100,000 records
– Database tuning, indexing and query tuning required
– Over 1500 lines of code written
– Processing still required more than 24 hours of run time
Day          Number of records
24-May-08    6,285,949
25-May-08    6,061,424
26-May-08    6,298,494

DATA PREPARATION
Analyzing information in the clickstream.
10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
Field: Description
• Server IP: IP address of the server
– example: 10.208.65.96
• Client IP: IP address of the client
– example: 10.232.65.96, 10.232.49.1, 203.126.136.220
• Date Time: Date and time of click (server date time)
– example: [23/May/2008:00:00:00 -0400]
• URL requested: Request line exactly as it came from the client. It has 3 subfields: the request method, the resource requested and the protocol used
– example: GET /profile/mainmenu.php?checksum= HTTP/1.1
– Request method: GET
– Resource: /profile/mainmenu.php?checksum=
– Protocol: HTTP/1.1
• Status: The HTTP status code returned to the client
– example: 200
• bytes: The content-length of the document transferred
– example: 3329

Data Preparation
• Getting additional information
– IP addresses allocation by country
– Website mapping (identifying key actions on the website)
– Identifying visitors, registered users and paid users through
the actions performed on the website
• Data transformation
– Extract client IP address
– Represent time as number of seconds past midnight
– Extract web action from the URL string
– Day of the week
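The transformation steps might look like the sketch below. How the study derived the web action is not stated, so taking the script name without its directory or extension is an assumption; the day of the week is returned as a number (Monday = 0).

```python
from datetime import datetime
from urllib.parse import urlsplit
import posixpath

def transform(timestamp, resource):
    """Derive seconds past midnight, day of week and web action for one click."""
    dt = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    # Assumption: the web action is the script name without directory or extension.
    script = posixpath.basename(urlsplit(resource).path)
    action = posixpath.splitext(script)[0]
    return {"seconds": seconds, "weekday": dt.weekday(), "action": action}

row = transform("23/May/2008:00:10:05 -0400", "/profile/mainmenu.php?checksum=")
print(row)
```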

Website Tagging
Mem_comparison

Data Preparation
• Session Identification
– Each unique client IP address is considered as a unique user
– A break of more than 30 minutes between clicks is considered
as the end of one session
– Clicks in a session are ordered by the time of occurrence
• Session Sampling
– Data volume is huge, need to select sample sessions for further
analysis
– Sessions having between 50 and 100 clicks are selected for
further analysis
– Only those records that relate to a specific user action are
retained, remaining records are discarded.

DATA PREPARATION
10.208.65.96 61.1.81.153 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.1" 304 26
10.208.65.96 68.197.236.117 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum=3590208069017f9d75933dfa9ac9005d|i|537f26ca181f05c308393257397ab261i2810388 HTTP/1.1" 200 3333
10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26
10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
Day              Number of sessions   Number of clicks
24th May 2008    23440                460211
25th May 2008    22717                453977
26th May 2008    24694                461518
Day          Number of records
24-May-08    6285949
25-May-08    6061424
26-May-08    6298494

DATA PREPARATION
Preparing data for Associations
Preparing data for Sequencing

DATA PREPARATION
Learnings:
• Clickstream data should be processed at runtime or at least on a
daily basis. Processing this data in batches is not efficient
• Have a mechanism to capture the user ID of the person logged on. This is
very important information that is missing in the clickstream data

Bouncers and Serious Users
[Chart: horizontal axis Clicks per IP, with bins 5, 20, 50, 100, 200, 500, 1000, More; vertical axis 0 to 35,000]

Number of clicks by hour of day
[Chart: horizontal axis hour of day, 1 to 23; vertical axis Number of clicks, 0 to 500,000]

Countries By Hour
[Chart: horizontal axis hour, 1 to 23; vertical axis 0 to 500,000; series: US, UK, Singapore, NZ, NULL, India, Europe, Australia, Asia Pacific]

Exit Points
Last action performed in a session
[Chart: vertical axis 0 to 7,000; actions shown: logout, viewprofile, contact_hit_try, photocheck, index, mm_showmsg, top_search_band, mainmenu, contacts_made_received, search_clustering, search, single_contact_aj, login, mem_comparison, simprofile_search]

Web Diagram – Freq ≥ 19,000

Web Diagram – Freq ≥ 1,000

Associations
Consequent     Antecedents (1 to 4)                                          Support %   Confidence %
Payment = T    Photorequest=T, memcomp=T                                     100         73.1
Payment = T    Country = India, Photorequest=T, memcomp=T                    80          73
Payment = T    Login=T, Photorequest=T, memcomp=T                            60          73
Payment = T    ViewProfile=T, Photorequest=T, memcomp=T                      90          72.8
[...]          [...]=T, Login=T, Photorequest=T, memcomp=T                   60          72.5
Payment = T    Country = India, ViewProfile=T, Photorequest=T, memcomp=T     70          71.4
Payment = T    Mmshowmsg = T, Photorequest=T, memcomp=T                      50          67.2
[...]          [...]=T, Mmshowmsg = T, Photorequest=T, memcomp=T             50          66.4
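The two measures in the table can be computed as sketched below. The sketch assumes that "support" means the share of sessions containing all antecedents, which is what the figures suggest (a support of 100 alongside a confidence of 73.1), rather than the share containing both sides of the rule. The four sessions are toy data, not the study's.

```python
def rule_metrics(sessions, antecedent, consequent):
    """sessions: list of sets of actions; antecedent and consequent: sets of actions.

    Returns (support %, confidence %), with support measured on the antecedent.
    """
    n_antecedent = sum(1 for s in sessions if antecedent <= s)
    n_both = sum(1 for s in sessions if (antecedent | consequent) <= s)
    support = 100.0 * n_antecedent / len(sessions)
    confidence = 100.0 * n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Toy sessions, each reduced to the set of actions that occurred in it.
sessions = [
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp"},
    {"photorequest", "memcomp", "login", "payment"},
]
base_rule = rule_metrics(sessions, {"photorequest", "memcomp"}, {"payment"})
login_rule = rule_metrics(sessions, {"login", "photorequest", "memcomp"}, {"payment"})
print(base_rule, login_rule)
```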

Summary and Conclusions
• Usage of the website by time of the day.
– This will help busy hour identification, and provide
information of the server capacity required for the
website, and when maintenance window can be
scheduled.
• Usage of website from different geographic location.
– This can provide the data of the distribution of users across
geographical locations
• Exit screens
– Provide information on where the users exit from the
website. This input can help redesign the website by showing
which pages are breaking the flow of the user session.

Summary and Conclusions
• Most accessed and least accessed pages
– This can be used for variable pricing of advertisements on the
web page. This can also be used for better user interface
design and space utilization, by removing or repositioning the
links that are infrequently accessed.
• Associations
– Provide information on unique actions on the website and the
sequence in which the user has performed these actions. This
can be used in better user interface design.
• Web diagrams
– Give information on co-occurrence of actions on the webpage
and their significance; also provide inputs on user interface
design.

QUESTIONS?
