WEB USAGE MINING
12/3/2018
Web Usage Mining
• Mining the behavior of human users
• Understand the customers
• Track the behavior and make
recommendations
• Customize the appearance
• Based on Click stream analysis
– This is the lowest level of data
– Needs to be aggregated to Session level data
12/3/2018 Professor V. Nagadevara
Web Usage Mining
• Analyze Click-stream Data
– From client or server point of view
• Used for
– Personalization
– Determine frequent access usage
– For caching
– Improve sales and advertisement
Sources of Data
• Web server log files
• Page tags
• Cookies
Types of click-stream data
• Site centric
– Server log files of a website
– Information on behavior within the website
– Information on cookie ID and IP address
– Lacks information regarding activity on other sites
(competing sites?)
Web Server Log Files
• Also called click stream data
• The log format is configured on the server.
There are four general formats:
– NCSA Common Log (Access Log format)
– NCSA Combined Log
– NCSA Separate Log
– W3C Extended Log
NCSA Common Log
• Includes the client IP address, client identifier,
visitor username, date and time, HTTP
request, status code for the request, and the
number of bytes transferred
• 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043
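A minimal sketch of how a line in this format could be split into its fields. The function name `parse_common_log` and the dictionary keys are illustrative choices, not part of the NCSA format itself.

```python
import re
from datetime import datetime

# One Common Log line: host ident user [time] "request" status bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common_log(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMMON_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Timestamp layout: day/month/year:hour:minute:second zone
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # The request has three subfields: method, resource, protocol.
    method, _, rest = rec["request"].partition(" ")
    resource, _, protocol = rest.rpartition(" ")
    rec.update(method=method, resource=resource, protocol=protocol)
    return rec

sample = ('172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] '
          '"GET /index.html HTTP/1.0" 200 1043')
record = parse_common_log(sample)
```

The Combined Log could be handled the same way by appending three more quoted groups (referrer, user agent, cookie) to the pattern.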
NCSA Combined Log
• Common log plus
– the referring URL, the visitor’s Web browser and
operating system information, and the cookie
• 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043 "http://www.dataminingresources.blogspot.com" "Mozilla/4.05 [en] (WinNT; I)" "USERID=CustomerA; IMPID=01234"
NCSA Separate Log
• Same information as the combined log, but in
three separate files: the access log, the
referral log, and the agent log
• Common Log: 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043
• Referral Log: [18/Dec/2013:11:25:15 +0530] "http://www.dataminingresources.blogspot.com/"
• Agent Log: [18/Dec/2013:11:25:15 +0530] "Microsoft Internet Explorer - 7.0"
W3C Extended Log
• Provides better control and manipulation of data
while producing a log file readable by most Web
analytics tools
• Example:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2009-05-24 20:18:01
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referer)
2009-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+7.01;+Windows+2000+Server) http://54.114.24.224/
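Because the `#Fields` directive names the columns, a reader for this format does not need a fixed layout. The sketch below, with the hypothetical helper name `parse_w3c_extended`, assumes space-separated values with no embedded spaces, as in the example above.

```python
def parse_w3c_extended(lines):
    """Yield one dict per entry, using the latest #Fields directive as header."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives look like "#Name: value"; only #Fields affects parsing.
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()
        # Skip entries that do not line up with the declared fields.
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))

log = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "#Date: 2009-05-24 20:18:01",
    "#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referer)",
    "2009-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - "
    "200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+7.01;+Windows+2000+Server) "
    "http://54.114.24.224/",
]
entries = list(parse_w3c_extended(log))
```

Customized fields added to the `#Fields` line would show up as extra dictionary keys without any change to the code.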
W3C Extended Log
• Can be extended with customized fields
• Example:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2002-05-24 20:18:01
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referer)
2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) http://64.224.24.114/
Page Tags
• This is client-side data collection
• Tags (JavaScript snippets) are added to web pages
• When web pages are downloaded, the “tags” are
also downloaded
• These tags are then “executed” and info is sent to a
data center by sending a request for a small file,
appending a long query to the request – called a “Web
Bug”
• The data center parses the query and sends the file,
completing the transaction
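The round trip described above can be sketched in a few lines. The collector URL and the parameter names (`page`, `ref`, `vid`) are hypothetical, chosen only for illustration; real tagging vendors define their own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collection endpoint, for illustration only.
COLLECTOR = "https://collector.example.com/pixel.gif"

def build_beacon_url(page, referrer, visitor_id):
    """What the tag does in the browser: append collected values as a query."""
    query = urlencode({"page": page, "ref": referrer, "vid": visitor_id})
    return f"{COLLECTOR}?{query}"

def parse_beacon_url(url):
    """What the data center does: recover the values from the request URL."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in params.items()}

url = build_beacon_url("/index.html", "http://www.example.org/", "CustomerA")
hit = parse_beacon_url(url)
```

In practice the tag itself would be JavaScript running in the browser; Python is used here only to show the shape of the request and how the query is decoded.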
Page Tags
• Tags can be customized
• Variables can be pre-determined and pre-
formatted
• Cookies can be dropped for unique identification
• Data can be parsed automatically
• More accurate because client-side. Crawlers
don’t really render pages!
• Data can be reported/analyzed in real time
Page Tags
• Issues
– Dependence on JavaScript
– Adding tags to each page (Manual is very difficult)
– Adds “weight” to pages.
– Errors on pages or failed downloads
– Vendors do not like individual customization
– Ownership of data is an issue
– Privacy issues
Cookies
• Used for identifying the uniqueness of the
user
• Can be deleted or prevented
• First-party cookies are dropped (served) directly
from the website
• Third-party cookies are served from another
domain. These can “observe” the user’s
behavior across multiple domains
Primary Groups of Data
• Usage data
• Content data
• Structure data
• User Data
Usage Data
• “Page View” is the most basic level
– “Aggregate representation of a collection of web
objects contributing to the display on a user’s
browser resulting from a single user action (click)”
– It is a collection of web objects or resources
representing a specific user event
– Eg. Reading an article, viewing a product list,
viewing a detailed list, adding an item to the cart
Usage Data
• Session
– “A session is a sequence of page views by a single
user during a single visit”
– We normally select a subset of page views that are
significant or relevant for the analysis
Content Data
• “Collection of objects and relationships that is
conveyed to the user”
• Consist of static pages, multimedia files,
dynamic page segments, records from
operational databases etc.
• Also include conceptual hierarchies such as
product categories
Structure Data
• “Represents the designer’s view of the content
organization”
• Captured by the linkage structure
between pages
• Reflected by hyperlinks
User Data
• Information regarding user profile
• Demographic information on registered users
• Past purchases
• Reviews and ratings
• Visit histories
• Anonymous information collected by cookies
Data Pre-processing
• Data Fusion and Cleaning
• Page View identification
• User identification
• Sessionization
• Path Completion
• Data Integration
Data Fusion and Cleaning
• Data is drawn from multiple web or application servers
• Data fusion is merging log files from different servers
• Cleaning involves removal of unnecessary data from log
files
• Removal of crawler navigation (by crawler name or by
heuristics)
• “Keynote”, a performance monitoring system, accessed the source site
for KDD Cup 2000 three times per minute, all day, every day!
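A minimal sketch of crawler removal by name and by heuristics. The name list, the `/robots.txt` check, the rate threshold, and the record keys (`client_ip`, `user_agent`, `resource`) are all assumptions made for illustration; a real list would be maintained from observed traffic.

```python
# Illustrative substrings only, matched case-insensitively in the user agent.
CRAWLER_NAMES = ("bot", "crawler", "spider", "keynote")

def is_crawler(record, hits_per_minute, max_hits_per_minute=60):
    """Flag a record by crawler name, or by simple behavioral heuristics."""
    agent = record.get("user_agent", "").lower()
    if any(name in agent for name in CRAWLER_NAMES):
        return True
    # Heuristic 1: ordinary browsers rarely request robots.txt.
    if record.get("resource") == "/robots.txt":
        return True
    # Heuristic 2: an implausibly high request rate from one address.
    return hits_per_minute.get(record["client_ip"], 0) > max_hits_per_minute

def clean(records, hits_per_minute):
    """Keep only the records that do not look like crawler traffic."""
    return [r for r in records if not is_crawler(r, hits_per_minute)]

records = [
    {"client_ip": "10.0.0.1", "user_agent": "Mozilla/4.0", "resource": "/index.html"},
    {"client_ip": "10.0.0.2", "user_agent": "ExampleBot/1.0", "resource": "/index.html"},
    {"client_ip": "10.0.0.3", "user_agent": "Mozilla/4.0", "resource": "/robots.txt"},
    {"client_ip": "10.0.0.4", "user_agent": "Mozilla/4.0", "resource": "/a"},
]
rates = {"10.0.0.4": 500}  # observed requests per minute, per client IP
kept = clean(records, rates)
```

A slow but regular monitor, such as one requesting a page three times per minute, would pass the rate heuristic and has to be caught by name.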
Page View Identification
• Requires understanding of the structure of the
site, page contents, site domain knowledge
• Can be a single file (one-to-one
correspondence with page view)
• Can be a collection of objects, or dynamically
constructed page
• Can be hierarchical list (eg. Information pages,
product views, registration, shopping cart
changes, payment etc.)
User Identification
• Easy if the user has to login
• IP addresses are a convenient identifier but not always accurate (problem with
proxy servers)
• Combination of IP address and browser
• More difficult across different sessions (multiple
machines and multiple users)
• Cookies are a possible option
– Different browsers
– Different computers
– Cookies are deleted!
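The options above can be combined into a fallback chain: use the login when there is one, then a cookie, then IP address plus browser. This is only a sketch; the record keys (`username`, `cookie_id`, `client_ip`, `user_agent`) are assumed names.

```python
from collections import defaultdict

def user_key(record):
    """Prefer a login name, then a cookie ID, then fall back to IP plus browser."""
    if record.get("username"):
        return ("login", record["username"])
    if record.get("cookie_id"):
        return ("cookie", record["cookie_id"])
    return ("ip+agent", record["client_ip"], record.get("user_agent", ""))

def group_by_user(records):
    """Group log records under the best available user key."""
    users = defaultdict(list)
    for rec in records:
        users[user_key(rec)].append(rec)
    return dict(users)

records = [
    {"client_ip": "10.0.0.1", "user_agent": "Mozilla/4.05"},
    {"client_ip": "10.0.0.1", "user_agent": "MSIE 7.0"},
    {"client_ip": "10.0.0.2", "user_agent": "MSIE 7.0", "username": "nagadev"},
    {"client_ip": "10.0.0.3", "user_agent": "Mozilla/4.05", "username": "nagadev"},
]
users = group_by_user(records)
```

Here the two requests from 10.0.0.1 are treated as different users because the browsers differ, while the two logged-in requests are merged even though they come from different machines.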
Sessionization
• Process of identifying the page views
requested by a single user in a single session
• Find all page requests from the same user and
group them using heuristics
• Issue a “session id”
• Modify the URL in the log record to include
session id
• Decide when the session ended!
Sessionization
• Time oriented Heuristics
– Total session duration may not exceed Θ
– Total time on a page may not exceed δ
• Referrer oriented
– A request q is added to the session S if the referrer
for q was previously invoked in S
– Else q is the starting point for a new session
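Both heuristics can be sketched for the requests of a single user. The default thresholds (Θ = 1800 s, δ = 600 s) are illustrative values, and attaching q to the most recent matching session is an assumption for the case where several open sessions contain the referrer.

```python
def sessionize_by_time(requests, theta=1800, delta=600):
    """Split one user's requests into sessions using the two time limits.

    requests: list of (timestamp_seconds, url), sorted by timestamp.
    theta: maximum total session duration; delta: maximum time on one page.
    """
    sessions = []
    for ts, url in requests:
        current = sessions[-1] if sessions else None
        if (current is None
                or ts - current[0][0] > theta      # session would run too long
                or ts - current[-1][0] > delta):   # too long on the previous page
            sessions.append([(ts, url)])
        else:
            current.append((ts, url))
    return sessions

def sessionize_by_referrer(requests):
    """requests: list of (url, referrer) for one user, in time order."""
    sessions = []
    for url, referrer in requests:
        # Add q to the most recent session in which its referrer was invoked.
        for session in reversed(sessions):
            if referrer in session:
                session.append(url)
                break
        else:
            sessions.append([url])  # otherwise q starts a new session
    return sessions

timed = sessionize_by_time(
    [(0, "/a"), (100, "/b"), (900, "/c"), (1000, "/d"), (2000, "/e")])
linked = sessionize_by_referrer(
    [("/a", None), ("/b", "/a"), ("/x", "http://other/"), ("/c", "/b"), ("/y", "/x")])
```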
Sessionization
• Episode
– A subset of relevant page views in a session
– Comprising functionally or semantically related
page views
– Requires classification of page views into
functional or concept categories
Path Completion
• The paths are incomplete
– Caching leads to missing entries
– Caching by proxy servers
– Back button creates missing links
• Session log contains time stamps which can be
mined
– Missing pages do not have time stamps
– Dynamic pages are unique and not cached!
• Requires knowledge of the site structure and referrer
information
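One simple way to use referrer information for path completion is sketched below. It assumes that when a request's referrer is an earlier page rather than the page currently in view, the user reached it with the back button through cached pages. The function name and the (url, referrer) input shape are illustrative.

```python
def complete_path(requests):
    """Rebuild a session path, re-inserting pages revisited from cache.

    requests: list of (url, referrer) for one session, in time order.
    The re-inserted pages have no time stamps of their own.
    """
    path = []
    for url, referrer in requests:
        if path and referrer is not None and referrer != path[-1] and referrer in path:
            # Index of the most recent visit to the referrer.
            idx = len(path) - 1 - path[::-1].index(referrer)
            # Walk back from the current page to the referrer.
            path.extend(reversed(path[idx:-1]))
        path.append(url)
    return path

# Logged requests: A, B (from A), C (from B), then D with referrer A.
completed = complete_path([("A", None), ("B", "A"), ("C", "B"), ("D", "A")])
```

For this input the completed path is A, B, C, B, A, D: the two backward steps never reached the server, so they are inferred rather than observed.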
Data Integration
• Pre-processing results in a set of sessions or episodes
• Other data (demographics, ratings, past purchases
etc.) needs to be integrated to lead to WA/BI metrics
such as customer conversion ratios, lifetime value
• Additional data – shopping cart changes, shipping and
address info, click throughs, impressions
• The transactional database is extracted into data marts
or OLAP cubes after a certain amount of aggregation
Modeling
• Statistical Analysis
– Aggregated by pre-determined units (days, sessions,
visitors etc.)
– Most frequent pages, average view time, length of
path, entry and exit etc.
– Referrers, user agents, requested resources
– Usually presented in bar charts, tables and
comparative tables
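A few of these statistics can be computed directly once sessions are available. The sketch below assumes each session is a non-empty list of page names in click order; the function name `summarize` and the sample sessions are made up for illustration.

```python
from collections import Counter

def summarize(sessions):
    """sessions: list of non-empty lists of page names, each in click order."""
    page_counts = Counter(page for s in sessions for page in s)
    return {
        "top_pages": page_counts.most_common(3),            # most frequent pages
        "entry_pages": Counter(s[0] for s in sessions),     # first page per session
        "exit_pages": Counter(s[-1] for s in sessions),     # last page per session
        "avg_path_length": sum(len(s) for s in sessions) / len(sessions),
    }

sessions = [
    ["index", "search", "viewprofile", "logout"],
    ["index", "login", "search"],
    ["search", "viewprofile"],
]
stats = summarize(sessions)
```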
Modeling
• Segmentation – use cluster analysis
• Associations and correlation analysis
• Frequent item-set graph
• Sequential and navigational patterns
• Predictive analytics using classification
techniques
Prof. Vishnuprasad Nagadevara
Indian Institute of Management Bangalore
Information from Web Analytics
• How many visitors visit the site daily?
• Who are the regular visitors?
• What percentage of the visitors are registered users?
• What are the top pages visited on the website?
• What is the average visit time on the website?
• How often does the visitor return to the site?
• What is the average page depth of a visitor?
• What is the geographic distribution of users of the website?
Web Analytics
• Personalization
• System Improvement
• Site Modification
• Business Intelligence
• Usage characteristics
Objectives of the Study
• The objectives of this study are to
– Explore Web analytics and its usefulness to web
based business.
– Identify the techniques used in click stream
analysis.
– Identify the application of click stream analysis
through analyzing click stream data obtained from
a particular website using appropriate click stream
analysis techniques.
Methodology
• This study analyzes the click stream data obtained from a web site, which
specializes in an online information exchange service to facilitate
identification of suitable partners, in India and other countries.
• The site has a very different revenue model. The visitors are allowed to
browse through the site without any initial payment. The visitors are
allowed to look at the profiles of prospective partners free of charge. The
visitors will have to become members by making a one-time payment only
when they need to contact the prospective brides or grooms.
• Users can search for profiles through advanced search options on the site
on various preferences ranging from basic details of preferred partner to
lifestyle, career, education, profession etc.
Methodology
• Members can make initial contact with each other through services
available via Chat, SMS, and e-mail.
• Users can register on the website for free and are assured of
privacy and confidentiality. The website allows users to
create their profiles, search for other profiles, express interest in
other profiles, and contact others. Registration and profile creation are free
of charge.
• Registered users can become paid members that will allow them to
contact others, view contact details of other members, write personalized
messages, initiate chats and let other members view their contact details.
Paid memberships are provided for a specified duration.
Methodology
• The click stream data is analyzed to identify different
paths taken by the visitors and the sequence of
pages that lead to payment of membership fee.
Based on this analysis, specific strategies are
recommended to maximize the revenue for the
website.
DATA PREPARATION
Problem: Format of data
– Clickstream data files are neither delimited nor fixed-length files
Solution:
– Used the date in the clickstream as the delimiter to import data into the database
– String handling in the database is needed to separate out the fields
Sample records:
10.208.65.96 172.16.8.37, 124.124.35.130 - - [23/May/2008:00:00:00 -0400] "GET /billing/billing.php?user=&cid=22401528da14a61c43512fa025b59578i353273 HTTP/1.0" 200 1832
10.208.65.96 68.126.193.219 - - [23/May/2008:00:00:00 -0400] "GET /profile/js/common.js HTTP/1.1" 200 12462
10.208.65.96 59.95.71.32 - - [23/May/2008:00:00:00 -0400] "GET /P/css/comm_style.css HTTP/1.1" 200 2640
10.208.65.96 122.163.70.145 - - [23/May/2008:00:00:00 -0400] "GET /P/search.php?checksum=&searchchecksum=16465054&j=300&newsearch=&inf_checksum=&castemapping=&crmback=&searchorder=T&label_select_no=&savesearch=&from_index=&viewall=&save_search_redirect=&hide_search_bar=y HTTP/1.1" 200 21561
10.208.65.96 61.1.81.153 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.1" 304 26
10.208.65.96 68.197.236.117 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum=3590208069017f9d75933dfa9ac9005d|i|537f26ca181f05c308393257397ab261i2810388 HTTP/1.1" 200 3333
10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26
10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
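The study did this splitting inside the database. The same idea, using the bracketed date as the anchor that separates the IP fields from the request, can be sketched in Python as follows; the function name and dictionary keys are illustrative and this is not the authors' code.

```python
import re
from datetime import datetime

# The bracketed timestamp is the one reliably recognisable token in each record.
DATE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")
REQUEST = re.compile(r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s+(?P<bytes>\d+)')

def split_record(line):
    """Split one clickstream record around its date; None if it cannot be parsed."""
    m = DATE.search(line)
    if m is None:
        return None
    head = line[:m.start()].split()       # server IP, client IP chain, "-" fillers
    tail = REQUEST.search(line[m.end():])  # request line, status, bytes
    if not head or tail is None:
        return None
    return {
        "server_ip": head[0],
        "client_ips": [tok.rstrip(",") for tok in head[1:] if tok != "-"],
        "time": datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z"),
        "request": tail.group("request"),
        "status": int(tail.group("status")),
        "bytes": int(tail.group("bytes")),
    }

line = ('10.208.65.96 172.16.25.60, 59.145.189.43 - - '
        '[23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26')
rec = split_record(line)
```

Converting the timestamp, status, and byte count to native types at this point also avoids repeated string handling later.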
Data
• Data is obtained from the site in the form of click stream
records. Each record consists of the details of clicks by the
visitors and each record contains the following details:
– Server IP
– Client IP
– Time stamp with Date
– Status: HTTP Status code
– URL requested: has three subfields, namely the request method,
the resource requested, and the protocol used
– No. of bytes transferred
• The country of origin for a specific request is identified using
the IP address.
Data
• URL is used to identify the information/web page browsed by the
visitors.
• Time stamp of each click is used to sequence the movement of the
visitors across different pages in the website.
• Identifying a unique user session is an important step in the analysis
of click stream data. Inactivity for more than 30 minutes is
considered as a break of session.
• This is an approximation since there could be multiple users
accessing from the same IP, or the same user accessing from
different IPs.
• Because no further identifying data is available, hits from each
unique IP are treated as belonging to one unique user within a session.
No. of Sessions
Day     Number of sessions   Number of clicks
Day 1   23,440               460,211
Day 2   22,717               453,977
Day 3   24,694               461,518
DATA PREPARATION
Problem 3: Volume of data
– Volume of data is huge. Performing string handling on this
volume hurts performance
Solution:
– Convert data fields into non-string fields: dates as dates,
numbers as numbers, etc.
– Remove unnecessary data (server IP)
– Process data in batches of 100,000 records
– Database tuning, indexing, and query tuning required
– Over 1,500 lines of code written
– Processing still required more than 24 hours of run time
Day         Number of records
24-May-08   6,285,949
25-May-08   6,061,424
26-May-08   6,298,494
DATA PREPARATION
Analyzing information in the clickstream.
10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET
/profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
Field descriptions:
– Server IP: IP address of the server. Example: 10.208.65.96
– Client IP: IP address of the client. Example: 10.232.65.96, 10.232.49.1, 203.126.136.220
– Date Time: Date and time of click (server date time). Example: [23/May/2008:00:00:00 -0400]
– URL requested: Request line exactly as it came from the client. It has 3 subfields: the request method, the resource requested, and the protocol used. Example: GET /profile/mainmenu.php?checksum= HTTP/1.1
  Request method: GET
  Resource: /profile/mainmenu.php?checksum=
  Protocol: HTTP/1.1
– Status: The HTTP status code returned to the client. Example: 200
– Bytes: The content length of the document transferred. Example: 3329
Data Preparation
• Getting additional information
– IP addresses allocation by country
– Website mapping (identifying key actions on the website)
– Identifying visitors, registered users and paid users through
the actions performed on the website
• Data transformation
– Extract client IP address
– Represent time as number of seconds past midnight
– Extract web action from the URL string
– Day of the week
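The four transformations can be sketched as below. The slides do not say which address in a multi-address client field is taken as the client, so choosing the first public address is an assumption, as are the function names and output keys.

```python
import ipaddress
import posixpath
from datetime import datetime
from urllib.parse import urlsplit

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def pick_client_ip(ip_chain):
    """Assumption: the first public address is the visitor; else the first entry."""
    candidates = [ip.strip() for ip in ip_chain.split(",") if ip.strip()]
    for ip in candidates:
        if not ipaddress.ip_address(ip).is_private:
            return ip
    return candidates[0]

def web_action(resource):
    """Take the script name as the action: '/profile/mainmenu.php?x=' -> 'mainmenu'."""
    path = urlsplit(resource).path
    return posixpath.splitext(posixpath.basename(path))[0]

def transform(ip_chain, timestamp, resource):
    """Derive the analysis fields from one parsed clickstream record."""
    return {
        "client_ip": pick_client_ip(ip_chain),
        "seconds_past_midnight": (timestamp.hour * 3600
                                  + timestamp.minute * 60
                                  + timestamp.second),
        "action": web_action(resource),
        "day_of_week": DAYS[timestamp.weekday()],
    }

row = transform("10.232.65.96, 10.232.49.1, 203.126.136.220",
                datetime(2008, 5, 23, 13, 5, 7),
                "/profile/mainmenu.php?checksum=")
```

Mapping script names to meaningful actions (payment, photo request, and so on) would still need the website mapping mentioned above.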
Website Tagging
[Screenshots: site pages tagged with action names, including payment and Mem_comparison]
Data Preparation
• Session Identification
– Each unique client IP address is considered as a unique user
– A break of more than 30 minutes between clicks is considered
as the end of one session
– Clicks in a session are ordered by the time of occurrence
• Session Sampling
– Data volume is huge, need to select sample sessions for further
analysis
– Sessions having between 50 and 100 clicks are selected for
further analysis
– Only those records that relate to a specific user action are
retained, remaining records are discarded.
DATA PREPARATION
Day             Number of sessions   Number of clicks
24th May 2008   23,440               460,211
25th May 2008   22,717               453,977
26th May 2008   24,694               461,518

Day         Number of records
24-May-08   6,285,949
25-May-08   6,061,424
26-May-08   6,298,494
DATA PREPARATION
Preparing data for Associations
Preparing data for Sequencing
DATA PREPARATION
Learnings:
– Clickstream data should be processed at run time, or at least on a
daily basis. Processing this data in batches is not efficient
– Have a mechanism to capture the user ID of the person logged on. This is
very important information that is missing from the clickstream data
[Figure: Bouncers and Serious Users. Distribution of clicks per IP; x-axis bins: 5, 20, 50, 100, 200, 500, 1000, More; y-axis scale 0 to 35,000]
[Figure: Number of clicks by hour of day. x-axis: hour of day (1 to 23); y-axis: number of clicks (0 to 500,000)]
[Figure: Countries By Hour. x-axis: hour of day (1 to 23); y-axis scale 0 to 500,000; series: US, UK, Singapore, NZ, NULL, India, Europe, Australia, Asia Pacific]
Exit Points
[Figure: Last action performed in a session. y-axis scale 0 to 7,000; actions shown: logout, viewprofile, contact_hit_tryphotocheck, index, mm_showmsg, top_search_band, mainmenu, contacts_made_received, search_clustering, search, single_contact_aj, login, mem_comparison, simprofile_search]
Different Pages Accessed
Web Diagram – Freq ≥ 19,000
Web Diagram – Freq ≥ 1,000
Associations
Consequent    Antecedents                                                 Support %   Confidence %
Payment = T   Photorequest=T, memcomp=T                                   100         73.1
Payment = T   Country=India, Photorequest=T, memcomp=T                    80          73
Payment = T   Login=T, Photorequest=T, memcomp=T                          60          73
Payment = T   ViewProfile=T, Photorequest=T, memcomp=T                    90          72.8
Payment = T   ViewProfile=T, Login=T, Photorequest=T, memcomp=T           60          72.5
Payment = T   Country=India, ViewProfile=T, Photorequest=T, memcomp=T     70          71.4
Payment = T   Mmshowmsg=T, Photorequest=T, memcomp=T                      50          67.2
Payment = T   ViewProfile=T, Mmshowmsg=T, Photorequest=T, memcomp=T       50          66.4
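For reference, the two rule measures can be computed from session data as sketched below. Support is taken here as the share of sessions matching all antecedents, and confidence as the share of those sessions that also contain the consequent; whether the table above uses exactly this definition of support is an assumption. The sample sessions are made up.

```python
def rule_metrics(sessions, antecedents, consequent):
    """Return (support %, confidence %) for the rule antecedents -> consequent.

    sessions: list of sets of actions observed in each session.
    """
    antecedents = set(antecedents)
    matching = [s for s in sessions if antecedents <= s]
    if not sessions or not matching:
        return 0.0, 0.0
    support = 100.0 * len(matching) / len(sessions)
    with_consequent = sum(1 for s in matching if consequent in s)
    confidence = 100.0 * with_consequent / len(matching)
    return support, confidence

sessions = [
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp"},
    {"photorequest", "memcomp", "payment", "login"},
    {"login"},
]
support, confidence = rule_metrics(sessions, {"photorequest", "memcomp"}, "payment")
```

In this toy data three of four sessions match the antecedents and two of those three end in payment.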
Summary and Conclusions
• Usage of the website by time of day.
– This helps identify busy hours, indicates the
server capacity required for the
website, and shows when a maintenance window can be
scheduled.
• Usage of the website from different geographic locations.
– This shows the distribution of users across
geographical locations.
• Exit screens
– Provide information on where users exit from the
website. This input can help redesign the website by
showing which pages are breaking the flow
of the user session.
Summary and Conclusions
• Most accessed and least accessed pages
– This can be used for variable pricing of advertising on the
website. It can also be used for better user interface
design and space utilization, by removing or repositioning
links that are infrequently accessed.
• Associations
– Provide information on unique actions on the website and the
sequence in which the user has performed these actions. This
can be used in better user interface design.
• Web diagrams
– Give information on co-occurrence of actions on the website
and their significance; also provide inputs on user interface
design.
• QUESTIONS?

More Related Content

What's hot

(27.05) MOSSCA Invita - Búsqueda empresarial 2
(27.05) MOSSCA Invita - Búsqueda empresarial 2(27.05) MOSSCA Invita - Búsqueda empresarial 2
(27.05) MOSSCA Invita - Búsqueda empresarial 2
Microsoft Argentina y Uruguay [Official Space]
 
Using sharepoint to solve business problems #spsnairobi2014
Using sharepoint to solve business problems #spsnairobi2014Using sharepoint to solve business problems #spsnairobi2014
Using sharepoint to solve business problems #spsnairobi2014
Amos Wachanga
 
Top ten new ECM features in SharePoint 2013
Top ten new ECM features in SharePoint 2013Top ten new ECM features in SharePoint 2013
Top ten new ECM features in SharePoint 2013
John F. Holliday
 
Enterprise Document Management in SharePoint 2010
Enterprise Document Management in SharePoint 2010Enterprise Document Management in SharePoint 2010
Enterprise Document Management in SharePoint 2010
Agnes Molnar
 
Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013
Chris McNulty
 
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
Jonathan Ralton
 
SPS Philly Architecting a Content Management Solution
SPS Philly Architecting a Content Management SolutionSPS Philly Architecting a Content Management Solution
SPS Philly Architecting a Content Management Solution
Patrick Tucker
 
3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices
puckmiller3
 
Nadee2018
Nadee2018Nadee2018
Nadee2018
SharadPatil81
 
Real world rm in share point 2013
Real world rm in share point 2013Real world rm in share point 2013
Real world rm in share point 2013
C/D/H Technology Consultants
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?
John F. Holliday
 
Web mining
Web miningWeb mining
Web mining
SwarnaLatha177
 
Web Mining
Web MiningWeb Mining
Web Mining
Mudit Dholakia
 
Share point document management
Share point document managementShare point document management
Share point document management
Peter Kettenis
 
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
Bill England
 
SharePoint 2010 for Document Compliance
SharePoint 2010 for Document ComplianceSharePoint 2010 for Document Compliance
SharePoint 2010 for Document Compliance
ntenany
 
6. Sim Fanji database dan manajemen informasi
6. Sim Fanji database dan manajemen informasi 6. Sim Fanji database dan manajemen informasi
6. Sim Fanji database dan manajemen informasi
Yoyo Sudaryo
 

What's hot (17)

(27.05) MOSSCA Invita - Búsqueda empresarial 2
(27.05) MOSSCA Invita - Búsqueda empresarial 2(27.05) MOSSCA Invita - Búsqueda empresarial 2
(27.05) MOSSCA Invita - Búsqueda empresarial 2
 
Using sharepoint to solve business problems #spsnairobi2014
Using sharepoint to solve business problems #spsnairobi2014Using sharepoint to solve business problems #spsnairobi2014
Using sharepoint to solve business problems #spsnairobi2014
 
Top ten new ECM features in SharePoint 2013
Top ten new ECM features in SharePoint 2013Top ten new ECM features in SharePoint 2013
Top ten new ECM features in SharePoint 2013
 
Enterprise Document Management in SharePoint 2010
Enterprise Document Management in SharePoint 2010Enterprise Document Management in SharePoint 2010
Enterprise Document Management in SharePoint 2010
 
Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013
 
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service...
 
SPS Philly Architecting a Content Management Solution
SPS Philly Architecting a Content Management SolutionSPS Philly Architecting a Content Management Solution
SPS Philly Architecting a Content Management Solution
 
3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices3 25 11 Term Store Best Practices
3 25 11 Term Store Best Practices
 
Nadee2018
Nadee2018Nadee2018
Nadee2018
 
Real world rm in share point 2013
Real world rm in share point 2013Real world rm in share point 2013
Real world rm in share point 2013
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?
 
Web mining
Web miningWeb mining
Web mining
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Share point document management
Share point document managementShare point document management
Share point document management
 
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
Aiim Seminar - SharePoint Crossroads May 23 - Bending but Not Breaking - Spea...
 
SharePoint 2010 for Document Compliance
SharePoint 2010 for Document ComplianceSharePoint 2010 for Document Compliance
SharePoint 2010 for Document Compliance
 
6. Sim Fanji database dan manajemen informasi
6. Sim Fanji database dan manajemen informasi 6. Sim Fanji database dan manajemen informasi
6. Sim Fanji database dan manajemen informasi
 

Similar to Web usage mining

Web usage mining
Web usage miningWeb usage mining
Web usage mining
shabnamfsayyad
 
Web mining
Web miningWeb mining
Web mining
Sumit Sony
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage mining
Daminda Herath
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
Daminda Herath
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
DATAVERSITY
 
Feburary 2015 MNSPUG - Administering Your SharePoint Environment
Feburary 2015 MNSPUG - Administering Your SharePoint EnvironmentFeburary 2015 MNSPUG - Administering Your SharePoint Environment
Feburary 2015 MNSPUG - Administering Your SharePoint Environment
Minnesota SharePoint Users Group
 
Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel
geektimecoil
 
Web content mining
Web content miningWeb content mining
Web content mining
Sumit Sony
 
Dealing with Common Data Requirements in Your Enterprise
Dealing with Common Data Requirements in Your EnterpriseDealing with Common Data Requirements in Your Enterprise
Dealing with Common Data Requirements in Your Enterprise
WSO2
 
Pixel tags and tag management
Pixel tags and tag managementPixel tags and tag management
Pixel tags and tag management
Vijay Sankar
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
Mark Smith
 
Splunk Digital Intelligence
Splunk Digital IntelligenceSplunk Digital Intelligence
Splunk Digital Intelligence
Dmitry Anoshin
 
Charlotte SPUG - Planning for MySites and Social in the Enterprise
Charlotte SPUG - Planning for MySites and Social in the EnterpriseCharlotte SPUG - Planning for MySites and Social in the Enterprise
Charlotte SPUG - Planning for MySites and Social in the Enterprise
Michael Oryszak
 
Data Systems Integration & Business Value Pt. 2: Cloud
Data Systems Integration & Business Value Pt. 2: CloudData Systems Integration & Business Value Pt. 2: Cloud
Data Systems Integration & Business Value Pt. 2: Cloud
DATAVERSITY
 
Data Systems Integration & Business Value Pt. 2: Cloud
Data Systems Integration & Business Value Pt. 2: CloudData Systems Integration & Business Value Pt. 2: Cloud
Data Systems Integration & Business Value Pt. 2: Cloud
Data Blueprint
 
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
Marek Maśko
 
Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org
sopekmir
 
Industry Ontologies: Case Studies in Creating and Extending Schema.org for In...
Industry Ontologies: Case Studies in Creating and Extending Schema.org for In...Industry Ontologies: Case Studies in Creating and Extending Schema.org for In...
Industry Ontologies: Case Studies in Creating and Extending Schema.org for In...
MakoLab SA
 
datamarts.ppt
datamarts.pptdatamarts.ppt
datamarts.ppt
bhavyag24
 
Database & Database Users
Database & Database UsersDatabase & Database Users
Database & Database Users
M.Zalmai Rahmani
 




Web usage mining

  • 2. Web Usage Mining • Mining the behavior of human users • Understand the customers • Track the behavior and make recommendations • Customize the appearance • Based on Click stream analysis – This is the lowest level of data – Needs to be aggregated to Session level data 12/3/2018 Professor V. Nagadevara
  • 3. Web Usage Mining • Analyze click-stream data – From client or server point of view • Used for – Personalization – Determining frequent access usage – Caching – Improving sales and advertising
  • 4. Sources of Data • Web server log files • Page tags • Cookies
  • 5. Types of click-stream data • Site centric – Server log files of a website – Information on behavior within the website – Information on cookie ID and IP address – Lacks information regarding activity on other sites (competing sites?)
  • 6. Web Server Log Files • Also called click-stream data • The log files are customized by the server. There are four general formats: – NCSA Common Log (Access Log format), – NCSA Combined Log, – NCSA Separate Log, and – W3C Extended Log
  • 7. NCSA Common Log • Includes the client IP address, client identifier, visitor username, date and time, HTTP request, status code for the request, and the number of bytes transferred • 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043
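A minimal sketch of how the Common Log fields listed above could be pulled apart in Python. The sample line is the slide's example with the dash and quotation marks written as plain ASCII, as a server would emit them; the pattern and the function name `parse_common_log` are illustrative assumptions, not part of the original deck.

```python
import re
from datetime import datetime

# host ident user [time] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_common_log(line):
    """Return a dict of Common Log fields, or None if the line does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the size field means no body was sent.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    parts = rec["request"].split()
    if len(parts) == 3:
        rec["method"], rec["path"], rec["protocol"] = parts
    return rec

record = parse_common_log(
    '172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] '
    '"GET /index.html HTTP/1.0" 200 1043'
)
```

The Combined Log format adds quoted referrer, user-agent and cookie fields after the byte count, so the same pattern could be extended with further quoted groups.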
  • 8. NCSA Combined Log • Common log plus – the referring URL, the visitor’s Web browser and operating system information, and the cookie • 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043 "http://www.dataminingresources.blogspot.com" "Mozilla/4.05 [en] (WinNT; I)" "USERID=CustomerA; IMPID=01234"
  • 9. NCSA Separate Log • Same information as the combined log, but in three separate files: the access log, the referral log, and the agent log • Common Log: 172.21.100.30 - nagadev [18/Dec/2013:11:25:15 +0530] "GET /index.html HTTP/1.0" 200 1043 • Referral Log: [18/Dec/2013:11:25:15 +0530] "http://www.dataminingresources.blogspot.com/" • Agent Log: [18/Dec/2013:11:25:15 +0530] "Microsoft Internet Explorer - 7.0"
  • 10. W3C Extended Log • Provides better control and manipulation of data while producing a log file readable by most Web analytics tools • #Software: Microsoft Internet Information Services 6.0 • #Version: 1.0 • #Date: 2009-05-24 20:18:01 • #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer) • 2009-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+7.01;+Windows+2000+Server) http://54.114.24.224/
  • 11. W3C Extended Log • Can be extended with customized fields • #Software: Microsoft Internet Information Services 6.0 • #Version: 1.0 • #Date: 2002-05-24 20:18:01 • #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer) • 2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) http://64.224.24.114/
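Because a W3C Extended log declares its own columns in the `#Fields` directive, a reader can be written without hard-coding the layout. The sketch below is one possible approach, using the slide's sample entry with the wrapped tokens rejoined; the function name `parse_w3c_extended` is an assumption made for illustration.

```python
def parse_w3c_extended(lines):
    """Yield one dict per data line, keyed by the latest #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Only the #Fields directive changes how data lines are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Skip lines that do not match the declared column count.
        if len(values) == len(fields):
            yield dict(zip(fields, values))

sample = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "#Date: 2002-05-24 20:18:01",
    "#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)",
    "2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - "
    "200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) "
    "http://64.224.24.114/",
]
rows = list(parse_w3c_extended(sample))
```

Splitting on whitespace works here because the format replaces spaces inside a value with `+`, as the user-agent string in the sample shows.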
  • 12. Page Tags • This is client-side data collection • Tags (JavaScript snippets) are added to web pages • When web pages are downloaded, the “tags” are also downloaded • These tags are then “executed” and information is sent to a data center by requesting a small file and appending a long query string to the request; this is called a “Web Bug” • The data center parses the query and sends the file, completing the transaction
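To illustrate the data-center side of a page tag, the sketch below parses the query string attached to a web-bug request. The collector host, the file name `pixel.gif` and the variable names `page`, `ref`, `uid` and `res` are all hypothetical; real tagging vendors define their own parameters.

```python
from urllib.parse import urlsplit, parse_qs

def parse_tag_request(url):
    """Split a page-tag request into the requested file and its query variables."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps variables the tag sent with an empty value.
    raw = parse_qs(parts.query, keep_blank_values=True)
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in raw.items()}
    return parts.path, params

# Hypothetical request issued by a tag on /index.html
path, params = parse_tag_request(
    "https://collector.example.com/pixel.gif"
    "?page=%2Findex.html&ref=&uid=CustomerA&res=1280x1024"
)
```

After parsing, the data center would store `params` and answer with the small file named in `path`, which completes the transaction described on the slide.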
  • 13. Page Tags • Tags can be customized • Variables can be pre-determined and pre-formatted • Cookies can be dropped for unique identification • Data can be parsed automatically • More accurate because it is client-side: crawlers do not really render pages! • Data can be reported/analyzed in real time
  • 14. Page Tags • Issues – Dependence on JavaScript – Adding tags to each page (doing it manually is very difficult) – Adds “weight” to pages – Errors on pages or failed downloads – Vendors do not like individual customization – Ownership of data is an issue – Privacy issues
  • 15. Cookies • Used for identifying the uniqueness of the user • Can be deleted or blocked • A first-party cookie is dropped (served) directly from the website • Third-party cookies are served from another domain – E.g., these can “observe” the user’s behavior across multiple domains
  • 16. Primary Groups of Data • Usage data • Content data • Structure data • User data
  • 17. Usage Data • “Page View” is the most basic level – “Aggregate representation of a collection of web objects contributing to the display on a user’s browser resulting from a single user action (click)” – It is a collection of web objects or resources representing a specific user event – E.g., reading an article, viewing a product list, viewing a detailed list, adding an item to the cart
  • 18. Usage Data • Session – “A session is a sequence of page views by a single user during a single visit” – We normally select a subset of page views that are significant or relevant for the analysis
  • 19. Content Data • “Collection of objects and relationships that is conveyed to the user” • Consists of static pages, multimedia files, dynamic page segments, records from operational databases etc. • Also includes conceptual hierarchies such as product categories
  • 20. Structure Data • “Represents designer’s view of the content organization” • Captured by the inter-page linkage structure • Reflected by hyperlinks
  • 21. User Data • Information regarding user profile • Demographic information on registered users • Past purchases • Reviews and ratings • Visit histories • Anonymous information collected by cookies
  • 22. Data Pre-processing • Data Fusion and Cleaning • Page View identification • User identification • Sessionization • Path Completion • Data Integration
  • 23. Data Fusion and Cleaning • Data is drawn from multiple web or application servers • Data fusion is the merging of log files from different servers • Cleaning involves removal of unnecessary data from log files • Removal of crawler navigation, by crawler name or by heuristics • “Keynote”, a performance monitoring system, accessed the source site for KDD Cup 2000 three times per minute, all day, every day!
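The fusion and cleaning steps can be sketched as a time-ordered merge followed by a filter. Everything specific here is an assumption for illustration: the record layout, the marker substrings, the agent string `Keynote-Monitor`, and the heuristic that a client which fetched `/robots.txt` is a crawler.

```python
import heapq

# Assumed substrings that identify a crawler or monitor by name.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "keynote")

def fuse_logs(*server_logs):
    """Merge per-server record lists (each already sorted by time) into one stream."""
    return list(heapq.merge(*server_logs, key=lambda r: r["time"]))

def remove_crawlers(records):
    """Drop records from named crawlers and from clients that requested /robots.txt."""
    robot_ips = {r["ip"] for r in records if r["path"] == "/robots.txt"}
    kept = []
    for r in records:
        named = any(marker in r["agent"].lower() for marker in CRAWLER_MARKERS)
        if not named and r["ip"] not in robot_ips:
            kept.append(r)
    return kept

web1 = [
    {"time": 1, "ip": "10.0.0.1", "agent": "Mozilla/4.0", "path": "/index.html"},
    {"time": 4, "ip": "10.0.0.9", "agent": "Keynote-Monitor", "path": "/index.html"},
]
web2 = [
    {"time": 2, "ip": "10.0.0.2", "agent": "Mozilla/4.0", "path": "/robots.txt"},
    {"time": 3, "ip": "10.0.0.2", "agent": "Mozilla/4.0", "path": "/search.php"},
    {"time": 5, "ip": "10.0.0.1", "agent": "Mozilla/4.0", "path": "/search.php"},
]
fused = fuse_logs(web1, web2)
clean = remove_crawlers(fused)
```

A rate-based heuristic, such as flagging a client that requests the same page several times a minute around the clock, could be added in the same filter.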
  • 24. Page View Identification • Requires understanding of the structure of the site, page contents, site domain knowledge • Can be a single file (one-to-one correspondence with page view) • Can be a collection of objects, or a dynamically constructed page • Can be a hierarchical list (e.g., information pages, product views, registration, shopping cart changes, payment etc.)
  • 25. User Identification • Easy if the user has to log in • IP addresses are very accurate, except behind proxy servers • Combination of IP address and browser • More difficult across different sessions (multiple machines and multiple users) • Cookies are a possible option – Different browsers – Different computers – Cookies are deleted!
  • 26. Sessionization • Process of identifying the page views requested by a single user in a single session • Find all page requests from the same user and group them using heuristics • Issue a “session id” • Modify the URL in the log record to include the session id • Decide when the session ended!
  • 27. Sessionization • Time-oriented heuristics – Total session duration may not exceed Θ – Total time on a page may not exceed δ • Referrer-oriented heuristics – A request q is added to the session S if the referrer for q was previously invoked in S – Else q is the starting point for a new session
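Both heuristics can be sketched in a few lines of Python. The threshold values (Θ = 1800 seconds, δ = 600 seconds) are assumptions chosen for the example, and the referrer version is simplified to look only at the most recent open session.

```python
def sessionize_by_time(timestamps, theta=1800, delta=600):
    """Split one user's sorted request times (in seconds) into sessions.

    A request joins the current session only if the gap since the previous
    request is at most delta and the session would not last longer than theta.
    """
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= delta and t - sessions[-1][0] <= theta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def sessionize_by_referrer(requests):
    """Split one user's (url, referrer) pairs, given in time order, into sessions."""
    sessions = []
    for url, referrer in requests:
        if sessions and referrer in sessions[-1]:
            sessions[-1].append(url)
        else:
            # The referrer was not seen in the current session: start a new one.
            sessions.append([url])
    return sessions

by_time = sessionize_by_time([0, 100, 500, 1200, 1700, 1900, 5000])
by_duration = sessionize_by_time([0, 500, 1000, 1500, 2000])
by_referrer = sessionize_by_referrer(
    [("/a", None), ("/b", "/a"), ("/c", "/b"), ("/x", None), ("/y", "/x")]
)
```

In `by_time` the splits come from gaps longer than δ; in `by_duration` every gap is short, and the split comes from the Θ limit on total session length.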
  • 28. Sessionization • Episode – A subset of relevant page views in a session – Comprises functionally or semantically related page views – Requires classification of page views into functional or concept categories
  • 29. Path Completion • The paths are incomplete – Caching leads to missing entries – Caching by proxy servers – Back button creates missing links • Session log contains time stamps which can be mined – Missing pages do not have time stamps – Dynamic pages are unique and not cached! • Requires knowledge of the site structure and referrer information
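One common way to approach path completion is to use the referrer of each logged request to infer Back-button steps that were served from cache. The sketch below assumes a simple browser-history stack and does not consult the site structure, so it is an illustration of the idea rather than a full method; the page names are hypothetical.

```python
def complete_path(requests):
    """Insert inferred Back-button page views into one session.

    requests: list of (url, referrer) pairs in time order.
    Returns the completed sequence of page views.
    """
    path = []   # completed sequence, including inferred cached views
    stack = []  # assumed browser history for this session
    for url, referrer in requests:
        if referrer in stack and stack[-1] != referrer:
            # The referrer is an earlier page: assume the user pressed Back
            # through cached pages until reaching it.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(url)
        path.append(url)
    return path

# The log shows A, B, C, D, but D was reached from A, so C -> B -> A was cached.
completed = complete_path([("A", None), ("B", "A"), ("C", "B"), ("D", "A")])
```

The inferred page views carry no time stamps of their own, which matches the slide's remark that missing pages do not have time stamps.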
  • 30. Data Integration • Pre-processing results in a set of sessions or episodes • Other data (demographics, ratings, past purchases etc.) needs to be integrated to lead to WA/BI metrics such as customer conversion ratios and lifetime value • Additional data – shopping cart changes, shipping and address info, click-throughs, impressions • The transactional database is extracted into data marts or OLAP cubes after a certain amount of aggregation
  • 31. Modeling • Statistical Analysis – Aggregated by pre-determined units (days, sessions, visitors etc.) – Most frequent pages, average view time, length of path, entry and exit etc. – Referrers, user agents, requested resources – Usually presented in bar charts, tables and comparative tables
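The statistics named above reduce to simple counting once sessions exist. A minimal sketch follows, assuming each session is a non-empty list of page names; the sessions shown are made-up data, not results from the study.

```python
from collections import Counter

def summarize(sessions):
    """Compute basic usage statistics from a list of non-empty sessions."""
    page_counts = Counter(page for s in sessions for page in s)
    return {
        "top_pages": page_counts.most_common(3),
        "entry_pages": Counter(s[0] for s in sessions),
        "exit_pages": Counter(s[-1] for s in sessions),
        "avg_path_length": sum(len(s) for s in sessions) / len(sessions),
    }

# Illustrative sessions only
sessions = [
    ["index", "search", "viewprofile", "logout"],
    ["index", "login", "search"],
    ["search", "viewprofile"],
]
stats = summarize(sessions)
```

Each counter can be fed directly into a bar chart or table of the kind the slide mentions.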
  • 32. Modeling • Segmentation – use cluster analysis • Associations and correlation analysis • Frequent item-set graph • Sequential and navigational patterns • Predictive analytics using classification techniques
  • 33. Prof. Vishnuprasad Nagadevara Indian Institute of Management Bangalore
  • 34. Information from Web Analytics • How many visitors visit the page daily? • Who are the regular visitors? • What percentage of the visitors to the page are registered users? • What are the top pages visited on the website? • What is the average visit time on the website? • How often does the visitor return to the site? • What is the average page depth of a visitor? • What is the geographic distribution of users of the website? [Diagram: Web Analytics linked to Personalization, System Improvement, Site Modification, Business Intelligence, Usage characteristics]
  • 35. Objectives of the Study • The objectives of this study are to – Explore Web analytics and its usefulness to web based business. – Identify the techniques used in click stream analysis. – Identify the application of click stream analysis through analyzing click stream data obtained from a particular website using appropriate click stream analysis techniques.
  • 36. Methodology • This study analyzes the click stream data obtained from a web site, which specializes in an online information exchange service to facilitate identification of suitable partners, in India and other countries. • The site has a very different revenue model. The visitors are allowed to browse through the site without any initial payment. The visitors are allowed to look at the profiles of prospective partners free of charge. The visitors will have to become members by making a one-time payment only when they need to contact the prospective brides or grooms. • Users can search for profiles through advanced search options on the site on various preferences ranging from basic details of preferred partner to lifestyle, career, education, profession etc.
  • 37. Methodology • Members can make initial contact with each other through services available via Chat, SMS, and e-mail. • Users can avail free registration on the website and are assured of exclusive privacy and confidentiality. The website allows the users to create their profiles, search for other profiles, and express interest in other profiles and contact others. Registration and creating a profile is free of cost. • Registered users can become paid members that will allow them to contact others, view contact details of other members, write personalized messages, initiate chats and let other members view their contact details. Paid memberships are provided for a specified duration.
  • 38. Methodology • The click stream data is analyzed to identify different paths taken by the visitors and the sequence of pages that lead to payment of membership fee. Based on this analysis, specific strategies are recommended to maximize the revenue for the website.
  • 39. DATA PREPARATION • Problem: Format of data – Clickstream data files are neither delimited nor fixed-length files • Solution: – Used the date in the clickstream as the delimiter to import data into the database – Had to perform string handling in the database to separate out the fields • Sample records:
    10.208.65.96 172.16.8.37, 124.124.35.130 - - [23/May/2008:00:00:00 -0400] "GET /billing/billing.php?user=&cid=22401528da14a61c43512fa025b59578i353273 HTTP/1.0" 200 1832
    10.208.65.96 68.126.193.219 - - [23/May/2008:00:00:00 -0400] "GET /profile/js/common.js HTTP/1.1" 200 12462
    10.208.65.96 59.95.71.32 - - [23/May/2008:00:00:00 -0400] "GET /P/css/comm_style.css HTTP/1.1" 200 2640
    10.208.65.96 122.163.70.145 - - [23/May/2008:00:00:00 -0400] "GET /P/search.php?checksum=&searchchecksum=16465054&j=300&newsearch=&inf_checksum=&castemapping=&crmback=&searchorder=T&label_select_no=&savesearch=&from_index=&viewall=&save_search_redirect=&hide_search_bar=y HTTP/1.1" 200 21561
    10.208.65.96 61.1.81.153 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.1" 304 26
    10.208.65.96 68.197.236.117 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum=3590208069017f9d75933dfa9ac9005d|i|537f26ca181f05c308393257397ab261i2810388 HTTP/1.1" 200 3333
    10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26
    10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
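The study did its string handling inside a database; as a hedged alternative, the same idea of using the bracketed date as the delimiter can be sketched in Python. The function name `split_record` and the returned keys are assumptions. The client field can hold several comma-separated addresses, so it is returned as a list and the choice of which address identifies the visitor is left open.

```python
import re

# The bracketed timestamp is the one fixed landmark in every record.
STAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def split_record(line):
    """Split one clickstream record around its timestamp; None if no timestamp."""
    m = STAMP.search(line)
    if m is None:
        return None
    head = line[:m.start()].split()
    # After the timestamp: "METHOD resource PROTOCOL" status bytes
    request, _, rest = line[m.end():].strip().partition('" ')
    method, remainder = request.lstrip('"').split(" ", 1)
    resource, protocol = remainder.rsplit(" ", 1)
    status, size = rest.split()
    return {
        "server_ip": head[0],
        "client_ips": [h.rstrip(",") for h in head[1:] if h != "-"],
        "timestamp": m.group(1),
        "method": method,
        "resource": resource,
        "protocol": protocol,
        "status": int(status),
        "bytes": int(size),
    }

rec = split_record(
    '10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - '
    '[23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329'
)
```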
  • 40. Data • Data is obtained from the site in the form of click-stream records. Each record describes one click by a visitor and contains the following details: – Server IP – Client IP – Time stamp with date – Status: HTTP status code – URL requested: has three subfields, namely the request method, the resource requested and the protocol used – No. of bytes transferred • The country of origin for a specific request is identified using the IP address.
  • 41. Data • The URL is used to identify the information/web page browsed by the visitors. • The time stamp of each click is used to sequence the movement of the visitors across different pages in the website. • Identifying a unique user session is an important step in the analysis of click-stream data. Inactivity for more than 30 minutes is considered a break of session. • This is an approximation, since there could be multiple users accessing from the same IP, or the same user accessing from different IPs. • Owing to the lack of other data, hits from each unique IP are considered to belong to a unique user within a session.
  • 42. No. of Sessions
    Day | Number of sessions | Number of clicks
    Day 1 | 23,440 | 460,211
    Day 2 | 22,717 | 453,977
    Day 3 | 24,694 | 461,518
  • 43. DATA PREPARATION • Problem 3: Volume of data – Volume of data is huge. Performing string handling on this volume hits performance • Solution: – Convert data fields into non-string fields: dates as dates, numbers as numbers etc. – Remove unnecessary data (server IP) – Process data in batches of 100,000 records – Database tuning, indexing and query tuning required – Over 1,500 lines of code written – Processing still required more than 24 hours of run time
    Day | Number of records
    24-May-08 | 6285949
    25-May-08 | 6061424
    26-May-08 | 6298494
  • 44. DATA PREPARATION • Analyzing information in the clickstream.
    Sample record: 10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
    Field | Description
    Server IP | IP address of the server; example: 10.208.65.96
    Client IP | IP address of the client; example: 10.232.65.96, 10.232.49.1, 203.126.136.220
    Date Time | Date and time of click (server date time); example: [23/May/2008:00:00:00 -0400]
    URL requested | Request line exactly as it came from the client. It has 3 subfields: the request method, the resource requested and the protocol used; example: GET /profile/mainmenu.php?checksum= HTTP/1.1 (Request method: GET; Resource: /profile/mainmenu.php?checksum=; Protocol: HTTP/1.1)
    Status | The HTTP status code returned to the client; example: 200
    bytes | The content-length of the document transferred; example: 3329
  • 45. Data Preparation • Getting additional information – IP addresses allocation by country – Website mapping (identifying key actions on the website) – Identifying visitors, registered users and paid users through the actions performed on the website • Data transformation – Extract client IP address – Represent time as number of seconds past midnight – Extract web action from the URL string – Day of the week
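The transformations listed above can be sketched as one small function. How the study mapped a URL to a web action is not stated, so taking the script name without its extension (for example `mainmenu` from `/profile/mainmenu.php`) is an assumption, as are the function name and the sample time of day.

```python
from datetime import datetime
from urllib.parse import urlsplit
import posixpath

def transform(timestamp, resource):
    """Return (seconds past midnight, web action, day of week) for one click."""
    t = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    # Assumed rule: the action is the script name without directory or extension.
    script = posixpath.basename(urlsplit(resource).path)
    action = posixpath.splitext(script)[0]
    return seconds, action, t.strftime("%A")

seconds, action, day = transform(
    "23/May/2008:13:05:09 -0400", "/profile/mainmenu.php?checksum="
)
```

Seconds past midnight are convenient for hour-of-day charts, but comparing clicks across midnight still needs the full date.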
  • 50. Data Preparation • Session Identification – Each unique client IP address is considered a unique user – A break of more than 30 minutes between clicks is considered the end of one session – Clicks in a session are ordered by the time of occurrence • Session Sampling – The data volume is huge, so sample sessions need to be selected for further analysis – Sessions having between 50 and 100 clicks are selected for further analysis – Only those records that relate to a specific user action are retained; the remaining records are discarded.
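The session identification and sampling rules above can be sketched as follows. The input layout (client IP, absolute time in seconds, action) and the inclusive treatment of the 50 and 100 click bounds are assumptions; the 30-minute gap comes from the slide.

```python
from collections import defaultdict

def identify_sessions(clicks, gap=1800):
    """Group clicks by client IP and split on more than `gap` seconds of inactivity.

    clicks: iterable of (client_ip, time_in_seconds, action).
    Returns a list of (client_ip, [(time, action), ...]) sessions.
    """
    by_ip = defaultdict(list)
    for ip, t, action in clicks:
        by_ip[ip].append((t, action))
    sessions = []
    for ip, events in by_ip.items():
        events.sort()  # order clicks by time of occurrence
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                sessions.append((ip, current))
                current = []
            current.append(cur)
        sessions.append((ip, current))
    return sessions

def sample_sessions(sessions, low=50, high=100):
    """Keep sessions whose click count lies between low and high, inclusive."""
    return [s for s in sessions if low <= len(s[1]) <= high]

# Illustrative data: one long visit, one stray click after a long gap, one short visit
clicks = [("10.0.0.1", 10 * i, "search") for i in range(60)]
clicks.append(("10.0.0.1", 10000, "logout"))
clicks += [("10.0.0.2", 5 * i, "index") for i in range(3)]
all_sessions = identify_sessions(clicks)
sampled = sample_sessions(all_sessions)
```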
  • 51. DATA PREPARATION • Sample records:
    10.208.65.96 61.1.81.153 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.1" 304 26
    10.208.65.96 68.197.236.117 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum=3590208069017f9d75933dfa9ac9005d|i|537f26ca181f05c308393257397ab261i2810388 HTTP/1.1" 200 3333
    10.208.65.96 172.16.25.60, 59.145.189.43 - - [23/May/2008:00:00:00 -0400] "GET /P/css/homestyle.css HTTP/1.0" 304 26
    10.208.65.96 10.232.65.96, 10.232.49.1, 203.126.136.220 - - [23/May/2008:00:00:00 -0400] "GET /profile/mainmenu.php?checksum= HTTP/1.1" 200 3329
    Day | Number of sessions | Number of clicks
    24th May 2008 | 23440 | 460211
    25th May 2008 | 22717 | 453977
    26th May 2008 | 24694 | 461518
    Day | Number of records
    24-May-08 | 6285949
    25-May-08 | 6061424
    26-May-08 | 6298494
  • 52. DATA PREPARATION Preparing data for Associations Preparing data for Sequencing
  • 53. DATA PREPARATION • Learnings: – Clickstream data should be processed at runtime or at least on a daily basis; processing this data in batches is not efficient – Have a mechanism to capture the user ID of the person logged on; this is very important information that is missing in the clickstream data
  • 54. Bouncers and Serious Users [Figure: histogram of clicks per IP; bins 5, 20, 50, 100, 200, 500, 1000 and More; vertical axis 0 to 35,000]
  • 55. Number of clicks by hour of day [Figure: number of clicks (0 to 500,000) for hours 1 to 23]
  • 56. Countries By Hour [Figure: clicks by hour of day (hours 1 to 23, 0 to 500,000), broken down by US, UK, Singapore, NZ, NULL, India, Europe, Australia, Asia Pacific]
  • 57. Exit Points • Last action performed in a session [Figure: bar chart (0 to 7,000) of exit actions: logout, view profile, contact_hit_try, photocheck, index, mm_showmsg, top_search_band, mainmenu, contacts_made_received, search_clustering, search, single_contact_aj, login, mem_comparison, simprofile_search]
  • 59. Web Diagram – Freq ≥ 19,000
  • 60. Web Diagram – Freq ≥ 1,000
  • 61. Associations
    Consequent | Antecedents | Support % | Confidence %
    Payment = T | Photorequest = T, memcomp = T | 100 | 73.1
    Payment = T | Country = India, Photorequest = T, memcomp = T | 80 | 73
    Payment = T | Login = T, Photorequest = T, memcomp = T | 60 | 73
    Payment = T | ViewProfile = T, Photorequest = T, memcomp = T | 90 | 72.8
    Payment = T | ViewProfile = T, Login = T, Photorequest = T, memcomp = T | 60 | 72.5
    Payment = T | Country = India, ViewProfile = T, Photorequest = T, memcomp = T | 70 | 71.4
    Payment = T | Mmshowmsg = T, Photorequest = T, memcomp = T | 50 | 67.2
    Payment = T | ViewProfile = T, Mmshowmsg = T, Photorequest = T, memcomp = T | 50 | 66.4
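For readers unfamiliar with the two measures in the table, the sketch below computes them for a single rule. It uses the textbook definitions (support is the share of sessions containing antecedents and consequent together; confidence is that count divided by the sessions containing the antecedents). The deck does not say which definition its tool used, and the sessions below are made-up data, so the numbers are not comparable with the table.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support %, confidence %) for the rule antecedent -> consequent.

    sessions: list of sets of actions observed in each session.
    """
    antecedent = set(antecedent)
    n_antecedent = sum(1 for s in sessions if antecedent <= s)
    n_both = sum(1 for s in sessions if antecedent <= s and consequent in s)
    support = 100.0 * n_both / len(sessions)
    confidence = 100.0 * n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Illustrative sessions only
sessions = [
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp", "payment"},
    {"photorequest", "memcomp"},
    {"photorequest"},
    {"memcomp", "payment"},
    {"login"},
    {"login", "viewprofile"},
]
support, confidence = rule_metrics(sessions, {"photorequest", "memcomp"}, "payment")
```

Here three of the eight sessions contain all three actions, and three of the four sessions with both antecedents end in payment.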
  • 62. Summary and Conclusions • Usage of the website by time of day – This helps identify busy hours, indicates the server capacity required for the website, and shows when a maintenance window can be scheduled. • Usage of the website from different geographic locations – This provides data on the distribution of users across geographical locations. • Exit screens – Provide information on where the users exit from the website. This input can help redesign the website by showing which pages are breaking the flow of the user session.
  • 63. Summary and Conclusions • Most accessed and least accessed pages – This can be used for variable pricing of advertisements on the web page. This can also be used for better user interface design and space utilization, by removing or repositioning the links that are infrequently accessed. • Associations – Provide information on unique actions on the website and the sequence in which the user has performed these actions. This can be used in better user interface design. • Web diagrams – Give information on the co-occurrence of actions on the website and their significance; they also provide inputs on user interface design.