1. UNDER THE GUIDANCE OF Ms. Reshma R. Owhal
Dr. S.F. Sayyad ME (Computer)
Roll No: 17MCO004
2. Introduction
Data Collection and Pre-Processing
Data Modeling for Web Usage Mining
Discovery and Analysis of Web Usage Patterns
Conclusions
References
3. Web usage mining
– can be broadly defined as the discovery and analysis of
useful information from the WWW.
– is the automatic discovery of patterns in clickstreams and
associated data, collected or generated as a result of user
interactions with one or more Web sites.
Goal: analyze the behavioral patterns and profiles of
users interacting with a Web site.
5. Data collection and pre-processing are important in Web usage mining
due to the characteristics of clickstream data.
This process is critical to the successful extraction of useful
patterns from the data.
Pre-processing the original data before mining is known as
data preparation.
7. Data cleaning
– remove irrelevant references and fields in server
logs
– remove references due to spider/robot navigation
– add missing references due to caching (done after
sessionization)
Data fusion/integration
– synchronize data from multiple server logs
– integrate e-commerce and application server data
– integrate meta-data (e.g., content labels)
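The data cleaning step above can be sketched as follows. This is a minimal illustration, not a complete cleaner: the dictionary layout of a log entry, the suffix list, and the robot pattern are all assumptions made for the example, and real server logs need site-specific rules.

```python
import re

# Hypothetical filters; real deployments tune both lists per site.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def clean_log(entries):
    """Keep only entries that look like page requests made by people."""
    cleaned = []
    for entry in entries:
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(IRRELEVANT_SUFFIXES):
            continue  # embedded object, not a page request
        if path == "/robots.txt" or ROBOT_PATTERN.search(entry["agent"]):
            continue  # spider/robot navigation
        cleaned.append(entry)
    return cleaned

# Made-up sample entries for illustration only.
sample_log = [
    {"ip": "1.2.3.4", "url": "/index.html", "agent": "Mozilla/5.0"},
    {"ip": "1.2.3.4", "url": "/logo.gif", "agent": "Mozilla/5.0"},
    {"ip": "5.6.7.8", "url": "/index.html", "agent": "Googlebot/2.1"},
    {"ip": "5.6.7.8", "url": "/robots.txt", "agent": "Mozilla/5.0"},
]
print([e["url"] for e in clean_log(sample_log)])
```

Only the first sample entry survives: the image request, the request from a robot user agent, and the robots.txt request are dropped.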
8. Data transformation
– user identification
– sessionization
– pageview identification
• a pageview is a set of page files and associated
objects that contribute to a single display in a Web Browser
Data Reduction
– sampling and dimensionality reduction (ignoring certain
pageviews / items)
Identifying User Transactions
– i.e., sets or sequences of pageviews possibly with
associated weights
9. Sessionization (identifying sessions)
– The process of segmenting the user activity record of
each user into sessions, each representing a single visit to the site.
– The goal of a sessionization heuristic is to reconstruct, from
the clickstream data, the actual sequence of actions performed by
one user during one visit to the site.
Reliable usage data is difficult to obtain due to
– proxy servers
– dynamic IP addresses
– the inability of servers to distinguish among different visits.
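One common sessionization heuristic is time-oriented: a new session starts when the gap between two consecutive requests from the same user exceeds a threshold. The sketch below assumes a 30-minute threshold and a simple (user, timestamp, url) tuple layout; both are illustrative assumptions, not something the slides prescribe.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split requests, given as (user_id, timestamp_seconds, url), into sessions."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                # Gap too long: close the current visit and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions

# Made-up clickstream for illustration only.
clicks = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 60, "/home"),
]
for user, pages in sessionize(clicks):
    print(user, pages)
```

Here user u1 ends up with two sessions because more than 30 minutes pass between the second and third requests.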
10. Pageview identification
– Depends on the intra-page structure of sites
– Identify the collection of Web files representing a specific “user
event” corresponding to a clickthrough (e.g., viewing a product page, adding a
product to a shopping cart)
– e.g., the purchase of a product on an online e-commerce site
User Identification
– The analysis of Web usage does not require knowledge about a
user’s identity. However, it is necessary to distinguish among different users.
– Since a user may visit a site more than once, the server logs record
multiple sessions for each user.
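When no login or cookie is available, a frequently used fallback is to treat each distinct combination of IP address and user agent as one user. The sketch below assumes that heuristic and a hypothetical dictionary layout for log entries; it distinguishes users without knowing who they are.

```python
def identify_users(entries):
    """Label each entry with a user derived from its (IP, user agent) pair."""
    user_ids = {}
    labeled = []
    for entry in entries:
        key = (entry["ip"], entry["agent"])
        if key not in user_ids:
            user_ids[key] = f"user{len(user_ids) + 1}"
        labeled.append({**entry, "user": user_ids[key]})
    return labeled

# Made-up entries for illustration only.
entries = [
    {"ip": "1.2.3.4", "agent": "Firefox", "url": "/home"},
    {"ip": "1.2.3.4", "agent": "Chrome", "url": "/home"},
    {"ip": "1.2.3.4", "agent": "Firefox", "url": "/products"},
]
print([e["user"] for e in identify_users(entries)])
```

The heuristic is imperfect: several people behind one proxy with the same browser are merged into a single user, and one person with a dynamic IP address is split into several.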
11. Path completion
– Client- or proxy-side caching can often result in missing
access references to those pages or objects that have been cached.
– For instance, if a user goes back to a page A during the same
session, the second access to A will likely result in viewing the
previously downloaded version of A that was cached on the client side,
and therefore no request is made to the server.
– As a result, the second reference to A is not recorded in the
server logs.
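Missing references of this kind can sometimes be inferred from the referrer of the next logged request. The sketch below assumes that each logged request carries a referrer and that the user returned to the cached page by stepping back through pages already seen in the session; both are simplifying assumptions for illustration.

```python
def complete_path(requests):
    """Rebuild a session path from (referrer, page) pairs given in time order."""
    path = []
    for referrer, page in requests:
        if path and referrer is not None and path[-1] != referrer:
            # The request did not come from the last page seen, so walk back
            # through earlier pages until the referrer is found.
            backtrack = []
            for earlier in reversed(path[:-1]):
                backtrack.append(earlier)
                if earlier == referrer:
                    path.extend(backtrack)  # inferred views served from cache
                    break
        path.append(page)
    return path

# The user views A, then B, goes back to the cached A, then requests C.
logged = [(None, "A"), ("A", "B"), ("A", "C")]
print(complete_path(logged))
```

For the logged requests A, B, C the completed path is A, B, A, C: the second view of A is restored even though the server never saw it. If the referrer does not appear earlier in the session, nothing is inserted.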
13. The discovered patterns: usually represented as
– collections of pages, objects, or resources that are
frequently accessed by groups of users with
common interests.
14. Decision Trees
◦ a flow chart of questions leading to a decision
◦ Ex: car buying decision tree
Path Analysis
◦ Uses a graph model
◦ Provides insights into navigational problems
◦ Examples of information discovered by path analysis:
78% of visitors followed “company” -> “what’s new” -> “sample” -> “order”
60% left the site after four or fewer page references
=> the most important information must be within the first four pages of
site entry points.
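Statistics like the ones above can be computed directly from sessionized data. The sketch below is a minimal illustration with made-up sessions; the function name and the session layout (a list of page names per visit) are assumptions for the example.

```python
from collections import Counter

def path_stats(sessions, max_len=4):
    """Summarize navigation over sessions, each a list of page names."""
    total = len(sessions)
    # Edges of the navigation graph: how often page a was followed by page b.
    edges = Counter((a, b) for s in sessions for a, b in zip(s, s[1:]))
    top_path, top_count = Counter(tuple(s) for s in sessions).most_common(1)[0]
    short_share = sum(len(s) <= max_len for s in sessions) / total
    return edges, top_path, top_count / total, short_share

# Made-up sessions for illustration only.
full = ["company", "what's new", "sample", "order"]
sessions = [full, full, full, ["company", "products"], full + ["help"]]

edges, top_path, top_share, short_share = path_stats(sessions)
print(" -> ".join(top_path), top_share)
print("share of visits with at most 4 pages:", short_share)
```

The edge counts give the graph model, the most common full path gives figures like the 78% example, and the share of short visits gives figures like the 60% example.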
15. Grouping
◦ Groups similar information to help draw higher-level conclusions
◦ Ex: all URLs containing the word “Yahoo”…
Filtering
◦ Allows answering specific questions such as:
how many visitors came to the site this week?
Cookies
◦ An ID randomly assigned by the Web server to the browser
◦ Cookies are beneficial to both Web site developers and visitors
◦ The cookie field in a log file can be used by Web traffic analysis
software to track repeat visitors (loyal customers).
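As a small illustration of filtering by cookie ID, the sketch below counts distinct visitors in a date range and lists repeat visitors. The entry layout, the cookie values, and the dates are made up for the example.

```python
from collections import Counter
from datetime import date

def count_visitors(entries, start, end):
    """Number of distinct cookie IDs seen from start to end, inclusive."""
    return len({e["cookie"] for e in entries if start <= e["day"] <= end})

def repeat_visitors(entries):
    """Cookie IDs that appear on more than one distinct day."""
    days_seen = Counter()
    for cookie, _day in {(e["cookie"], e["day"]) for e in entries}:
        days_seen[cookie] += 1
    return sorted(c for c, n in days_seen.items() if n > 1)

# Made-up log entries for illustration only.
log = [
    {"cookie": "c1", "day": date(2024, 1, 1)},
    {"cookie": "c1", "day": date(2024, 1, 3)},
    {"cookie": "c2", "day": date(2024, 1, 5)},
    {"cookie": "c3", "day": date(2024, 1, 20)},
]
print(count_visitors(log, date(2024, 1, 1), date(2024, 1, 7)))
print(repeat_visitors(log))
```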
16. Association Rules
◦ help find spending patterns on related products
◦ Ex: 30% of those who accessed /company/products/bread.html also accessed
/company/products/milk.htm.
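A percentage like this is the confidence of an association rule. The sketch below computes support and confidence for a single rule over made-up transactions; the eggs URL is hypothetical, and a full miner such as Apriori would also enumerate candidate rules, which this example does not do.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    support = len(with_both) / len(transactions) if transactions else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

BREAD = "/company/products/bread.html"
MILK = "/company/products/milk.htm"
EGGS = "/company/products/eggs.html"  # hypothetical page for the example

# Made-up transactions (sets of pages accessed in one visit).
transactions = [
    {BREAD, MILK},
    {BREAD},
    {BREAD, EGGS},
    {MILK},
    {BREAD, MILK},
]
support, confidence = rule_stats(transactions, BREAD, MILK)
print(support, confidence)
```

In this toy data, two of the four visits that include the bread page also include the milk page, so the confidence is 0.5 and the support is 0.4.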
Sequential Patterns
◦ help find inter-transaction patterns
◦ Ex: 50% of those who bought items in /pcworld/computers/ also bought in
/pcworld/accessories/ within 15 days
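A sequential pattern differs from an association rule in that order and elapsed time matter. The sketch below checks one such pattern over made-up purchase histories; the data layout (day numbers per user) and the function name are assumptions for illustration, and it does not search for patterns on its own.

```python
def followed_within(purchases, first, second, max_days=15):
    """Share of buyers in `first` who later bought in `second` within max_days."""
    bought_first = 0
    followed = 0
    for events in purchases.values():
        first_days = [day for day, section in events if section == first]
        if not first_days:
            continue
        bought_first += 1
        second_days = [day for day, section in events if section == second]
        if any(0 < s - f <= max_days for f in first_days for s in second_days):
            followed += 1
    return followed / bought_first if bought_first else 0.0

COMPUTERS = "/pcworld/computers/"
ACCESSORIES = "/pcworld/accessories/"

# Made-up histories: user -> list of (day number, site section).
purchases = {
    "u1": [(1, COMPUTERS), (10, ACCESSORIES)],
    "u2": [(1, COMPUTERS), (30, ACCESSORIES)],
    "u3": [(5, ACCESSORIES)],
}
print(followed_within(purchases, COMPUTERS, ACCESSORIES))
```

Only u1 buys accessories within 15 days of buying a computer, so one of the two computer buyers matches the pattern.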
Clustering
◦ Identifies visitors with common characteristics based on visitors’ profiles
◦ One straightforward approach to creating an aggregate view of each
cluster is to compute the centroid of each cluster.
◦ Ex: 50% of those who applied for the Discover Platinum card in
/discovercard/customerService/newcard were in the 25-35 age group,
with annual income between $40,000 and $50,000.
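The centroid of a cluster is the mean of its members' vectors. The sketch below assumes each visitor session is represented as a vector of pageview weights over a fixed list of pages; the page names and weights are made up for the example.

```python
def centroid(vectors):
    """Component-wise mean of equal-length pageview weight vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

PAGES = ["/home", "/products", "/cart"]  # hypothetical pageviews

# Made-up cluster: one binary vector per session, 1 if the page was viewed.
cluster = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
profile = dict(zip(PAGES, centroid(cluster)))
print(profile)
```

Each centroid component can be read as the share of sessions in the cluster that viewed that page, which gives a compact aggregate profile of the cluster.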
17. Web mining supports ongoing, continuous improvement for
e-businesses
Using Web usage data and data mining to find patterns is a growing area,
driven by the growth of Web-based applications
Web usage data can be used to better understand Web usage and to apply
this knowledge to better serve users
Web usage patterns and data mining can be the basis for a great deal
of future research
18. Bamshad Mobasher, “Web Usage Mining”, chapter in Bing Liu, “Web Data
Mining: Exploring Hyperlinks, Contents, and Usage Data”, Springer.
Roopa Datla, Jinguang Liu, “Web Usage Mining-What, Why, hoW”
(presentation).
Srivastava J., Cooley R., Deshpande M., Tan P.N., “Web Usage Mining:
Discovery and Applications of Usage Patterns from Web Data”, SIGKDD
Explorations, Vol. 1, Issue 2, 2000.
Qiaoyuan Jiang, “Web Usage Mining: Processes and Applications”, CSE
8331, November 24, 2003.