UNDER THE GUIDANCE OF Ms. Reshma.R.Owhal
Dr. S.F. Sayyad ME (Computer)
Roll No: 17MCO004
• Introduction
• Data Collection and Pre-Processing
• Data Modeling for Web Usage Mining
• Discovery and Analysis of Web Usage Patterns
• Conclusions
• References
• Web usage mining
– can be broadly defined as the discovery and analysis of useful information from the WWW.
– is the automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.
• Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
• Creating a suitable target data set is especially important in Web usage mining because of the characteristics of clickstream data.
• This process is critical to the successful extraction of useful patterns from the data.
• The process may involve pre-processing the original data, a step known as data preparation.
• Data cleaning
– remove irrelevant references and fields in server logs
– remove references due to spider/robot navigation
– add missing references due to caching (done after sessionization)
• Data fusion/integration
– synchronize data from multiple server logs
– integrate e-commerce and application server data
– integrate meta-data (e.g., content labels)
• Data transformation
– user identification
– sessionization
– pageview identification: a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
• Data reduction
– sampling and dimensionality reduction (ignoring certain pageviews / items)
• Identifying user transactions
– i.e., sets or sequences of pageviews, possibly with associated weights
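The data cleaning step listed above can be sketched as a simple filter over parsed log entries. This is a minimal illustration, not a definitive implementation: the entry layout (dictionaries with `url` and `agent` keys), the suffix list, and the robot pattern are all assumptions made for the example.

```python
import re

# Hypothetical lists for illustration; a real site would tune both.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def clean_log(entries):
    """Keep only entries that look like page requests made by human visitors."""
    cleaned = []
    for entry in entries:
        # Ignore the query string when checking the file type.
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(IRRELEVANT_SUFFIXES):
            continue  # embedded object (image, style sheet, script), not a page
        if path == "/robots.txt" or ROBOT_PATTERN.search(entry["agent"]):
            continue  # reference due to spider/robot navigation
        cleaned.append(entry)
    return cleaned


sample = [
    {"url": "/index.html", "agent": "Mozilla/5.0"},
    {"url": "/logo.gif", "agent": "Mozilla/5.0"},
    {"url": "/products.html", "agent": "ExampleBot/1.0"},
    {"url": "/robots.txt", "agent": "ExampleBot/1.0"},
]
print([e["url"] for e in clean_log(sample)])
```

Matching robots by user-agent string alone misses crawlers that disguise themselves, so this filter is only a first pass.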
Sessionization (identify sessions)
– The process of segmenting the activity record of each user into sessions, each representing a single visit to the site.
– The goal of a sessionization heuristic is to reconstruct, from the clickstream data, the actual sequence of actions performed by one user during one visit to the site.
Reliable usage data is difficult to obtain due to
– proxy servers
– dynamic IP addresses
– the inability of servers to distinguish among different visits
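One common way to approximate sessionization is a time-oriented heuristic: a new session starts whenever the gap between two consecutive requests from the same user exceeds a threshold. The sketch below assumes requests are already attributed to users and arrive as `(user_id, timestamp_in_seconds, url)` tuples. The 30-minute threshold and the function name `sessionize` are illustrative choices, not something prescribed by the slides.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed inactivity threshold, in seconds


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split each user's requests into sessions using an inactivity timeout.

    requests: iterable of (user_id, timestamp_seconds, url) tuples.
    Returns a list of (user_id, [url, ...]) pairs, one per session.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order within each user
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                # Gap too long: close the current visit and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions


log = [("u1", 0, "/a"), ("u1", 600, "/b"), ("u1", 4000, "/c"), ("u2", 100, "/a")]
print(sessionize(log))
```

Other heuristics (for example, ones based on the referrer field) can give different session boundaries for the same log.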
Pageview identification
– Depends on the intra-page structure of the site
– Identifies the collection of Web files representing a specific “user event” corresponding to a clickthrough (e.g., viewing a product page or adding a product to a shopping cart)
– Another example of such an event is the purchase of a product on an e-commerce site
User identification
– The analysis of Web usage does not require knowledge of a user’s identity, but it is necessary to distinguish among different users.
– Since a user may visit a site more than once, the server logs record multiple sessions for each user.
Path completion
– Client- or proxy-side caching can often result in missing access references to pages or objects that have been cached.
– For instance, if a user goes back to a page A during the same session, the second access to A will likely display the previously downloaded version of A cached on the client side, so no request is made to the server.
– As a result, the second reference to A is not recorded in the server logs.
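A referrer-based heuristic is one way to restore such missing references. If the referrer of a request is not the page viewed immediately before it but does appear earlier in the session, the user presumably stepped back through cached pages to reach it. The sketch below assumes each session is a list of `(url, referrer)` pairs in request order. The function name `complete_path` is hypothetical.

```python
def complete_path(session):
    """Insert inferred back-navigation steps that caching hid from the log.

    session: list of (url, referrer) pairs in request order.
    Returns the list of URLs with the inferred cached page views added.
    """
    completed = []
    for url, referrer in session:
        if (completed and referrer
                and referrer != completed[-1]
                and referrer in completed):
            # Position of the most recent visit to the referrer.
            idx = len(completed) - 1 - completed[::-1].index(referrer)
            # Pages the user backed through, ending on the referrer itself.
            backtrack = completed[idx:len(completed) - 1][::-1]
            completed.extend(backtrack)
        completed.append(url)
    return completed


# The user visits A, B, C, then presses Back twice and follows a link from A to D.
visit = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
print(complete_path(visit))  # ['A', 'B', 'C', 'B', 'A', 'D']
```

The inferred entries have no timestamps of their own, so any later time-based analysis has to estimate them.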
• The discovered patterns are usually represented as
– collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.
• Decision Trees
◦ A flow chart of questions leading to a decision
◦ Ex: car-buying decision tree
• Path Analysis
◦ Uses a graph model
◦ Provides insight into navigational problems
◦ Examples of information discovered by path analysis:
▪ 78% followed the path “company” -> “what’s new” -> “sample” -> “order”
▪ 60% left the site after 4 or fewer page references
=> the most important information must be within the first 4 pages from the site entry points.
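Statistics like the two above can be computed directly from sessionized data. The sketch below assumes each session is a list of page names. The helper names `path_share` and `short_visit_share` are hypothetical, and the sample sessions are made up for illustration.

```python
from collections import Counter


def path_share(sessions, path):
    """Fraction of sessions that consist of exactly the given page sequence."""
    if not sessions:
        return 0.0
    counts = Counter(tuple(s) for s in sessions)
    return counts[tuple(path)] / len(sessions)


def short_visit_share(sessions, max_pages=4):
    """Fraction of sessions that ended after at most max_pages page references."""
    if not sessions:
        return 0.0
    short = sum(1 for s in sessions if len(s) <= max_pages)
    return short / len(sessions)


sessions = [
    ["company", "whats-new", "sample", "order"],
    ["company", "whats-new"],
    ["company", "a", "b", "c", "d"],
    ["company", "whats-new", "sample", "order"],
    ["x"],
]
print(path_share(sessions, ["company", "whats-new", "sample", "order"]))
print(short_visit_share(sessions))
```

A fuller path analysis would build a graph of page transitions. Counting whole-session paths is the simplest special case.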
• Grouping
◦ Groups similar information to help draw higher-level conclusions
◦ Ex: all URLs containing the word “Yahoo”…
• Filtering
◦ Answers specific questions such as:
▪ How many visitors came to the site this week?
• Cookies
◦ An ID randomly assigned by the Web server to the browser
◦ Cookies are beneficial to both Web site developers and visitors
◦ The cookie field in the log file can be used by Web traffic analysis software to track repeat visitors -> loyal customers.
• Association Rules
◦ Help find spending patterns on related products
◦ Ex: 30% of those who accessed /company/products/bread.html also accessed /company/products/milk.htm.
• Sequential Patterns
◦ Help find inter-transaction patterns
◦ Ex: 50% of those who bought items in /pcworld/computers/ also bought items in /pcworld/accessories/ within 15 days
• Clustering
◦ Identifies visitors with common characteristics based on visitors’ profiles
◦ One straightforward way to create an aggregate view of each cluster is to compute its centroid.
◦ Ex: 50% of those who applied for the Discover Platinum card at /discovercard/customerService/newcard were in the 25-35 age group, with annual income between $40,000 and $50,000.
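Two of the quantities mentioned above are easy to illustrate: the confidence of an association rule and the centroid of a cluster. The sketch below assumes sessions are sets of page URLs and that cluster members are equal-length numeric vectors (for example, one weight per pageview). The function names and sample data are made up for the example.

```python
def confidence(sessions, antecedent, consequent):
    """Share of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for s in with_antecedent if consequent in s)
    return hits / len(with_antecedent)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


bread = "/company/products/bread.html"
milk = "/company/products/milk.htm"
# Ten sessions access the bread page; three of them also access the milk page.
sessions = [{bread, milk}] * 3 + [{bread}] * 7 + [{milk}] * 2
print(confidence(sessions, bread, milk))  # 0.3

# Two sessions in one cluster, as weights over three pageviews.
print(centroid([[1, 0, 1], [1, 1, 0]]))  # [1.0, 0.5, 0.5]
```

A real miner would also report support and search for frequent itemsets, which this sketch does not attempt.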
• Web mining supports ongoing, continuous improvement for e-businesses
• Using Web usage data and data mining to find patterns is a growing area, driven by the growth of Web-based applications
• Web usage data can be used to better understand how the Web is used and to apply this knowledge to better serve users
• Web usage patterns and data mining can be the basis for a great deal of future research
• Bamshad Mobasher. “Web Usage Mining”. Chapter in Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data”, Springer.
• Roopa Datla, Jinguang Liu. “Web Usage Mining: What, Why, hoW” (presentation).
• Srivastava J., Cooley R., Deshpande M., Tan P.N. “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”. SIGKDD Explorations, Vol. 1, Issue 2, 2000.
• Qiaoyuan Jiang. “Web Usage Mining: Processes and Applications”. CSE 8331, November 24, 2003.
Thank you…..
