A Presentation
on
Web Mining
Presented By
Tanjarul Islam Mishu
[@tanjarul26]
Dept. of CSE
Jatiya Kabi Kazi Nazrul Islam University
Spring 2006
Overview
 Web Mining
 Opportunities and Challenges
 Data Mining vs. Web Mining
 Classification of Web Mining Techniques
 Web Content Mining Techniques
 Web Usage Data Sources
 Web Usage Mining Model
Web Mining
 Mining means extracting something useful or valuable
from a baser substance, such as mining gold from the
earth.
 Web mining is the process of using data mining
techniques and algorithms to extract information
directly from the Web.
 It uses Web documents and services,
Web content, hyperlinks and
server logs.
Opportunities and Challenges
 The amount of information on the Web is huge.
 The coverage of Web information
is very wide and diverse.
 Information of all types exists on the Web.
 Much of the Web information is semi-structured due to
the nested structure of HTML code.
Opportunities and Challenges
 Much of the Web information is
linked.
 Much of the Web information is
redundant.
 The Web is noisy: a Web page is typically a mixture of many kinds of
information.
 The Web is dynamic.
Data Mining vs. Web Mining
 Traditional data mining
 data is structured and relational
 well-defined tables, columns, rows, keys, and constraints.
 Web data mining
 data is semi-structured and unstructured
 readily available data
 rich in features and patterns
Classification of Web Mining Techniques
Web Mining
   Web Structure Mining
   Web Content Mining
   Web Usage Mining
Web-Structure Mining
 Web structure mining identifies the relationships between Web pages
that are connected by related information or by direct links.
 Structure mining is used to minimize two main problems of
the World Wide Web:
   Irrelevant search results
   The inability to index the vast amount of information
provided on the Web
Web Content Mining
 The process of information or resource discovery
from the content of millions of sources across the
World Wide Web.
 Web content includes, e.g., text, images, audio, video,
metadata and hyperlinks.
 It is related to text mining because much Web content is text.
Web Content Mining Techniques
Web Content Mining
   Classification
   Clustering
   Association
Document Classification
 Supervised Learning
 Supervised learning is a machine learning technique that creates
a function from training data.
 The output of the function can predict a class label for an input object
(this is called classification).
 Techniques used are
 Nearest Neighbor Classifier
 Feature Selection
 Decision Tree
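The nearest neighbor idea listed above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes a bag-of-words representation and cosine similarity, and the training documents, labels and function names (`vectorize`, `cosine`, `classify`) are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical labelled training documents (not taken from the slides).
TRAINING = [
    ("cheap flights and hotel deals", "travel"),
    ("book a hotel near the beach", "travel"),
    ("football match score and league table", "sports"),
    ("tennis match results and player ranking", "sports"),
]

def classify(text):
    """Return the label of the most similar training document (1-NN)."""
    query = vectorize(text)
    best_label, best_score = None, -1.0
    for doc, label in TRAINING:
        score = cosine(query, vectorize(doc))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("hotel deals for the weekend"))
print(classify("league match score"))
```

A real system would normally add feature selection (for example, dropping very common words) before computing similarities.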
Document Clustering
 Unsupervised learning: a data set of input objects is
gathered, without class labels.
 Goal: develop measures of similarity to cluster a collection
of documents/terms into groups such that similarity
within a cluster is larger than similarity across clusters.
 Hypothesis: given a ‘suitable’ clustering of a collection, if
the user is interested in document/term d/t, he or she is likely to
be interested in other members of the cluster to which d/t
belongs.
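As a rough illustration of the goal above, the sketch below groups documents whose term sets are similar enough. It assumes Jaccard similarity and a simple single-pass "leader" scheme, neither of which is named in the slides; the documents, the threshold and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(docs, threshold=0.2):
    """Single-pass clustering: a document joins the first cluster whose
    leader (first member) is at least `threshold` similar, else starts a new one."""
    term_sets = [set(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of document indices; index 0 is the leader
    for i, terms in enumerate(term_sets):
        for cluster in clusters:
            if jaccard(terms, term_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical documents (not taken from the slides).
DOCS = [
    "web mining extracts patterns from web data",
    "data mining finds patterns in data",
    "the football league match ended in a draw",
    "league match results for football fans",
]

print(leader_cluster(DOCS))
```

With these four documents the two mining texts end up in one group and the two football texts in another, so similarity within each group is higher than across groups.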
Association
Example: Supermarket
Transaction ID   Items Purchased
1                butter, bread, tea
2                bread, tea, sugar, egg
3                diaper
…                ………
 An association rule can be:
“If a customer buys tea, in 50% of cases he/she also
buys sugar. This happens in 33% of all transactions.”
   Confidence: 50% (share of tea transactions that also contain sugar)
   Support: 33% (share of all transactions that contain both tea and sugar)
 Association can also be integrated in hyperlinks.
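The 50% and 33% figures can be reproduced with a short calculation. The sketch below assumes that only the three transactions listed on the slide are counted (the elided rows are ignored); the function names `support` and `confidence` are illustrative.

```python
# The three transactions shown on the slide; elided rows are not included.
TRANSACTIONS = [
    {"butter", "bread", "tea"},
    {"bread", "tea", "sugar", "egg"},
    {"diaper"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base

rule_support = support({"tea", "sugar"}, TRANSACTIONS)
rule_confidence = confidence({"tea"}, {"sugar"}, TRANSACTIONS)
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

Tea appears in two of the three transactions and tea together with sugar in one, which gives the support of 1/3 and the confidence of 1/2 quoted in the rule.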
Web-Usage Mining
 What is Usage Mining?
Discovering user ‘navigation patterns’ from web data.
Prediction of user behavior while the user interacts
with the web.
Web-Usage Mining
 Usage Mining Techniques
   Data Preparation
       Data Collection
       Data Selection
       Data Cleaning
   Data Mining
       Navigation Patterns
       Sequential Patterns
Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
[Figure: Web page hierarchy of a Web site, with pages A, B, C, D and E]
Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
Example analysis:
 70% of users who accessed /company/product2 did so by starting
at /company and proceeding through /company/new,
/company/products and /company/product1.
 80% of users who accessed the site started from
/company/products.
 65% of users left the site after
four or fewer page references.
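Statistics of this kind can be computed once page requests have been grouped into per-user sessions. The sketch below is only an illustration: the sessions are hypothetical (so the resulting percentages differ from those on the slide), and the function names are assumptions.

```python
# Hypothetical sessions: each is the ordered list of pages one user requested.
SESSIONS = [
    ["/company", "/company/new", "/company/products",
     "/company/product1", "/company/product2"],
    ["/company/products", "/company/product1"],
    ["/company/products", "/company/product2"],
    ["/company", "/company/products"],
]

def share_starting_at(sessions, page):
    """Fraction of sessions whose first request is `page`."""
    return sum(1 for s in sessions if s and s[0] == page) / len(sessions)

def share_short_sessions(sessions, max_refs=4):
    """Fraction of sessions that ended after at most `max_refs` page references."""
    return sum(1 for s in sessions if len(s) <= max_refs) / len(sessions)

def share_reaching_via(sessions, target, path):
    """Among sessions that reach `target`, the fraction whose requests
    before the first visit to `target` are exactly `path`."""
    reaching = [s for s in sessions if target in s]
    if not reaching:
        return 0.0
    hits = sum(1 for s in reaching if s[:s.index(target)] == path)
    return hits / len(reaching)

print(share_starting_at(SESSIONS, "/company/products"))
print(share_short_sessions(SESSIONS))
print(share_reaching_via(
    SESSIONS, "/company/product2",
    ["/company", "/company/new", "/company/products", "/company/product1"],
))
```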
Web-Usage Mining cont…
 Data Mining Techniques – Sequential Patterns
Example: Supermarket
Customer   Transaction Time     Purchased Items
John       6/21/05 5:30 pm      Beer
John       6/22/05 10:20 pm     Brandy
Frank      6/20/05 10:15 am     Juice, Coke
Frank      6/20/05 11:50 am     Beer
Frank      6/20/05 12:50 pm     Wine, Cider
Mary       6/20/05 2:30 pm      Beer
Mary       6/21/05 6:17 pm      Wine, Cider
Mary       6/22/05 5:05 pm      Brandy
Web-Usage Mining cont…
 Data Mining Techniques – Sequential Patterns
Example: Supermarket (cont.)
Customer sequences:
Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)
Mining result (support >= 40%):
Sequential Pattern       Supporting Customers
(Beer) (Brandy)          John, Mary
(Beer) (Wine, Cider)     Frank, Mary
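The supporting customers of a pattern can be checked mechanically. The sketch below encodes the three customer sequences from the slide as lists of item sets and tests whether a pattern occurs in order within each sequence; the function names `contains` and `supporters` are illustrative, and this is a brute-force check rather than a full sequential pattern mining algorithm.

```python
# Customer sequences from the slide, one list of item sets per customer.
SEQUENCES = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}

def contains(sequence, pattern):
    """True if the item sets of `pattern` occur in `sequence` in the same
    order (not necessarily adjacent), each as a subset of a transaction."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)

def supporters(pattern, sequences):
    """Sorted names of the customers whose sequence contains `pattern`."""
    return sorted(name for name, seq in sequences.items() if contains(seq, pattern))

for pattern in ([{"Beer"}, {"Brandy"}], [{"Beer"}, {"Wine", "Cider"}]):
    names = supporters(pattern, SEQUENCES)
    print(pattern, names, f"support={len(names) / len(SEQUENCES):.0%}")
```

Each of the two patterns is contained in two of the three customer sequences, so both clear the 40% support threshold.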
Web-Usage Mining
 Data Mining Techniques – Sequential Patterns
Web usage examples
 Within the past week, 30% of users who visited
/company/product/ had searched Google with ‘camera’ as the query text.
 60% of users who placed an online order in
/company/product1 also placed an order in /company/product4
within 15 days.
Web Usage Data Sources
 Server access logs
 Server Referrer logs
 Agent logs
 Client-side cookies
 User profiles
 Search engine logs
 Database logs
Web usage data is the record of the actions a user takes with the
mouse and keyboard while visiting a site.
Transfer / Access Log
 The transfer/access log contains detailed information about
each request that the server receives from users’ web
browsers.
 Recorded fields: time, date, hostname, file requested, amount of data
transferred, status of the request.
[Figure: client and server]
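A single access log entry can be split into the fields listed above. The slides do not name a log format, so this sketch assumes the widely used Common Log Format; the sample line and the name `parse_access_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of the logged fields, or None if the line does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    size = m.group("size")
    return {
        "hostname": m.group("host"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "file_requested": m.group("path"),
        "status": int(m.group("status")),
        # A "-" size means no body was sent; treat it as zero bytes.
        "bytes_transferred": 0 if size == "-" else int(size),
    }

# Hypothetical log line for illustration only.
SAMPLE = ('192.0.2.10 - - [21/Jun/2005:17:30:00 +0000] '
          '"GET /company/products HTTP/1.0" 200 2326')

print(parse_access_line(SAMPLE))
```

Parsing each line into a record like this is typically the first step of the data preparation stage, before cleaning and session identification.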
Agent Log
 The agent log lists the browsers (including version
number and platform) that people use to
connect to the server.
 Recorded fields: hostname, version number, platform.
[Figure: client and server]
Referrer Log
 If a user reaches one of the server’s pages by clicking on a link
from another site, the URL of that site appears in this
log.
 Recorded fields: URL, referrer URL.
[Figure: client and server]
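One simple use of the referrer log is to count which external sites send visitors. The sketch below assumes each entry has already been reduced to a (requested URL, referrer URL) pair; the entries and the name `count_referring_hosts` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (requested URL, referrer URL) pairs for illustration only.
REFERRER_LOG = [
    ("/company/products", "http://www.example.org/links.html"),
    ("/company/product1", "http://www.example.org/reviews.html"),
    ("/company", "http://search.example.com/?q=camera"),
]

def count_referring_hosts(entries):
    """Count how many requests each referring host is responsible for."""
    return Counter(urlparse(referrer).netloc for _, referrer in entries)

for host, hits in count_referring_hosts(REFERRER_LOG).most_common():
    print(host, hits)
```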
Error Log
 The error log keeps a record of errors and failed requests.
 A request may fail if the page contains links to a file that
does not exist or if the user is not authorized to access a
specific page or file.
[Figure: client and server]
Web Usage Mining Model
Any Questions?
