Web mining

A Presentation
on
Web Mining
Presented By
Tanjarul Islam Mishu
[@tanjarul26]
Dept. of CSE
Jatiya Kabi Kazi Nazrul Islam University
Spring 2006

Overview
 Web Mining
 Opportunities and Challenges
 Data Mining vs. Web Mining
 Classification of Web Mining Techniques
 Web Content Mining Techniques
 Web Usage Data Sources
 Web Usage Mining Model

Web Mining
 Mining means extracting something useful or valuable
from a baser substance, such as mining gold from the
earth.
 Web mining is the process of using data mining
techniques and algorithms to extract information
directly from the Web.
 It uses Web documents and services,
Web content, hyperlinks and
server logs.

Opportunities and Challenges
 The amount of information on the Web is huge.
 The coverage of Web information
is very wide and diverse.
 Information of all types exists on the Web.
 Much of the Web information is semi-structured due to
the nested structure of HTML code.

Opportunities and Challenges
 Much of the Web information is
linked.
 Much of the Web information is
redundant.
 The Web is noisy: it is a mixture of many kinds of
information.
 The Web is dynamic.

Data Mining vs. Web Mining
 Traditional data mining
 Data is structured and relational
 Well-defined tables, columns, rows, keys, and constraints
 Web data mining
 Data is semi-structured and unstructured
 Data is readily available
 Data is rich in features and patterns

Classification of Web Mining Techniques
Web Mining
 Web Structure Mining
 Web Content Mining
 Web-Usage Mining

Web-Structure Mining
 Web structure mining is used to identify the
relationships between Web pages that are connected by
related information or by direct links.
 Structure mining is used to minimize two main problems of
the World Wide Web:
 Irrelevant search results
 Inability to index the vast
amount of information
provided on the Web
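
As a minimal sketch of the direct-link case, the hyperlinks of each page can be extracted and counted. The three pages and their URLs below are hypothetical, and counting in-links is only one simple way to describe link relationships, not a method taken from the slides.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def build_link_graph(pages):
    """Map each page URL to the list of URLs it links to."""
    return {url: extract_links(html) for url, html in pages.items()}


def count_inlinks(graph):
    """Count how many links point at each page."""
    inlinks = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            inlinks[target] += 1
    return dict(inlinks)


# Hypothetical three-page site, used only for illustration.
pages = {
    "/index": '<a href="/products">Products</a> <a href="/about">About</a>',
    "/products": '<a href="/index">Home</a> <a href="/about">About</a>',
    "/about": '<a href="/index">Home</a>',
}

graph = build_link_graph(pages)
inlinks = count_inlinks(graph)
```

Pages with many in-links are referenced often by other pages, which is one signal a search tool could use when ranking results.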

Web Content Mining
 The process of information or resource discovery
from the content of millions of sources across the
World Wide Web
 E.g. Web data contents: text, image, audio, video,
metadata and hyperlinks
 It is related to text mining
because much of
the Web content is text.

Web Content Mining Techniques
Web Content Mining
 Classification
 Clustering
 Association

Document Classification
 Supervised Learning
 Supervised learning is a machine learning technique for creating
a function from training data.
 The output of the function can be a predicted class label for the input
object (called classification).
 Techniques used are
 Nearest Neighbor Classifier
 Feature Selection
 Decision Tree
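
A minimal sketch of the nearest neighbor classifier for documents, assuming a bag-of-words representation and cosine similarity. The training documents and the labels "shopping" and "sports" are made up for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, training):
    """Return the label of the most similar training document (1-NN)."""
    query = vectorize(text)
    best_label, best_sim = None, -1.0
    for doc, label in training:
        sim = cosine(query, vectorize(doc))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical labelled training data.
training = [
    ("cheap camera lens sale", "shopping"),
    ("buy camera online discount", "shopping"),
    ("football match final score", "sports"),
    ("cricket team wins match", "sports"),
]

label_a = classify("camera discount today", training)
label_b = classify("cricket match score", training)
```

Here the training data plays the role of the function's "training examples": a new document simply takes the class label of its closest labelled neighbor.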

Document Clustering
 Unsupervised learning: a data set of input objects is
gathered, without class labels.
 Goal: develop measures of similarity to cluster a collection
of documents/terms into groups such that similarity
within a cluster is larger than across clusters.
 Hypothesis: given a ‘suitable’ clustering of a collection, if
the user is interested in document/term d/t, he or she is likely to
be interested in other members of the cluster to which d/t
belongs.
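
A minimal sketch of clustering by similarity, assuming Jaccard similarity on word sets and a simple single-pass (leader) clustering scheme. Both choices, the threshold of 0.25, and the four example documents are assumptions for illustration; the slides do not name a specific algorithm.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def words(doc):
    return set(doc.lower().split())


def cluster(docs, threshold=0.25):
    """Single-pass clustering: each document joins the most similar
    existing cluster (compared with that cluster's first document),
    or starts a new cluster if no similarity reaches the threshold."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best, best_sim = None, threshold
        for members in clusters:
            sim = jaccard(words(doc), words(docs[members[0]]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Hypothetical documents.
docs = [
    "web mining extracts patterns",
    "cricket team wins match",
    "web usage mining patterns",
    "football team wins final match",
]

clusters = cluster(docs)
```

On these documents the two Web mining texts end up in one group and the two sports texts in another, so a user interested in one document is shown its cluster neighbors.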

Association
Example: Supermarket
Transaction ID    Items Purchased
1                 butter, bread, tea
2                 bread, tea, sugar, egg
3                 diaper
…                 …
 An association rule can be:
“If a customer buys tea, in 50% of cases he/she also
buys sugar. This happens in 33% of all transactions.”
 50%: confidence
 33%: support
 Association rules can also be applied to hyperlinks.
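
The support and confidence figures can be checked with a short sketch. It assumes that only the three transactions listed in the table are counted (the elided rows are ignored), and the function names are illustrative.

```python
# The three transactions shown in the table; elided rows are not included.
transactions = [
    {"butter", "bread", "tea"},
    {"bread", "tea", "sugar", "egg"},
    {"diaper"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


rule_support = support({"tea", "sugar"}, transactions)
rule_confidence = confidence({"tea"}, {"sugar"}, transactions)
```

Tea and sugar appear together in 1 of 3 transactions (support about 33%), and tea appears in 2 of 3, so the confidence of "tea implies sugar" is 1/2 = 50%.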

Web-Usage Mining
 What is Usage Mining?
Discovering user ‘navigation patterns’ from web data.
Prediction of user behavior while the user interacts
with the web.

Web-Usage Mining
 Usage Mining Techniques
 Data Preparation: Data Collection, Data Selection, Data Cleaning
 Data Mining: Navigation Patterns, Sequential Patterns

Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
[Figure: Web page hierarchy of a Web site, with pages A, B, C, D and E]

Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
Analysis example:
 70% of users who accessed /company/product2 did so by starting
at /company and proceeding through /company/new,
/company/products and /company/product1.
 80% of users who accessed the site started from
/company/products.
 65% of users left the site after
four or fewer page references.
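
Statistics of this kind can be computed from user sessions, as in the sketch below. The four sessions are hypothetical, so the resulting percentages differ from the figures quoted on the slide; only the page paths are taken from the example.

```python
# Hypothetical sessions: each is the ordered list of pages one user visited.
sessions = [
    ["/company", "/company/new", "/company/products",
     "/company/product1", "/company/product2"],
    ["/company/products", "/company/product2"],
    ["/company/products", "/company/product1"],
    ["/company", "/company/new"],
]


def share(sessions, predicate):
    """Fraction of sessions for which predicate(session) is true."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if predicate(s)) / len(sessions)


def follows_path(session, path):
    """True if the pages of path occur in session in the same order
    (other pages may appear in between)."""
    remaining = iter(session)
    return all(page in remaining for page in path)


path = ["/company", "/company/new", "/company/products",
        "/company/product1", "/company/product2"]

# Share of sessions that entered the site at /company/products.
started_at_products = share(sessions, lambda s: s[0] == "/company/products")

# Share of sessions with four or fewer page references.
short_visits = share(sessions, lambda s: len(s) <= 4)

# Among sessions that reached /company/product2, share that followed path.
reached = [s for s in sessions if "/company/product2" in s]
via_path = share(reached, lambda s: follows_path(s, path))
```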

Web-Usage Mining cont…
 Data Mining Techniques – Sequential Patterns
Example: Supermarket
Customer    Transaction Time     Purchased Items
John        6/21/05 5:30 pm      Beer
John        6/22/05 10:20 pm     Brandy
Frank       6/20/05 10:15 am     Juice, Coke
Frank       6/20/05 11:50 am     Beer
Frank       6/20/05 12:50 pm     Wine, Cider
Mary        6/20/05 2:30 pm      Beer
Mary        6/21/05 6:17 pm      Wine, Cider
Mary        6/22/05 5:05 pm      Brandy
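
Before sequential patterns can be mined, the transactions are grouped per customer and ordered by time. The sketch below shows one way to do that. It assumes the timestamp format month/day/year with a 12-hour clock, and it assumes Frank's third purchase took place at 12:50 pm so that it follows his 11:50 am purchase.

```python
from collections import defaultdict
from datetime import datetime

# (customer, transaction time, purchased items)
transactions = [
    ("John", "6/21/05 5:30 pm", ["Beer"]),
    ("John", "6/22/05 10:20 pm", ["Brandy"]),
    ("Frank", "6/20/05 10:15 am", ["Juice", "Coke"]),
    ("Frank", "6/20/05 11:50 am", ["Beer"]),
    ("Frank", "6/20/05 12:50 pm", ["Wine", "Cider"]),  # assumed pm
    ("Mary", "6/20/05 2:30 pm", ["Beer"]),
    ("Mary", "6/21/05 6:17 pm", ["Wine", "Cider"]),
    ("Mary", "6/22/05 5:05 pm", ["Brandy"]),
]


def customer_sequences(transactions):
    """Group transactions by customer and order each group by time."""
    by_customer = defaultdict(list)
    for customer, stamp, items in transactions:
        when = datetime.strptime(stamp, "%m/%d/%y %I:%M %p")
        by_customer[customer].append((when, tuple(items)))
    return {
        customer: [items for _, items in sorted(visits, key=lambda v: v[0])]
        for customer, visits in by_customer.items()
    }


sequences = customer_sequences(transactions)
```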

Web-Usage Mining cont…
 Customer sequences
Customer    Customer Sequence
John        (Beer) (Brandy)
Frank       (Juice, Coke) (Beer) (Wine, Cider)
Mary        (Beer) (Wine, Cider) (Brandy)
 Mining result
Sequential Patterns with Support >= 40%    Supporting Customers
(Beer) (Brandy)                            John, Mary
(Beer) (Wine, Cider)                       Frank, Mary
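
The support of a sequential pattern can be checked with a short sketch: a customer supports a pattern if the pattern's itemsets appear, in order, inside that customer's sequence. The function names are illustrative, and the 40% threshold is the one used in the example.

```python
# Customer sequences from the example; each purchase is a set of items.
sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}


def contains(sequence, pattern):
    """True if each itemset of pattern is a subset of some purchase in
    sequence, with the purchases occurring in the same order."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def supporters(pattern, sequences):
    """Names of the customers whose sequence contains the pattern."""
    return sorted(c for c, seq in sequences.items() if contains(seq, pattern))


def support(pattern, sequences):
    """Fraction of customers supporting the pattern."""
    return len(supporters(pattern, sequences)) / len(sequences)


beer_brandy = [{"Beer"}, {"Brandy"}]
beer_wine_cider = [{"Beer"}, {"Wine", "Cider"}]

brandy_customers = supporters(beer_brandy, sequences)
wine_customers = supporters(beer_wine_cider, sequences)
MIN_SUPPORT = 0.4
```

Run on the three customer sequences, the sketch reports John and Mary for (Beer) (Brandy), since Frank's sequence contains no Brandy purchase, and Frank and Mary for (Beer) (Wine, Cider). Both patterns have support 2/3, which is above the 40% threshold.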

Web-Usage Mining
Web usage examples
 In Google search, within the past week 30% of users who visited
/company/product/ had ‘camera’ as the search text.
 60% of users who placed an online order in
/company/product1 also placed an order in /company/product4
within 15 days.

Web Usage Data Sources
 Server access logs
 Server Referrer logs
 Agent logs
 Client-side cookies
 User profiles
 Search engine logs
 Database logs
Web usage data: the record of what actions a user takes with his or her
mouse and keyboard while visiting a site.

Transfer / Access Log
 The transfer/access log contains detailed information about
each request that the server receives from users’ web
browsers.
Fields recorded: Time, Date, Hostname, File Requested,
Amount of Data Transferred, Status of the Request
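
A minimal sketch of reading these fields from one access log entry. It assumes the server writes the widely used Common Log Format; the slides do not state the format, and the sample line (host, time, path and size) is hypothetical.

```python
import re
from datetime import datetime

# Common Log Format (assumed):
# host ident user [time] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return the fields of one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    size = match.group("size")
    return {
        "hostname": match.group("host"),
        "date": when.date().isoformat(),
        "time": when.time().isoformat(),
        "file_requested": match.group("path"),
        "status": int(match.group("status")),
        "bytes_transferred": 0 if size == "-" else int(size),
    }


# Hypothetical log entry.
sample = ('192.0.2.10 - - [20/Jun/2005:10:15:00 +0000] '
          '"GET /company/products HTTP/1.0" 200 2326')
entry = parse_line(sample)
```

Each parsed entry carries the six fields listed above, which is the raw material for the data preparation steps of usage mining.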

Agent Log
 The agent log lists the browsers (including version
number and the platform) that people are using to
connect to your server.
Fields recorded: Hostname, Version Number, Platform

Referrer Log
 If a user reaches one of the server’s pages by clicking on a link
from another site, the URL of that site will appear in this
log.
Fields recorded: URL, Referrer URL

Error Log
 The error log keeps a record of errors and failed requests.
 A request may fail if the page contains links to a file that
does not exist or if the user is not authorized to access a
specific page or file.
