Web mining applies data mining techniques
to extract and uncover knowledge from web
documents and services.
It uses data mining techniques to make the web
more useful and more profitable and to
increase the efficiency of our interaction with the web.
Web: A huge, widely-distributed, highly
The Web is a huge collection of documents plus:
– Hyper-link information
– Access and usage information
Discovery of useful information from web
contents/data/documents.
Information Retrieval view.
Researchers proposed methods of using citations
among journal articles to evaluate the quality of
Customer behavior – evaluate the quality of a product
based on the opinions of other customers (instead of
the product’s description or advertisement).
Web usage mining is also known as Web log mining.
Discovery of meaningful patterns from data
generated by client-server transactions (or) from Web
Typical Sources of Data:
automatically generated data stored in server access logs,
referrer logs, agent logs, and client-side cookies.
metadata: page attributes, content attributes, usage data.
Generate simple statistical reports:
A summary report of hits and bytes transferred
A list of top requested URLs
A list of top referrers
A list of most common browsers used
Hits per hour/day/week/month reports
Hits per domain reports
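The reports listed above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the server writes Common Log Format; the sample log lines, the `summarize` function, and the regular expression are hypothetical and would need adapting to a real server's log layout.

```python
import re
from collections import Counter

# Hypothetical sample lines in Common Log Format (made-up data).
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120
192.0.2.1 - - [10/Oct/2023:14:02:10 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.3 - - [10/Oct/2023:14:05:47 +0000] "GET /missing.html HTTP/1.1" 404 -
"""

# Assumed line layout: host ident user [time] "method url protocol" status bytes
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def summarize(log_text):
    """Return hits, bytes transferred, top URLs, and hits per hour."""
    hits = 0
    total_bytes = 0
    urls = Counter()
    hits_per_hour = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not match the assumed format
        hits += 1
        if m.group("bytes") != "-":  # "-" means no body was sent
            total_bytes += int(m.group("bytes"))
        urls[m.group("url")] += 1
        # The timestamp looks like 10/Oct/2023:13:55:36 +0000,
        # so the hour is the field after the first colon.
        hits_per_hour[m.group("time").split(":")[1]] += 1
    return {
        "hits": hits,
        "bytes": total_bytes,
        "top_urls": urls.most_common(3),
        "hits_per_hour": dict(hits_per_hour),
    }

report = summarize(SAMPLE_LOG)
print(report)
```

Top referrers and most common browsers would follow the same counting pattern, but they need the referrer and user-agent fields, which the plain Common Log Format assumed here does not carry.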
Who is visiting your site
The path visitors take through your pages
How much time visitors spend on each page
The most common starting page
Where visitors are leaving your site
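As a rough sketch of how these questions might be answered from log data, the example below groups requests by visitor and derives paths, entry pages, exit pages, and average time per page. The `REQUESTS` data and the `analyze_paths` function are hypothetical, and the sketch assumes each visitor makes exactly one session; real logs usually need a session timeout and a better visitor identifier than this.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed requests: (visitor_id, seconds_since_start, url).
REQUESTS = [
    ("a", 0, "/home"), ("a", 30, "/products"), ("a", 90, "/checkout"),
    ("b", 5, "/home"), ("b", 25, "/about"),
    ("c", 10, "/products"), ("c", 70, "/checkout"),
]

def analyze_paths(requests):
    """Return per-visitor paths, entry counts, exit counts, and average dwell time."""
    by_visitor = defaultdict(list)
    for visitor, ts, url in requests:
        by_visitor[visitor].append((ts, url))

    paths = {}
    entries = Counter()   # most common starting pages
    exits = Counter()     # pages where visitors leave
    dwell = defaultdict(list)
    for visitor, visits in by_visitor.items():
        visits.sort()  # order each visitor's requests by time
        urls = [url for _, url in visits]
        paths[visitor] = urls
        entries[urls[0]] += 1
        exits[urls[-1]] += 1
        # Time on a page is estimated as the gap until the next request;
        # the time spent on the last page of a visit cannot be observed.
        for (t0, url), (t1, _) in zip(visits, visits[1:]):
            dwell[url].append(t1 - t0)

    avg_dwell = {url: sum(ts) / len(ts) for url, ts in dwell.items()}
    return paths, entries, exits, avg_dwell

paths, entries, exits, avg_dwell = analyze_paths(REQUESTS)
print(paths["a"], entries.most_common(1), exits.most_common(1), avg_dwell)
```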
The web log is filtered to generate a relational database.
A data cube is generated from the database.
OLAP is used to drill down and roll up in the cube.
(Diagram: Web log -> Database -> Data cube -> Sliced and ...)
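The filter, cube, and OLAP steps can be illustrated with a toy in-memory example. This is a sketch only: the `ROWS` table, the dimension names in `DIMS`, and the `aggregate` helper are hypothetical stand-ins for a real relational database and OLAP engine.

```python
from collections import defaultdict

# Hypothetical relational rows produced by filtering a web log:
# (day, hour, domain, url, hits)
ROWS = [
    ("2023-10-10", 13, "example.edu", "/index.html", 3),
    ("2023-10-10", 13, "example.com", "/index.html", 2),
    ("2023-10-10", 14, "example.edu", "/products.html", 4),
    ("2023-10-11", 9, "example.com", "/index.html", 1),
]
DIMS = ("day", "hour", "domain", "url")

def aggregate(rows, dims):
    """Sum hits over the chosen dimensions (one cuboid of the data cube)."""
    idx = [DIMS.index(d) for d in dims]
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        cells[key] += row[-1]  # the last column is the hit count
    return dict(cells)

# Roll-up: view hits at the coarse "day" level.
by_day = aggregate(ROWS, ("day",))
# Drill-down: add the "hour" dimension for finer detail.
by_day_hour = aggregate(ROWS, ("day", "hour"))
# Slice: fix one dimension value (domain) before aggregating.
edu_by_day = aggregate([r for r in ROWS if r[2] == "example.edu"], ("day",))

print(by_day, by_day_hour, edu_by_day)
```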
Hyperlinks can infer
the notion of
HITS stands for Hyperlink-Induced Topic Search.
It explores interactions between hubs and authoritative pages.
Expand the root set into a base set.
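A minimal sketch of the hub/authority iteration may help here. It assumes a tiny hypothetical link graph (`LINKS`) and made-up page names; `expand_root_set` is a simplified expansion that adds every page linked to or from the root set, with no cap on the number of in-linking pages.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
    "auth1": [],
    "auth2": [],
}

def expand_root_set(root, links):
    """Grow a root set into a base set by following links in both directions."""
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)  # pages the root set points to
        if any(t in root for t in targets):
            base.add(page)        # pages that point into the root set
    return base

def hits(links, iterations=20):
    """Iteratively compute hub and authority scores (L2-normalized)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                new_auth[t] += hub[page]
        # Hub score: sum of authority scores of pages it links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

base = expand_root_set({"auth1"}, LINKS)
hub, auth = hits(LINKS)
print(sorted(base))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the page linked to by the most hubs ends up with the highest authority score, and pages that link to several good authorities end up with the highest hub scores, which is the mutual reinforcement that HITS relies on.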
Systems based on HITS-style link analysis:
- e.g., Google (its PageRank algorithm is a related but distinct link-analysis method).
Difficulties from ignoring textual contexts:
- Drifting: when hubs contain multiple topics.
- Topic hijacking: when many pages from a single web
site point to the same single popular site.
Improve web server system performance.
Improve site design.
Predict users’ actions.
Enhance the quality and delivery of Internet
information services to the end user.
Facilitate adaptive sites/personalization.