10/9/2013 1
Web mining applies data mining techniques
to extract and uncover knowledge from web
documents and services.
Its goal is to make the web more useful and
more profitable, and to make our interaction
with the web more efficient.
Web: A huge, widely-distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected
information repository.
The Web is a huge collection of documents plus:
– Hyperlink information
– Access and usage information
Resource Finding.
Information selection & Pre-processing.
Generalization.
Analysis.
WEB MINING
– WEB CONTENT MINING
  – WEB PAGE CONTENT MINING
  – SEARCH RESULT MINING
– WEB STRUCTURE MINING
– WEB USAGE MINING
  – GENERAL ACCESS PATTERN TRACKING
  – CUSTOMIZED USAGE TRACKING
Web content mining: discovery of useful information
from web contents, data, and documents.
– Information retrieval view
– Database view
Researchers proposed methods of using citations
among journal articles to evaluate the quality of
research papers.
Customer behavior: evaluate the quality of a product
based on the opinions of other customers (instead of
the product's description or advertisement).
Web usage mining is also known as web log mining.
DEFINITION
Discovery of meaningful patterns from data
generated by client-server transactions or from web
server logs.
Typical sources of data:
– Automatically generated data stored in server access logs,
  referrer logs, agent logs, and client-side cookies
– User profiles
– Metadata: page attributes, content attributes, usage data
Generate simple statistical reports:
– A summary report of hits and bytes transferred
– A list of top requested URLs
– A list of top referrers
– A list of most common browsers used
– Hits per hour/day/week/month reports
– Hits per domain reports
Learn:
– Who is visiting your site
– The path visitors take through your pages
– How much time visitors spend on each page
– The most common starting page
– Where visitors are leaving your site
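Several of the reports above can be sketched in a few lines of Python. This is a minimal illustration, not a full log analyzer: it assumes the server writes the Common Log Format, treats each client host as one visitor, and uses made-up sample lines. The names `LOG_PATTERN`, `parse_line`, `summarize`, and `SAMPLE_LOG` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumed input: Common Log Format,
# host ident user [time] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
    return {"host": m.group("host"), "time": when,
            "url": m.group("url"), "bytes": size}


def summarize(lines):
    """Build a few of the simple reports from raw log lines."""
    hits = [h for h in map(parse_line, lines) if h is not None]
    # Assumption: one host == one visitor; real tools also use cookies/sessions.
    paths = defaultdict(list)
    for h in sorted(hits, key=lambda h: h["time"]):
        paths[h["host"]].append(h["url"])
    return {
        "hits": len(hits),
        "bytes": sum(h["bytes"] for h in hits),
        "top_urls": Counter(h["url"] for h in hits).most_common(3),
        "hits_per_hour": Counter(h["time"].hour for h in hits),
        "paths": dict(paths),
        "entry_pages": Counter(p[0] for p in paths.values()),
        "exit_pages": Counter(p[-1] for p in paths.values()),
    }


# Hypothetical sample data for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [09/Oct/2013:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [09/Oct/2013:10:00:31 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [09/Oct/2013:11:15:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [09/Oct/2013:11:16:00 +0000] "GET /contact.html HTTP/1.1" 200 -',
]

report = summarize(SAMPLE_LOG)
print(report["hits"], report["bytes"])
print(report["top_urls"])
print(report["entry_pages"], report["exit_pages"])
```

Time spent per page could be estimated in the same way from the gap between consecutive timestamps in each visitor's path.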
The web log is filtered to generate a relational database.
A data cube is generated from the database.
OLAP is used to drill down and roll up in the cube.
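The cube operations can be illustrated with a small in-memory sketch. It assumes the cleaned log has already been reduced to records with three dimensions (day, hour, page) and a hit count as the only measure. The records and the names `DIMENSIONS`, `build_cube`, `roll_up`, and `slice_cube` are hypothetical; a real system would use a database or OLAP engine instead.

```python
from collections import Counter

# Assumed dimensions of the cube; the measure is the number of hits.
DIMENSIONS = ("day", "hour", "page")

# Hypothetical cleaned records, one per request.
RECORDS = [
    {"day": "2013-10-09", "hour": 10, "page": "/index.html"},
    {"day": "2013-10-09", "hour": 10, "page": "/products.html"},
    {"day": "2013-10-09", "hour": 11, "page": "/index.html"},
    {"day": "2013-10-10", "hour": 10, "page": "/index.html"},
]


def build_cube(records):
    """Count hits for every combination of dimension values (the base cuboid)."""
    return Counter(tuple(r[d] for d in DIMENSIONS) for r in records)


def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = Counter()
    for cell, hits in cube.items():
        out[tuple(cell[i] for i in idx)] += hits
    return out


def slice_cube(cube, dim, value):
    """Keep only the cells where dimension `dim` equals `value`."""
    i = DIMENSIONS.index(dim)
    return Counter({cell: hits for cell, hits in cube.items() if cell[i] == value})


cube = build_cube(RECORDS)
by_day = roll_up(cube, ["day"])               # coarse view: hits per day
by_day_hour = roll_up(cube, ["day", "hour"])  # finer view: hits per day and hour
index_only = slice_cube(cube, "page", "/index.html")
print(by_day)
print(by_day_hour)
```

Moving from `by_day` to `by_day_hour` is a drill-down; the opposite direction is a roll-up.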
Web log -> (data cleaning) -> Database -> (data cube creation) -> Data cube -> (OLAP) -> Sliced and diced cube -> (data mining) -> Knowledge patterns
Hubs.
Authorities.
Mutually reinforcing relationship.
Finding authoritative web pages.
Hyperlinks can be used to infer the notion of authority.
[Figure: Hub-Authority Relations, showing hubs and authorities]
HITS stands for Hyperlink-Induced Topic Search.
It explores the interactions between hubs and authoritative
pages.
Expand the root set into a base set.
Apply weight propagation.
Systems based on link analysis:
- e.g. Clever (HITS) and Google (PageRank, a related algorithm).
Difficulties from ignoring textual contexts:
- Drifting: when hubs contain multiple topics.
- Topic hijacking: when many pages from a single web
  site point to the same single popular site.
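The root-set expansion and weight-propagation steps of HITS can be sketched as follows. This is a simplified illustration on a tiny made-up link graph, assuming the root set comes from some earlier text search. The names `expand_root_set`, `hits`, `LINKS`, and `ROOT` are hypothetical, and the fixed iteration count stands in for a proper convergence test.

```python
import math


def expand_root_set(root, links):
    """Base set = root pages, pages they link to, and pages linking to them."""
    base = set(root)
    for src, targets in links.items():
        if src in root:
            base.update(targets)        # pages the root set points to
        if any(t in root for t in targets):
            base.add(src)               # pages that point into the root set
    return base


def hits(links, pages, iterations=20):
    """Propagate hub and authority weights over the pages in the base set."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src in pages:
            for dst in links.get(src, ()):
                if dst in pages:
                    auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, ()) if d in pages)
               for p in pages}
        # Normalize both vectors so the weights stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
ROOT = {"h1", "a1"}  # assumed result of an initial text search

base = expand_root_set(ROOT, LINKS)
hub, auth = hits(LINKS, base)
print(sorted(base))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Good hubs point to good authorities and good authorities are pointed to by good hubs, which is the mutually reinforcing relationship the two update steps encode. Because the sketch never looks at page text, it is open to the drifting and topic-hijacking problems listed above.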
Improve web server system performance.
Improve site design.
Detect intrusions.
Predict users' actions.
Enhance the quality and delivery of internet
information services to the end user.
Facilitate adaptive sites and personalization.

Web mining
