Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services:
 Discovering useful information from the World Wide Web and its usage patterns
 Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web usage mining is the process of extracting useful information from server logs, i.e. finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from web data, in order to understand and better serve the needs of web-based applications.
Web Mining
 Web content mining
  Web page content mining
  Search result mining
 Web structure mining
 Web usage mining
  General access pattern tracking
  Customized usage tracking
 Data Mining Techniques
 Association rules (see the sketch after this list)
 Sequential patterns
 Classification
 Clustering
 Outlier discovery
 Applications to the Web
 E-commerce
 Information retrieval (search)
 Network management
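A minimal sketch of association rule mining in Python, run over hypothetical web-session "baskets"; all data, names, and thresholds here are illustrative, not a specific system's API:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, e.g. topics touched per user session.
transactions = [
    {"ski", "europe", "hotel"},
    {"ski", "europe"},
    {"ski", "snowboard"},
    {"europe", "hotel"},
]

def rules(transactions, min_support=0.5, min_confidence=0.6):
    """Yield (A -> B, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(frozenset(p) for t in transactions
                         for p in combinations(sorted(t), 2))
    for pair, c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for a in pair:
            b = next(iter(pair - {a}))
            confidence = c / item_count[a]
            if confidence >= min_confidence:
                yield f"{a} -> {b}", support, confidence

for rule in rules(transactions):
    print(rule)
```

Rules like "ski -> europe" are exactly the kind of nugget ("people who ski also travel frequently to Europe") mentioned below.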
The WWW is a huge, widely distributed, global information service centre for:
 Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
 Hyper-link information
 Access and usage information
The WWW therefore provides rich sources of data for data mining.
 Enormous wealth of information on Web
 Financial information (e.g. stock quotes)
 Book/CD/Video stores (e.g. Amazon)
 Restaurant information (e.g. Zagat's)
 Car prices (e.g. CarPoint)
 Lots of data on user access patterns
 Web logs contain the sequences of URLs accessed by users
 Possible to mine interesting nuggets of information
 People who ski also travel frequently to Europe
 Tech stocks have corrections in the summer and rally from
November until February
 The Web is more than a huge collection of documents; it also has
 Hyper-link information
 Access and usage information
 The Web is very dynamic
 New pages are constantly being generated
 Challenge: Develop new Web mining algorithms and
adapt traditional data mining algorithms to
 Exploit hyper-links and access patterns
 Be incremental
 Given:
 A source of textual documents
 A well-defined, limited query (text-based)
 Find:
 Sentences with relevant information
 Extract the relevant information and
ignore non-relevant information (important!)
 Link related information and output in a
predetermined format
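A minimal sketch of this extraction task, assuming a plain-text document and a keyword query; the function and data are illustrative, not a specific system:

```python
import re

def extract(document, query_terms):
    """Return structured records for sentences that match the query terms."""
    # Naive sentence splitting; a real system would use proper NLP tooling.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    records = []
    for s in sentences:
        words = set(re.findall(r"\w+", s.lower()))
        hits = words & query_terms
        if hits:  # keep relevant sentences, ignore non-relevant ones
            records.append({"sentence": s, "matched": sorted(hits)})
    return records

doc = "Tech stocks rallied in November. The weather was mild. Investors bought tech shares."
for rec in extract(doc, {"tech", "stocks", "shares"}):
    print(rec)
```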
 Keyword (or term) based association analysis
 Automatic document (topic) classification
 Similarity detection (see the sketch below)
  cluster documents by a common author
  cluster documents containing information from a common source
 Sequence analysis: predicting a recurring event, discovering trends
 Anomaly detection: finding information that violates usual patterns
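One simple way to realise similarity detection is cosine similarity over term-frequency vectors; a minimal sketch, with illustrative tokenisation and data:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = tf_vector("web mining extracts patterns from web data")
d2 = tf_vector("data mining finds patterns in large data sets")
print(f"similarity: {cosine(d1, d2):.3f}")
```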
The web usage mining process:
Raw server log → Pre-processing → User session file → Pattern discovery → Rules and patterns → Pattern analysis → Interesting knowledge
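A minimal sessionization sketch for the pre-processing stage, assuming already-parsed log entries and the common 30-minute inactivity heuristic; all data here is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical pre-parsed log entries: (client IP, timestamp, requested URL).
log = [
    ("10.0.0.1", datetime(2024, 1, 1, 10, 0), "/home"),
    ("10.0.0.1", datetime(2024, 1, 1, 10, 5), "/products"),
    ("10.0.0.1", datetime(2024, 1, 1, 11, 30), "/home"),  # new session after the gap
    ("10.0.0.2", datetime(2024, 1, 1, 10, 2), "/about"),
]

TIMEOUT = timedelta(minutes=30)  # common heuristic for a session boundary

def sessionize(entries):
    """Group requests into per-user sessions split on 30 minutes of inactivity."""
    by_user = defaultdict(list)
    for ip, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        sessions = by_user[ip]
        if sessions and ts - sessions[-1][-1][0] <= TIMEOUT:
            sessions[-1].append((ts, url))  # continue the current session
        else:
            sessions.append([(ts, url)])    # start a new session
    return by_user

for ip, sessions in sessionize(log).items():
    print(ip, [[url for _, url in s] for s in sessions])
```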
Creating a model of web organization:
 Classify web pages
 Create similarity measures between web pages
Key techniques:
 PageRank
 The Clever system
 Hyperlink-Induced Topic Search (HITS)
 Combine intelligent IR tools:
 meaning of words
 order of words in the query
 user dependency of the data
 authority of the source
 with unique web features:
 retrieving hyper-link information
 utilizing hyper-links as input
A web crawler is a program which browses the WWW in a methodical, automated manner:
 copies pages into a cache and indexes them
 starts from a seed URL
 searches pages and finds links and keywords
Types of crawler (a minimal sketch follows):
 Context-focused
 Focused
 Incremental
 Periodic
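A minimal breadth-first crawler sketch using only the Python standard library; the seed URL, page limit, and in-memory cache are illustrative choices:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl from a seed URL, caching page HTML in memory."""
    queue, seen, cache = deque([seed]), {seed}, {}
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or non-decodable pages
        cache[url] = html  # "copy in cache"; a real crawler would also index terms
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return cache

# pages = crawl("https://example.com")  # hypothetical seed URL
```

A focused or context-focused crawler would additionally score each candidate link against its topic before enqueueing it; an incremental crawler revisits pages to refresh the cache.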
PageRank is a link analysis algorithm which assigns a numerical weight to each web page. The weight assigned to a given element E is called the PageRank of E, denoted PR(E). The PageRank value for a page u depends on the PageRank values of each page v in the set B_u (the set of all pages linking to u), each divided by the number L(v) of out-links from page v:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)
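A small Python sketch of the iterative computation; here the usual damping factor d = 0.85 is included (the formula above is the undamped special case), and the graph is hypothetical:

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterate PR(u) = (1-d)/N + d * sum(PR(v)/L(v) for v in B_u).

    `links` maps each page to the pages it links to. An illustrative
    sketch; real systems use sparse matrices and convergence tests.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for u in pages:
            # B_u: pages v that link to u, each contributing PR(v) / L(v)
            incoming = sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            new[u] = (1 - damping) / n + damping * incoming
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C collects rank from both A and B
```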
 Increases the effectiveness of search engines
 Based on the number of back-links
 The rank sink problem exists: a group of pages that link only to each other accumulates rank without passing it back out (damping mitigates this)
Hyperlink-Induced Topic Search (HITS):
 Finds both authoritative pages and hubs
 Authority: the best source for the topic
 Hub: a page that links to authoritative pages
 The most valuable pages are returned
 Uses keywords together with authority and hub measures
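A minimal HITS sketch in Python, computing hub and authority scores by mutual reinforcement over a hypothetical link graph:

```python
import math

def hits(links, iters=50):
    """Hub/authority scores by mutual reinforcement.

    authority(p) = sum of hub scores of pages linking to p
    hub(p)       = sum of authority scores of pages p links to
    Scores are normalised each round. Illustrative sketch only.
    """
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
hub, auth = hits(graph)
print("hubs:", hub)
print("authorities:", auth)
```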
Web usage mining applies mining to web usage data, weblogs, or clickstream data:
 from the client perspective
 from the server perspective
It aids in personalization and helps in evaluating the quality and effectiveness of a site. Key components: preprocessing, pattern discovery, and suitable data structures (a sketch follows).
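Given sessionized clickstreams (see the earlier pre-processing sketch), a minimal pattern discovery step is to count frequent page-to-page transitions; the sessions and threshold here are hypothetical:

```python
from collections import Counter

# Hypothetical sessionized clickstreams (one URL sequence per session).
sessions = [
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/products", "/products/42"],
    ["/home", "/about"],
]

def frequent_transitions(sessions, min_support=2):
    """Count consecutive page pairs across sessions; keep the frequent ones."""
    counts = Counter(pair for s in sessions for pair in zip(s, s[1:]))
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(frequent_transitions(sessions))
# {('/home', '/products'): 2} -- users often go from /home straight to /products
```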