• Save
Webmining Overview
Upcoming SlideShare
Loading in...5

Webmining Overview



An introduction to web mining

An introduction to web mining



Total Views
Views on SlideShare
Embed Views



2 Embeds 5

http://dataminingtools.net 3
http://www.slideshare.net 2



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment
  • Google’s usage logs are much bigger than their web crawl! Order of magnitude: terabytes per day No human in the loop
  • Infinite number of pages because of dynamically generated content Lots of marketing hype around search engine index size claims

Webmining Overview Webmining Overview Presentation Transcript

  • Introduction to Web Mining
  • What is Web Mining?
    • Discovering useful information from the World Wide Web ( such as web pages, internet related data and so forth)
    • Example of applications:
    • User patterns analysis
    • Web page link analysis
    • And more
  • Web Mining
      • Involves Textual information and linkage structure analysis
      • Peta bytes of data generated per day is comparable to largest conventional data warehouses in world
      • Often need to react to evolving usage patterns in real-time (e.g., merchandising) and also accommodate the changes.
  • Topics related to web mining
    • Web graph analysis
    • Power Laws and The Long Tail
    • Structured data extraction
    • Web advertising
    • Systems Issues, User analysis
    • Social network analysis, blog analysis
  • Size of the Web
    • Number of pages
      • Technically, infinite
      • Much duplication (30-40%)
      • Growing everyday
      • Best estimate of “unique” static HTML pages comes from search engine claims
        • Google recently announced that their index contains 1 trillion pages
  • The web as a graph
    • Pages = nodes, hyperlinks = edges
      • Ignore content
      • Directed graph
    • High linkage
      • 10-20 links/page on average
      • Power-law degree distribution
  • Web graph
    • Let’s take a closer look at structure
      • Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
    • Distinguish “important” pages from unimportant ones
      • Page rank
    • Discover communities of related pages
      • Hubs and Authorities
    • Detect web spam
      • Trust rank
  • Searching the Web Content consumers Content aggregators The Web
  • Two Approaches to Analyzing Data
    • Machine Learning approach
      • Emphasizes sophisticated algorithms e.g., Support Vector Machines
      • Data sets tend to be small, fit in memory
    • Data Mining approach
      • Emphasizes big data sets (e.g., in the terabytes)
      • Data cannot even fit on a single disk!
      • Necessarily leads to simpler algorithms
  • The future: Very Large-Scale Data Mining … Mem Disk CPU Mem Disk CPU Mem Disk CPU
  • Visit more self help tutorials
    • Pick a tutorial of your choice and browse through it at your own pace.
    • The tutorials section is free, self-guiding and will not involve any additional support.
    • Visit us at www.dataminingtools.net