Introduction to Web Mining
What is Web Mining? Discovering useful information from the World Wide Web ( such as web pages, internet related data and so forth) Example of applications:  User patterns analysis Web page link analysis And more
Web Mining Involves Textual information and linkage structure analysis Peta bytes of data generated per day is comparable to largest conventional data warehouses in world Often need to react to evolving usage patterns in real-time (e.g., merchandising) and also accommodate the changes.
Topics related to web mining Web graph analysis Power Laws and The Long Tail Structured data extraction Web advertising  Systems Issues, User analysis Social network analysis, blog analysis
Size of the Web Number of pages Technically, infinite Much duplication (30-40%) Growing everyday Best estimate of “unique” static HTML pages comes from search engine claims Google recently announced that their index contains 1 trillion pages
The web as a graph Pages = nodes, hyperlinks = edges Ignore content Directed graph High linkage 10-20 links/page on average Power-law degree distribution
Web graph Let’s take a closer look at structure Broder et al (2000) studied a crawl of 200M pages and other smaller crawls Distinguish “important” pages from unimportant ones Page rank Discover communities of related pages  Hubs and Authorities Detect web spam Trust rank
Searching the Web Content consumers Content aggregators The Web
Two Approaches to Analyzing Data Machine Learning approach Emphasizes sophisticated algorithms e.g., Support Vector Machines Data sets tend to be small, fit in memory Data Mining approach Emphasizes big data sets (e.g., in the terabytes) Data cannot even fit on a single disk! Necessarily leads to simpler algorithms
The future: Very Large-Scale Data Mining … Mem Disk CPU Mem Disk CPU Mem Disk CPU
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Webmining Overview

  • 1.
  • 2.
    What is WebMining? Discovering useful information from the World Wide Web ( such as web pages, internet related data and so forth) Example of applications: User patterns analysis Web page link analysis And more
  • 3.
    Web Mining InvolvesTextual information and linkage structure analysis Peta bytes of data generated per day is comparable to largest conventional data warehouses in world Often need to react to evolving usage patterns in real-time (e.g., merchandising) and also accommodate the changes.
  • 4.
    Topics related toweb mining Web graph analysis Power Laws and The Long Tail Structured data extraction Web advertising Systems Issues, User analysis Social network analysis, blog analysis
  • 5.
    Size of theWeb Number of pages Technically, infinite Much duplication (30-40%) Growing everyday Best estimate of “unique” static HTML pages comes from search engine claims Google recently announced that their index contains 1 trillion pages
  • 6.
    The web asa graph Pages = nodes, hyperlinks = edges Ignore content Directed graph High linkage 10-20 links/page on average Power-law degree distribution
  • 7.
    Web graph Let’stake a closer look at structure Broder et al (2000) studied a crawl of 200M pages and other smaller crawls Distinguish “important” pages from unimportant ones Page rank Discover communities of related pages Hubs and Authorities Detect web spam Trust rank
  • 8.
    Searching the WebContent consumers Content aggregators The Web
  • 9.
    Two Approaches toAnalyzing Data Machine Learning approach Emphasizes sophisticated algorithms e.g., Support Vector Machines Data sets tend to be small, fit in memory Data Mining approach Emphasizes big data sets (e.g., in the terabytes) Data cannot even fit on a single disk! Necessarily leads to simpler algorithms
  • 10.
    The future: VeryLarge-Scale Data Mining … Mem Disk CPU Mem Disk CPU Mem Disk CPU
  • 11.
    Visit more selfhelp tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Editor's Notes

  • #4 Google’s usage logs are much bigger than their web crawl! Order of magnitude: terabytes per day No human in the loop
  • #6 Infinite number of pages because of dynamically generated content Lots of marketing hype around search engine index size claims