
Webmining Overview


Introduction to Web Mining

Published in: Education, Technology

  1. Introduction to Web Mining
  2. What is Web Mining?
       • Discovering useful information from the World Wide Web (web pages, internet-related data, and so forth)
       • Example applications:
           • User pattern analysis
           • Web page link analysis
           • And more
  3. Web Mining
       • Involves analysis of both textual information and linkage structure
       • Petabytes of data are generated per day, comparable to the largest conventional data warehouses in the world
       • Often needs to react to evolving usage patterns in real time (e.g., merchandising) and accommodate the changes
  4. Topics related to web mining
       • Web graph analysis
       • Power laws and the long tail
       • Structured data extraction
       • Web advertising
       • Systems issues, user analysis
       • Social network analysis, blog analysis
  5. Size of the Web
       • Number of pages
           • Technically, infinite (pages can be generated dynamically)
           • Much duplication (30-40%)
           • Growing every day
           • Best estimate of "unique" static HTML pages comes from search engine claims
               • Google announced in 2008 that its crawlers had seen 1 trillion unique URLs (not all of them indexed)
  6. The web as a graph
       • Pages = nodes, hyperlinks = edges
           • Ignore content
           • Directed graph
       • High linkage
           • 10-20 links per page on average
           • Power-law degree distribution
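The graph model on this slide can be sketched in a few lines. This is a minimal illustration, not code from the deck: the `links` dictionary is a made-up four-page web, and `in_degrees` and `degree_histogram` are hypothetical helper names. On a real crawl, the histogram is what one would plot on log-log axes to look for the power-law shape.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def in_degrees(graph):
    """Count incoming hyperlinks (in-degree) for every page."""
    counts = Counter({page: 0 for page in graph})
    for _source, targets in graph.items():
        for target in targets:
            counts[target] += 1
    return dict(counts)

def degree_histogram(degrees):
    """Map each degree value to the number of pages that have it."""
    return dict(Counter(degrees.values()))

in_deg = in_degrees(links)
hist = degree_histogram(in_deg)
print(in_deg)
print(hist)
```

Page content is ignored entirely, as the slide suggests: only the link structure is stored.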
  7. Web graph
       • Let's take a closer look at structure
           • Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
       • Distinguish "important" pages from unimportant ones
           • PageRank
       • Discover communities of related pages
           • Hubs and Authorities
       • Detect web spam
           • TrustRank
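The slide names PageRank without describing it. The following is a minimal power-iteration sketch of the commonly described formulation, under stated assumptions: the damping factor 0.85 and the iteration count are illustrative defaults, the toy `links` graph is made up, and every link target is assumed to appear as a key in the graph. It is not the deck's own implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank along links."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical toy web: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page `c` collects links from three pages and ends up ranked highest, while `d`, which nothing links to, keeps only the baseline share.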
  8. Searching the Web
       [Diagram: content consumers, content aggregators, the Web]
  9. Two Approaches to Analyzing Data
       • Machine learning approach
           • Emphasizes sophisticated algorithms, e.g., Support Vector Machines
           • Data sets tend to be small and fit in memory
       • Data mining approach
           • Emphasizes big data sets (e.g., in the terabytes)
           • Data cannot even fit on a single disk!
           • Necessarily leads to simpler algorithms
  10. The future: Very Large-Scale Data Mining
       [Diagram: a row of machines, each with its own CPU, memory, and disk]
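One possible reading of this diagram is that the data is partitioned across machines, each machine does simple local work on its own disk and memory, and the partial results are merged. The sketch below simulates that idea in a single process, counting in-links over a made-up edge list split into three partitions. The function names and the partitioning are assumptions for illustration; the deck does not specify a particular framework.

```python
from collections import Counter

def count_links_on_node(edge_partition):
    """Local step: one machine counts in-links using only its own partition."""
    return Counter(target for _source, target in edge_partition)

def merge_counts(partial_counts):
    """Merge step: add up the per-machine counts into a global total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Hypothetical (source, target) hyperlinks, split across three "machines".
partitions = [
    [("a", "b"), ("a", "c")],
    [("b", "c"), ("c", "a")],
    [("d", "c")],
]

partials = [count_links_on_node(part) for part in partitions]
total_in_links = merge_counts(partials)
print(dict(total_in_links))
```

The per-machine step is deliberately trivial, which fits the deck's point that very large data sets push toward simpler algorithms.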
  11. Visit more self-help tutorials
       • Pick a tutorial of your choice and browse through it at your own pace.
       • The tutorials section is free and self-guiding, and does not involve any additional support.
       • Visit us at