Webmining Overview

880
-1

Published on

An introduction to web mining

Published in: Education, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
880
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Google’s usage logs are much bigger than their web crawl! Order of magnitude: terabytes per day No human in the loop
  • Infinite number of pages because of dynamically generated content Lots of marketing hype around search engine index size claims
  • Webmining Overview

    1. 1. Introduction to Web Mining
    2. 2. What is Web Mining? <ul><li>Discovering useful information from the World Wide Web ( such as web pages, internet related data and so forth) </li></ul><ul><li>Example of applications: </li></ul><ul><li>User patterns analysis </li></ul><ul><li>Web page link analysis </li></ul><ul><li>And more </li></ul>
    3. 3. Web Mining <ul><ul><li>Involves Textual information and linkage structure analysis </li></ul></ul><ul><ul><li>Peta bytes of data generated per day is comparable to largest conventional data warehouses in world </li></ul></ul><ul><ul><li>Often need to react to evolving usage patterns in real-time (e.g., merchandising) and also accommodate the changes. </li></ul></ul>
    4. 4. Topics related to web mining <ul><li>Web graph analysis </li></ul><ul><li>Power Laws and The Long Tail </li></ul><ul><li>Structured data extraction </li></ul><ul><li>Web advertising </li></ul><ul><li>Systems Issues, User analysis </li></ul><ul><li>Social network analysis, blog analysis </li></ul>
    5. 5. Size of the Web <ul><li>Number of pages </li></ul><ul><ul><li>Technically, infinite </li></ul></ul><ul><ul><li>Much duplication (30-40%) </li></ul></ul><ul><ul><li>Growing everyday </li></ul></ul><ul><ul><li>Best estimate of “unique” static HTML pages comes from search engine claims </li></ul></ul><ul><ul><ul><li>Google recently announced that their index contains 1 trillion pages </li></ul></ul></ul>
    6. 6. The web as a graph <ul><li>Pages = nodes, hyperlinks = edges </li></ul><ul><ul><li>Ignore content </li></ul></ul><ul><ul><li>Directed graph </li></ul></ul><ul><li>High linkage </li></ul><ul><ul><li>10-20 links/page on average </li></ul></ul><ul><ul><li>Power-law degree distribution </li></ul></ul>
    7. 7. Web graph <ul><li>Let’s take a closer look at structure </li></ul><ul><ul><li>Broder et al (2000) studied a crawl of 200M pages and other smaller crawls </li></ul></ul><ul><li>Distinguish “important” pages from unimportant ones </li></ul><ul><ul><li>Page rank </li></ul></ul><ul><li>Discover communities of related pages </li></ul><ul><ul><li>Hubs and Authorities </li></ul></ul><ul><li>Detect web spam </li></ul><ul><ul><li>Trust rank </li></ul></ul>
    8. 8. Searching the Web Content consumers Content aggregators The Web
    9. 9. Two Approaches to Analyzing Data <ul><li>Machine Learning approach </li></ul><ul><ul><li>Emphasizes sophisticated algorithms e.g., Support Vector Machines </li></ul></ul><ul><ul><li>Data sets tend to be small, fit in memory </li></ul></ul><ul><li>Data Mining approach </li></ul><ul><ul><li>Emphasizes big data sets (e.g., in the terabytes) </li></ul></ul><ul><ul><li>Data cannot even fit on a single disk! </li></ul></ul><ul><ul><li>Necessarily leads to simpler algorithms </li></ul></ul>
    10. 10. The future: Very Large-Scale Data Mining … Mem Disk CPU Mem Disk CPU Mem Disk CPU
    11. 11. Visit more self help tutorials <ul><li>Pick a tutorial of your choice and browse through it at your own pace. </li></ul><ul><li>The tutorials section is free, self-guiding and will not involve any additional support. </li></ul><ul><li>Visit us at www.dataminingtools.net </li></ul>

    ×