Mining di dati web


  1. Mining di dati web (Academic Year 2006/2007)
  2. The Course
     - Code: nw451
     - Abbreviation: MDW
     - Credits: 6
     - Schedule: Wednesday and Friday, 16:00-18:00, room B
     - Office hours:
       - By appointment, requested via e-mail
       - c/o ISTI, Area Ricerca CNR, località San Cataldo, Pisa, entrance 19
  3. Lecturers
     - Raffaele Perego, [email_address], tel. 0503152993
     - Claudio Lucchese, claudio.lucchese@isti.cnr.it, tel. 0503152967
     - Fabrizio Silvestri, [email_address], tel. 0503153011
     - Diego Puppin, diego.puppin@isti.cnr.it, tel. 0503153011
     - Antonio Panciatici, antonio.panciatici@isti.cnr.it, tel. 0503152967
  4. Course Objectives
     - The World Wide Web (WWW) has changed the way we conceive of information, make it accessible, and manage it.
     - Discovering previously unknown, non-trivial, and relevant information on the web is increasingly important, and increasingly difficult.
     - Web mining has therefore become fundamental for optimizing strategic tools such as e-commerce sites, search engines, and directories.
     - The course aims to provide tools and knowledge in this field.
  5. Course Contents
     - Introduction
       - Data Mining, Knowledge Discovery, and the Web
     - Search Engines
       - Crawling, indexing, querying
     - Web Content Mining
       - Similarity, clustering, text classification
     - Web Structure Mining
       - Social networks, ranking, etc.
     - Web Usage Mining
       - Recommender systems, etc.
     - Advanced topics (?!)
  6. Course Material
     - Textbook
       - Mining the Web: Discovering Knowledge from Hypertext Data. S. Chakrabarti. Morgan Kaufmann, 2003.
     - Recommended books
       - Managing Gigabytes. I.H. Witten, A. Moffat, and T.C. Bell. Morgan Kaufmann, 1999.
       - Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison Wesley, 1999.
     - Lecture slides and papers
       - Published at http://malvasia.isti.cnr.it/~raffaele/webmining
  7. Course Material
     - Thanks to
       - Chakrabarti and Ramakrishnan
         - For the slides accompanying the textbook, downloadable at http://www.cse.iitb.ac.in/~soumen/mining-the-web/
       - Fosca Giannotti and Dino Pedreschi
         - For the introductory slides borrowed from the TDM course
       - KDNUGGETS (http://www.kdnuggets.com)
       - Ferragina, Attardi, Garcia Molina, etc.
       - The Internet :-)
  8. Exam
     - Prerequisites (recommended)
       - AA270 – TDM – "Data Mining" Techniques – First semester.
     - Exam format
       - Passing the exam requires the successful completion of a project (individual or in groups?) and an oral discussion of the course contents (seminar on a paper of the student's choice?).
  9. Introduction
     - Data Mining and Knowledge Discovery
     - Hypertext and a brief history of the Web
     - Web Mining
  10. What is DM?
  11. What is DM?
  12. Motivations for DM
     - The data explosion problem:
       - Automated data collection tools, mature database technology, and the Internet have led to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
     - "We are drowning in information, but starving for knowledge!" (John Naisbitt)
     - Data mining:
       - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from large amounts of data
  13. Motivations for DM
     - Abundance of business and industry data
       - Competitive focus: Knowledge Management
     - Inexpensive, powerful computing engines
     - Strong theoretical/mathematical foundations
       - machine learning and logic
       - statistics
       - database management systems
       - etc.
  14. Sources of Data (e.g.)
     - Business transactions
       - Widespread use of bar codes => millions of transactions stored daily (e.g., Walmart: 2000 stores => 20M transactions per day; credit card records!)
       - Most important problem: effective use of the data within a reasonable time frame for competitive decision-making
       - e-commerce data
     - Scientific data
       - Data generated through a multitude of experiments and observations
       - Examples: geological data, satellite imaging data, NASA Earth observations, CERN HEP
       - The rate of data collection far exceeds the speed at which we can analyze it
     - Financial data
       - Company information
       - Economic data (GNP, price indexes, etc.)
       - Stock markets
  15. Sources of Data (e.g.)
     - Personal / statistical data
       - Government census
       - Medical histories
       - Customer profiles
       - Demographic data
       - Data and statistics about sports and athletes
     - World Wide Web and online repositories
       - Billions of Web documents, images, videos, etc.
       - E-mails, news, messages
       - Link structure of the hypertext from millions of Web sites
       - Web usage data (from server/proxy logs, network traffic, and user registrations)
       - Online databases and digital libraries
  16. Classes of DM applications
     - Database analysis and decision support
       - Market analysis
         - Target marketing, customer relationship management, market basket analysis
       - Risk analysis
         - Forecasting, customer retention, quality control, competitive analysis
       - Fraud detection
     - Text mining
       - E.g., mining opinions from e-mail and documents
  17. Classes of DM applications
     - THE WEB!!
       - Searching: Google, Ask Jeeves, Yahoo, etc.
       - Social network analysis
       - Web advertising
         - E.g., IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
         - Watch out for the PRIVACY pitfall!
     - Many others ...
       - Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and the Miami Heat.
       - Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
  18. What is KDD? A process!
     - The selection and processing of data for:
       - the identification of novel, accurate, and useful patterns, and
       - the modeling of real-world phenomena.
     - Data mining is a major component of the KDD process
       - automated discovery of patterns and development of predictive and explanatory models.
  19. The KDD process
     (Diagram: Data Sources → Consolidated Data → Prepared Data → Patterns & Models → Knowledge, through the stages Data Consolidation, Selection and Preprocessing, Data Mining, and Interpretation and Evaluation, supported by a warehouse.)
  20. The KDD Process in Practice
     - KDD steps can be merged or combined
       - Data Selection + Data Transformation = Data Consolidation
       - Data Cleaning + Data Integration = Data Preprocessing
     - KDD is an iterative process
       - art + engineering rather than science
  21. The virtuous cycle
     (Diagram of a cycle with the stages Identify Problem or Opportunity, Act on Knowledge, and Measure Effect of Action, connected through Problem, Strategy, Knowledge, and Results.)
  22. The steps of the KDD process
     - Learning the application domain:
       - relevant prior knowledge and goals of the application
     - Data consolidation: creating a target data set
     - Selection and preprocessing
       - Data cleaning (may take 60% of the effort!)
       - Data reduction and projection:
         - find useful features, dimensionality/variable reduction, invariant representation
     - Choosing data mining methods
       - E.g., classification, association, clustering
     - Choosing the mining algorithm(s)
     - Data mining: searching for patterns of interest
     - Interpretation and evaluation: analysis of results
       - visualization, transformation, removing redundant patterns, ...
     - Use of discovered knowledge
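As an illustration only, the main steps above can be sketched as a tiny pipeline; the data sources, column names, cleaning rule, and the trivial "pattern search" below are all hypothetical, not from the course material.

```python
# Hypothetical end-to-end sketch of KDD steps: consolidation,
# cleaning/selection, mining (a trivial pattern search), and evaluation.

def consolidate(*sources):
    """Data consolidation: merge several raw sources into one target data set."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop records with missing values (often ~60% of the effort)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def mine(rows, min_support=0.5):
    """Data mining: find attribute=value pairs occurring in >= min_support of rows."""
    counts = {}
    for r in rows:
        for item in r.items():
            counts[item] = counts.get(item, 0) + 1
    return {item for item, c in counts.items() if c / len(rows) >= min_support}

# Two hypothetical data sources about customers.
source_a = [{"segment": "retail", "churn": "no"}, {"segment": "retail", "churn": None}]
source_b = [{"segment": "retail", "churn": "no"}, {"segment": "corp", "churn": "yes"}]

data = clean(consolidate(source_a, source_b))  # consolidation + preprocessing
patterns = mine(data)                          # mining
# Interpretation and evaluation would then assess which of these
# frequent pairs are actually interesting to the application.
```

In a real KDD project each of these one-liners becomes an iterative sub-process, which is exactly why the slides describe KDD as art plus engineering rather than science.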
  23. Roles in the KDD process
  24. Major Data Mining Tasks
     - Classification: predicting an item's class
     - Clustering: finding clusters in data
     - Associations: e.g., A & B & C occur frequently
     - Visualization: to facilitate human discovery
     - Summarization: describing a group
     - Deviation detection: finding changes
     - Estimation: predicting a continuous value
     - Link analysis: finding relationships
     - ...
  25. Classification
     Learn a method for predicting the instance class from pre-labeled (classified) instances.
     Many approaches: statistics, decision trees, neural networks, ...
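To make the idea concrete, here is a minimal sketch of one such method, a 1-nearest-neighbor classifier; the training points and labels are made up for illustration.

```python
# Minimal 1-nearest-neighbor classifier: predict the class of the
# closest pre-labeled training instance (squared Euclidean distance).

def predict(train, point):
    """train: list of ((x, y), label) pairs; point: (x, y) to classify."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(train, key=lambda item: sq_dist(item[0], point))
    return label

# Hypothetical pre-labeled instances from two classes.
train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

label1 = predict(train, (0.1, 0.2))  # falls near the "A" instances
label2 = predict(train, (5.1, 4.9))  # falls near the "B" instances
```

Decision trees and neural networks replace the distance lookup with a learned model, but the contract is the same: labeled instances in, a class prediction out.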
  26. Clustering
     Find a "natural" grouping of instances given unlabeled data.
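A classic way to find such groupings is k-means; the sketch below is a simplified version on made-up one-dimensional data (fixed initial centers, a fixed number of iterations instead of a convergence test).

```python
# Simplified k-means on 1-D data: repeatedly assign each point to its
# nearest center, then move each center to the mean of its points.

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:  # assignment step
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]  # update step
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, clusters = kmeans(points, centers=[0.0, 10.0])
```

No labels are ever given: the two groups emerge purely from the distances between the instances, which is what distinguishes clustering from classification on the previous slide.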
  27. Association Rules & Frequent Itemsets
     (Example: from a table of transactions, frequent itemsets such as Milk, Bread (4); Bread, Cereal (3); Milk, Bread, Cereal (2); ... and rules such as Milk => Bread (66%).)
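Counting frequent itemsets and deriving a rule's confidence can be sketched as follows; the transaction table is hypothetical, so the support counts need not match the slide's figures.

```python
from itertools import combinations

# Count the support of itemsets over a (hypothetical) transaction table,
# then derive the confidence of a rule X => Y as support(X ∪ Y) / support(X).

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Cereal"},
    {"Bread", "Cereal"},
    {"Milk", "Juice"},
]

def support(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def frequent_itemsets(min_support=2):
    """Brute-force enumeration; Apriori would prune candidates instead."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            if support(candidate) >= min_support:
                result[candidate] = support(candidate)
    return result

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return support(lhs + rhs) / support(lhs)

# Here Milk appears in 3 transactions and Milk+Bread in 2, so the rule
# Milk => Bread has confidence 2/3, i.e. roughly the 66% of the slide.
```

Brute-force enumeration is exponential in the number of items; real miners exploit the fact that every subset of a frequent itemset is itself frequent.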
  28. Visualization & Data Mining
     - Visualizing the data to facilitate human discovery
     - Presenting the discovered results in a visually "nice" way
  29. Summarization
     - Describe the features of the selected group
     - Use natural language and graphics
     - Usually in combination with deviation detection or other methods
     Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."
  30. Data Mining Central Quest
     Find true patterns and avoid overfitting.
  31. Overfitting
     - Finding seemingly significant but actually random patterns as a result of searching too many possibilities
     - A violation of Occam's razor
       - the explanation of any phenomenon should make as few assumptions as possible
       - lex parsimoniae
         - entia non sunt multiplicanda praeter necessitatem
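The danger of searching too many possibilities can be simulated directly; the experiment below (random binary "features", a 0.75 agreement threshold, a fixed seed) is illustrative rather than from the course material.

```python
import random

# Simulate multiple testing: generate many purely random binary "features"
# and count how many happen to agree with a random target on >= 75% of rows.
# With enough candidates, some spurious "patterns" almost always appear.

def spurious_patterns(n_rows=20, n_features=500, threshold=0.75, seed=42):
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_rows)]
        agreement = sum(f == t for f, t in zip(feature, target)) / n_rows
        if agreement >= threshold:
            hits += 1
    return hits

# Every feature here is pure noise, yet a handful typically clear the bar:
# patterns that look significant but carry no predictive value at all.
```

This is exactly the overfitting trap: evaluated on fresh data, these "discovered" features would agree with the target only about half the time.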
  32. Hypertexts and the Web
  33. World Wide Web
     - Hypertext documents
       - Text
       - Links
     - The Web
       - billions of documents
       - authored by millions of diverse people
       - edited by no one in particular
       - distributed over millions of computers, connected by a variety of media
  34. History of Hypertext
     - Citation
       - Hyperlinking
     - Branching, non-linear discourse, nested commentary
       - Ramayana: one of the great epic poems of India; attributed to the sage Valmiki, it recounts the life and exploits of Lord Rama.
       - Mahabharata: an epic poem that recounts the struggle between the Kauravas and the Pandavas over the disputed kingdom of Bharata, the ancient name for India.
       - Talmud: a compilation of Jewish oral teachings, assembled in written form in the early centuries of the Christian era.
     - Dictionaries, encyclopedias
       - self-contained networks of textual nodes
       - joined by referential links
  35. Hypertext systems
     - Memex, 1945 [Vannevar Bush, US President Roosevelt's science advisor]
       - stands for "memory extension"
       - Aim: to create and help follow hyperlinks across documents
       - A photoelectrical-mechanical storage and computing device that could store vast amounts of information, in which a user could create links between related text and illustrations. These trails could then be stored and used for future reference. Bush believed that this associative method of gathering information was not only practical in its own right, but closer to the way the mind orders information.
  36. Hypertext systems
     - "Hypertext", a term coined by Ted Nelson in a 1965 paper at the ACM 20th National Conference:
       - "[...] By 'hypertext' I mean non-sequential writing: text that branches and allows choices to the reader, best read at an interactive screen."
  37. Hypertext systems
     - The first hypertext-based system was developed in 1967 by a team of researchers led by Dr. Andries van Dam at Brown University.
     - The research was funded by IBM, and the first hypertext implementation, the Hypertext Editing System, ran on an IBM/360 mainframe.
     - IBM later sold the system to the Houston Manned Spacecraft Center, which reportedly used it for the Apollo space program documentation.
  38. Hypertext systems
     - Xanadu hypertext, by Ted Nelson, 1981:
       - In the Xanadu scheme, a universal document database (the "docuverse") would allow addressing of any substring of any document from any other document. "This requires an even stronger addressing scheme than the Universal Resource Locators used in the World-Wide Web." [De Bra]
       - Additionally, Xanadu would permanently keep every version of every document, thereby eliminating the possibility of a broken link; only the current version of a document would be stored in its entirety.
  39. World-Wide Web
     - Initiated at CERN in 1989
     - By Tim Berners-Lee, now W3C director:
       - "W3 was originally developed to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups. Originally aimed at the High Energy Physics community, it has spread to other areas and attracted much interest in user support, resource discovery and collaborative work areas. It is currently the most advanced information system deployed on the Internet, and embraces within its data model most information in previous networked information systems."
  40. World-Wide Web
     - GUIs
       - Berners-Lee (WorldWideWeb, 1990)
       - Erwise and Viola (1992), Midas (1993)
     - Mosaic (1993)
       - a hypertext GUI for the X Window System
       - HTML: markup language for rendering hypertext
       - HTTP: hypertext transfer protocol for sending HTML and other data over the Internet
       - CERN HTTPD: a server of hypertext documents
  41. The early days of the Web: CERN HTTP traffic grew by a factor of 1000 between 1991 and 1994 (image courtesy of the W3C)
  42. The early days of the Web: the number of servers grew from a few hundred to a million between 1991 and 1997 (image courtesy of Nielsen)
  43. 1994: the landmark year
     - Foundation of the Mosaic Communications Corporation (later Netscape)
     - First World-Wide Web conference
     - MIT and CERN agreed to set up the World-Wide Web Consortium (W3C)
  44. The Web
     - A populist, participatory medium
       - number of writers ≈ number of readers
       - enables near-zero-cost dissemination of information
     - Abundance and authority crisis
       - a liberal and informal culture of content generation and dissemination
       - very little uniform civil code
       - redundancy and non-standard form and content
       - millions of qualifying pages for most broad queries
         - Example: java or kayaking
       - no authoritative information per se about the reliability of a site
  45. Problems due to uniform accessibility
     - Little support for adapting to the background of specific users
     - Commercial interests routinely influence the operation of Web search
       - Users pay for connection costs, not for content
       - Profit depends on ads, sales, etc.
       - "Search Engine Optimization"!!
  46. What is Web Mining?
     Discovering interesting and useful information from Web content, structure, and usage.
     - Examples:
       - Web search, e.g., Google, Yahoo, MSN, Ask, ...
       - Specialized search, e.g., Froogle (comparison shopping), job ads (Flipdog)
       - eCommerce:
         - Recommendations, e.g., Netflix, Amazon
         - Improving the conversion rate: the next best product to offer
       - Advertising, e.g., Google AdSense
       - Fraud detection: click fraud detection, ...
       - Improving Web site design and performance
  47. How does it differ from "classical" Data Mining?
     - The web is not a relation
       - Textual information and linkage structure
     - Usage data is huge and growing rapidly
       - Google's usage logs are bigger than their web crawl
       - Data generated per day is comparable to the largest conventional data warehouses
     - Content and structure data are rich in features and patterns
       - spontaneous formation and evolution of
         - topic-induced graph clusters
         - hyperlink-induced communities
     - Ability to react in real time to usage patterns
       - No human in the loop
     (This and the following slides reproduced from Ullman & Rajaraman with permission)
  48. How big is the Web?
     - Number of pages
       - Technically infinite
         - because of dynamically generated content
         - lots of duplication (30-40%)
       - The best estimate of "unique" static HTML pages comes from search engine claims
         - Google: 8 billion; Yahoo: 20 billion
         - Lots of marketing hype
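Duplication on that scale is usually measured with near-duplicate detection; below is an illustrative sketch using word shingles and Jaccard similarity (the documents and the shingle width are made up, and real systems sketch the shingle sets with hashing rather than comparing them exactly).

```python
# Estimate document similarity with w-shingling: two pages are
# near-duplicates when the Jaccard similarity of their shingle sets is high.

def shingles(text, w=3):
    """Set of all w-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the sleepy dog"  # near-duplicate
doc3 = "completely different content about web mining"

sim_near = jaccard(shingles(doc1), shingles(doc2))  # one changed word
sim_far = jaccard(shingles(doc1), shingles(doc3))   # nothing in common
```

A single changed word destroys up to w shingles, so the similarity drops smoothly with the amount of editing instead of collapsing to zero the way exact page checksums do.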
  49. 96,854,877 web sites (September 2006)
     Total sites across all domains, August 1995 - September 2006
     http://news.netcraft.com/archives/web_server_survey.html
  50. The web as a graph
     - Pages = nodes, hyperlinks = edges
       - Ignore content
       - Directed graph
     - High linkage
       - 8-10 links per page on average
       - Power-law degree distribution
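Treating pages as nodes and links as directed edges, in- and out-degrees can be tallied directly; the link structure below is a made-up toy graph.

```python
from collections import Counter

# A hypothetical toy web graph: page -> list of pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

# Out-degree: how many links a page emits; in-degree: how many it receives.
out_degree = {page: len(targets) for page, targets in links.items()}
in_degree = Counter(target for targets in links.values() for target in targets)

# Even in this tiny graph the in-links concentrate on one page (c.html);
# at web scale that skew shows up as a power-law degree distribution.
```

The same two dictionaries, built from a real crawl of billions of edges, are the starting point for the degree-distribution and bow-tie analyses on the next slides.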
  51. Power-law degree distribution
     Source: Broder et al., 2000
  52. Power-laws abounding
     - In-degrees
     - Out-degrees
     - Number of pages per site
     - Number of visitors
     - Term distribution in pages
     - Query distribution in query logs
     - Let's take a closer look at structure
       - Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
         - Not a "small world"
  53. Bow-tie structure
     Source: Broder et al., 2000
  54. Searching the Web
     (Diagram: content aggregators sit between the Web and content consumers.)
  55. Ads vs. search results
  56. Ads vs. search results
     - Search advertising is the revenue model
       - A multi-billion-dollar industry
       - Advertisers pay for clicks on their ads
     - Interesting problems
       - How to pick the top 10 results for a search from 2,230,000 matching pages?
       - Which ads to show for a search?
       - If I'm an advertiser, which search terms should I bid on, and how much should I bid?
  57. Sidebar: What's in a name?
     - Geico sued Google, contending that it owned the trademark "Geico"
       - and thus that ads for the keyword geico couldn't be sold to others
     - Court ruling: search engines can sell keywords, including trademarks
     - No court ruling yet on whether the ad itself can use the trademarked word(s)
  58. Extracting structured data
     http://www.simplyhired.com
  59. Extracting structured data
     http://www.fatlens.com
  60. The Long Tail (yet another power-law)
     Source: Chris Anderson (2004)
  61. The Long Tail
     - Shelf space is a scarce commodity for traditional retailers
       - Also: TV networks, movie theaters, ...
     - The web enables near-zero-cost dissemination of information about products
     - More choices necessitate better filters
       - Recommendation engines (e.g., Amazon)
  62. Major Web Mining topics
     - Crawling the web
     - Web graph analysis
     - Structured data extraction
     - Classification and vertical search
     - Collaborative filtering
     - Web advertising and optimization
     - Mining web logs
     - Systems issues
  63. Web search basics
     (Diagram: a web crawler feeds an indexer, which builds the indexes; user queries are answered from the indexes and the ad indexes.)
  64. Search engine components
     - Spider (a.k.a. crawler/robot): builds the corpus
       - Collects web pages recursively
         - For each known URL, fetch the page, parse it, and extract new URLs
         - Repeat
       - Additional pages from direct submissions and other sources
     - Indexer: creates inverted indexes
       - Various policies w.r.t. which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
     - Query processor: serves query results
       - Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
       - Back end: finds matching documents and ranks them
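The indexer and query processor can be sketched as an inverted index plus a conjunctive query evaluator; the tiny "corpus" below stands in for crawler output and is invented for illustration (no stemming, phrases, or ranking).

```python
from collections import defaultdict

# Build an inverted index (term -> set of document ids) and answer
# conjunctive (AND) queries by intersecting posting sets.

def build_index(corpus):
    """Indexer: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():  # crude tokenization policy
            index[term].add(doc_id)
    return index

def search(index, query):
    """Query back end: ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical pages collected by the spider.
corpus = {
    "d1": "web mining and search engines",
    "d2": "data mining on the web",
    "d3": "cooking recipes",
}
index = build_index(corpus)
results = search(index, "web mining")  # both d1 and d2 match
```

A production back end would also rank the intersected postings, which is where the link-analysis and usage-mining topics of the rest of the course come in.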
