Web Information Retrieval and Mining

Talk based on: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).

Published in: Technology

  1. 1. Web Retrieval and Mining Overview
      Source: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
  2. 2. Information Retrieval
      • Methods for finding information in documents
          • Started in the 1970s and 1980s
      • “Methods”
          • Algorithms and heuristics
      • “Finding”
          • Query – Document, Document – Document, etc.
      • “Documents”
          • Texts
  3. 3. The Web is different
      • Massive
          • Thousands of millions of documents
      • Dynamic
          • Updates
          • Deletes
      • Distributed
          • Variable quality
          • Malicious behavior
  4. 4. Web IR topics
      • Web Search
          • Crawling
          • Indexing
          • Querying
      • Web Mining
      • Adversarial Web IR
      • Distributed Web IR
      • Evaluation
  5. 5. Web search
  6. 6. Main goals
      • Precision
          • Relevant documents returned / Documents returned
      • Recall
          • Relevant documents returned / Relevant documents
      • Freshness
      • Performance/scalability
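The two ratios on this slide can be computed directly from sets of document ids. The following is a minimal sketch; the function name `precision_recall` and the document ids are illustrative assumptions, not part of the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: ids of the documents the system returned
    relevant: ids of the documents judged relevant for the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 relevant, 2 in common.
p, r = precision_recall(returned=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d7"])
print(p, r)  # precision is 2/4, recall is 2/3
```

Returning 0.0 for an empty result set or an empty relevance set is a convention chosen here to avoid division by zero; other conventions exist.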
  7. 7. Main goals
  8. 8. Two phases of search
      • Off-line
          • Crawling and indexing
      • On-line
          • Querying and ranking
  9. 9. Search phases
  10. 10. Web crawling
      • Download pages following rules
      • Applications
          • Create index for search
          • Find particular information items
          • Find/report problems
      • Constraints
          • Robot exclusion protocol and politeness
          • Deep web
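As an illustration of the robot exclusion constraint, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the agent name `ExampleBot`, and the URLs are made up for the example; a real crawler would download the file from the site before fetching anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption for this example).
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access


def allowed(url, agent="ExampleBot"):
    """True if the exclusion rules let `agent` download `url`."""
    return parser.can_fetch(agent, url)


# Politeness: seconds the site asks this agent to wait between requests
# (None when the file gives no Crawl-delay for the agent).
delay = parser.crawl_delay("ExampleBot")

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

A polite crawler would check `allowed(url)` before every download and sleep for at least `delay` seconds between requests to the same host.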
  11. 11. Web indexing
      • Logical view
          • Tokenization
          • Stopword removal
          • Stemming
      • Creation of an inverted index
  12. 12. Inverted index
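The indexing steps (tokenization, stopword removal, stemming, inverted index creation) can be sketched in a few lines. This is only a toy version under stated assumptions: the stopword list is tiny, `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the three documents are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each index term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in STOPWORDS:
                postings[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Crawling the Web",
    2: "Indexing and querying the Web",
    3: "Web crawlers and indexes",
}
index = build_index(docs)
print(index["web"])    # [1, 2, 3]
print(index["index"])  # [2, 3]
```

Note that the crude stemmer maps "crawling" to "crawl" but "crawlers" to "crawler", so the two end up as different terms; a real stemmer would likely conflate more forms.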
  13. 13. Challenges of indexing
      • Index compression
      • Efficiency in top-K searches
          • Sorting
      • Index distribution
          • By terms
          • By documents
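For top-K searches, one common way to avoid fully sorting every candidate is to keep only the K best in a bounded heap. The sketch below uses Python's standard `heapq.nlargest`; the scores are invented, and this is one possible approach rather than the method described in the talk.

```python
import heapq


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical relevance scores for five candidate documents.
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.45, "d4": 0.66, "d5": 0.05}
best = top_k(scores, 2)
print(best)  # [('d2', 0.87), ('d4', 0.66)]
```

With n candidates this costs roughly O(n log k) instead of the O(n log n) of a full sort, which matters when K is small and n is large.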
  14. 14. Web querying and ranking
      • Keyword-based search is the dominant paradigm
          • No large-scale open-domain QA systems (yet)
      • Relevance
          • Vector space model and variants
      • Query expansion
      • Latent semantic indexing
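A minimal sketch of relevance in the vector space model: documents and the query become TF-IDF weighted vectors, and documents are ranked by cosine similarity to the query. The weighting variant (raw term frequency times log(N/df)), the whitespace tokenization, and the sample documents are assumptions made for the example; many variants exist.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vectors, idf): one sparse TF-IDF dict per document, plus idf weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["web search engines", "web mining methods", "cooking pasta recipes"]
vectors, idf = tf_idf_vectors(docs)

# The query is weighted with the same idf values; unseen terms get weight 0.
query = {term: idf.get(term, 0.0) for term in "web search".split()}
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranking)  # [0, 1, 2]
```

The first document matches both query terms and ranks first; the third shares no term with the query and gets similarity 0.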
  15. 15. Web ranking
      • Quality is the main problem
      • Link ranking
          • Hypothesis 1: Topical locality of links
          • Hypothesis 2: Link implies endorsement
      • PageRank
      • HITS
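To make the endorsement hypothesis concrete, here is a sketch of PageRank computed by power iteration on a toy graph. The damping factor 0.85 is the commonly cited value, while the iteration count, the handling of pages without out-links, and the graph itself are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph: "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

The scores sum to 1 after every iteration, so they can be read as the probability that a random surfer is on each page.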
  16. 16. HITS
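A sketch of the HITS iteration: each page gets an authority score (sum of the hub scores of the pages linking to it) and a hub score (sum of the authority scores of the pages it links to), with both vectors normalized after each round. The graph, the iteration count, and the Euclidean normalization are assumptions for the example; HITS is usually described as running on a query-specific subgraph rather than the whole Web.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict page -> list of linked pages.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all in-linking pages.
        auth = {page: 0.0 for page in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]
        # Hub update: add up the authority scores of all linked pages.
        hub = {page: sum(auth[t] for t in links[page]) for page in pages}
        # Normalize both vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Toy graph: three hub-like pages pointing at two candidate authorities.
links = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"], "x": [], "y": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # x
```

Here "x" ends up as the strongest authority because all three hubs link to it, and "h1" and "h2" score higher as hubs than "h3" because they point to both authorities.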
  17. 17. Rank manipulation
      • “The bubble of Web visibility”
      • Content spam
          • Keyword stuffing
          • Content hiding
      • Link spam
          • Link farms
      • Cloaking
  18. 18. Web mining
  19. 19. Content mining
      • Extraction of knowledge from Web pages
          • BUT ... HTML is physical formatting
          • There is information loss
  20. 20. Information loss
  21. 21. Aspects of content mining
      • Information extraction
          • Reverse information loss
      • Content classification
          • Topic
          • Genre
      • Sentiment analysis
  22. 22. Link mining
      • Scale-free networks
  23. 23. Macroscopic view
      • Bow-tie structure
  24. 24. Usage mining
      • Logfile analysis
      • Query logs
      • Privacy issues
  25. 25. Emerging topics
      • Mobile Web
      • Semantic Web
      • ...