Intelligent web crawling

Denis Shestakov, Aalto University
Slides for tutorial given at WI-IAT'13 in Atlanta, USA on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges

  1. 1. INTELLIGENT WEB CRAWLING. WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013 (ver. 1.8: 10.04.2015). Denis Shestakov (denshe at gmail), Department of Media Technology, Aalto University, Finland
  2. 2. References to this tutorial. To cite, please use: D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013.
  3. 3. Speaker's Bio. Postdoc (2009-2013) in Web Services Group, Aalto University, Finland. PhD thesis (2008) on limited coverage of web crawlers. Over ten years of experience in the area. Tutorials on web crawling given at SAC'12 and ICWE'13. [Photo: Web Services Group in 2011]
  4. 4. Speaker's Info. As of 2013: Current: http://www.linkedin.com/in/dshestakov http://www.mendeley.com/profiles/denis-shestakov/ http://www.researchgate.net/profile/Denis_Shestakov https://mediatech.aalto.fi/~denis/
  5. 5. TUTORIAL OUTLINE. I. OVERVIEW: web crawling in a nutshell; web crawling applications; web size and web link structure. II. INTELLIGENT WEB CRAWLING: architecture of web crawler; crawling strategies; adaptive crawling approaches. III. OPEN CHALLENGES: crawlers in Web ecosystem; collaborative web crawling; deep Web crawling; crawling multimedia content.
  6. 6. Links to Tutorial. Slides: http://goo.gl/woVtQk and http://www.slideshare.net/denshe/presentations. Similar tutorials: tutorials on web crawling at ICWE'13 and SAC'12. How they differ from this tutorial: they give a better overview of the topic (parts I and III) but do not cover crawling strategies (part II). Supporting materials: http://www.mendeley.com/groups/531771/web-crawling/
  7. 7. PART I: OVERVIEW. [Illustration: visualization of http://media.tkk.fi/webservices by aharef.info applet]
  8. 8. Outline of Part I: Overview of Web Crawling. Web crawling in a nutshell; web crawling applications; web size and web link structure.
  9. 9. Web Crawling in a Nutshell. Automatic harvesting of web content, done by web crawlers (also known as robots, bots or spiders). Take a link from a set of links (URL queue), download the page, extract all its links, eliminate those already visited, add the rest to the queue, then repeat. A set of policies is involved (e.g., 'ignore links to images').
  10. 10. Web Crawling in a Nutshell. Example: 1. Fetch http://media.tkk.fi/webservices (the slide shows a visualization of its HTML DOM tree). 2. Extract URLs inside blue bubbles (designating <a> tags). 3. Remove already visited URLs. 4. For each non-visited URL, start at Step 1.
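Steps 2 and 3 of the example can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the tutorial's own code: the names `LinkExtractor` and `new_links` and the `example.org` URLs are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so the same page is not queued twice.
                self.links.append(urldefrag(absolute)[0])


def new_links(html, base_url, visited):
    """Step 2 (extract URLs) followed by step 3 (remove visited URLs)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    fresh = []
    for url in parser.links:
        if url not in visited and url not in fresh:
            fresh.append(url)
    return fresh


page = '<a href="/a.html">A</a> <a href="b.html#top">B</a> <a href="/a.html">A</a>'
visited = {"http://example.org/a.html"}
print(new_links(page, "http://example.org/dir/index.html", visited))
```

Resolving relative links against the page URL and dropping fragments are the two normalisation steps a crawler can hardly do without; real crawlers normalise URLs much more aggressively.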
  11. 11. Web Crawling in a Nutshell. In essence, a simple and naive process. However, a number of 'restrictions' make it much more complicated. Most complexities are due to the operating environment (the Web). For example, a crawler must not overload web servers (challenging, as the distribution of web pages over web servers is non-uniform). Or it must avoid web spam (which is not only useless but also consumes resources and often spoils the collected content).
  12. 12. Web Crawling in a Nutshell: Crawler Agents. First in 1993: the Wanderer (written in Perl). Over 1100 different crawler signatures (User-Agent string in HTTP request header) are listed at http://www.crawltrack.net/crawlerlist.php. Educated guess on the overall number of different crawlers: at least several thousand. Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing), or use an existing agent, e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/).
  13. 13. Web Crawling in a Nutshell: Crawler Agents. For advanced tasks, you may modify the code of existing projects in your preferred programming language. Crawlers play a big role on the Web: they bring more traffic to certain web sites than human visitors, generate a sizeable portion of traffic to any (public) web site, and their traffic is important for emerging web sites.
  14. 14. Web Crawling in a Nutshell: Classification. General/universal crawlers: not so many of them, as lots of resources are required; run by big web search engines. Topical/focused crawlers: pages/sites on a certain topic; crawling everything in one specific (e.g., national) web segment is rather general, though. Batch crawling: one or several (static) snapshots. Incremental/continuous crawling: re-visiting; resources are divided between fetching newly discovered pages and re-downloading previously crawled pages; used by search engines.
  15. 15. Applications of Web Crawling: Web Search Engines. Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ... Crawling is one of three underlying technology stacks.
  16. 16. Applications of Web Crawling: Web Search Engines. One of three underlying technology stacks. BTW, what are the other two and which is the most 'crucial'?
  17. 17. Applications of Web Crawling: Web Search Engines. What are the other two and which is the most 'crucial'? Query processor (particularly, ranking).
  18. 18. Applications of Web Crawling: Web Archiving. Digital preservation: a "librarian" view of the Web. The biggest: Internet Archive. Quite huge collections, built by batch crawls. Primarily collections of national web sites, i.e., web sites at country-specific TLDs or physically hosted in a country. There are quite many archives and some are huge; see the list of Web Archiving Initiatives at Wikipedia.
  19. 19. Applications of Web Crawling: Vertical Search Engines. Aggregating data from many sources on a certain topic, e.g., apartment search, car search.
  20. 20. Applications of Web Crawling: Web Data Mining. "To get data to be actually mined". Usually done with focused crawlers. For example, opinion mining, or digests of current happenings on the Web (e.g., what music people listen to now).
  21. 21. Applications of Web Crawling: Web Monitoring. Monitoring sites/pages for changes and updates.
  22. 22. Applications of Web Crawling: Detection of malicious web sites. Typically a part of an anti-virus, firewall, search engine, etc. service. Builds a list of such web sites and informs users about the potential threat of visiting them.
  23. 23. Applications of Web Crawling: Web site/application testing. Crawl a web site to check navigation through it, validity of the links, etc. Regression/security/... testing of a rich internet application (RIA) via crawling: checking different application states by simulating possible user interaction events (e.g., mouse click, time-out).
  24. 24. Applications of Web Crawling: Copyright violation detection. Crawl to find (media) items under copyright or links to them. Regular re-visiting of 'suspicious' web sites, forums, etc. Tasks like finding terrorist chat rooms also go here.
  25. 25. Applications of Web Crawling: Web Scraping. Extracting particular pieces of information from a group of typically similar pages, when an API to the data is not available. Interestingly, scraping might be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API.
  26. 26. Applications of Web Crawling: Web Mirroring. Copying of web sites. Hosting copies on different servers to ensure 24x7 accessibility.
  27. 27. Industry vs. Academia Divide in the web crawling domain. Huge gap between industrial and academic web crawlers, both research-wise and development-wise. Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known. Industrial crawlers operate on a web scale, that is, tens of billions of pages. Only a few academic crawlers have dealt with more than one billion pages; the academic scale is rather hundreds of millions.
  28. 28. Industry vs. Academia. Re-crawling: batch crawls in academia; regular re-crawls by industrial crawlers. Evaluation of crawled data: crucial for corrections/improvements to crawlers; direct evaluation by users of search engines; to some extent, artificial evaluation of academic crawls.
  29. 29. Web Size and Structure: Some numbers. The number of pages per host is not uniform: most hosts contain only a few pages, others contain millions. Roughly 100 links on a page. According to Google statistics (over 4 billion pages, 2010), fetching a page takes 320 KB (textual content plus all embedded resources). A page has 10-100 KB of textual (HTML) content on average. One trillion URLs known by Google/Yahoo in 2008.
  30. 30. Web Size and Structure: Some numbers. 20 million web pages in 1995 (indexed by AltaVista). One trillion (10^12) URLs known by Google/Yahoo in 2008; an 'independent' search engine called Majestic12 (P2P crawling) confirms one trillion items. This doesn't mean one trillion indexed pages: supposedly, the index holds dozens of times fewer pages. Cool crawler facts: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months; throughput: 1000-1500 pages per second; over 30 billion discovered URLs.
  31. 31. Web Size and Structure: Bow-tie model of the Web. [Illustration taken from http://dx.doi.org/doi:10.1038/35012155]
  32. 32. PART II: INTELLIGENT WEB CRAWLING
  33. 33. Outline of Part II: Intelligent Web Crawling. Architecture of web crawler; crawling strategies; adaptive crawling approaches.
  34. 34. Architecture of Web Crawler. Crawler crawls the Web. [Diagram labels: seed URLs, crawled URLs, URL frontier, uncrawled Web]
  35. 35. Architecture of Web Crawler. Typically in a distributed fashion. [Diagram labels: seed URLs, crawled URLs, URL frontier, crawling thread, uncrawled Web]
  36. 36. Architecture of Web Crawler: URL Frontier. May include multiple pages from the same host. Must avoid trying to fetch them all at the same time. Must try to keep all crawling threads busy. Prioritization also helps.
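The two frontier requirements (do not hit one host with many requests at once, yet keep handing out work) can be sketched with one FIFO queue per host plus a heap of "earliest allowed fetch time" entries. This is a simplified, single-threaded illustration; the class name `PoliteFrontier`, the 2-second delay and the explicit `now` argument are assumptions of this sketch, not part of the tutorial.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL frontier with one FIFO queue per host and a minimum delay per host."""

    def __init__(self, delay=2.0):
        self.delay = delay        # assumed gap (seconds) between hits to one host
        self.queues = {}          # host -> deque of URLs waiting for that host
        self.ready = []           # heap of (earliest_fetch_time, host)
        self.next_allowed = {}    # host -> time of the next permitted request

    def add(self, url, now=0.0):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            start = max(now, self.next_allowed.get(host, 0.0))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        else:
            del self.queues[host]
        return url


frontier = PoliteFrontier(delay=2.0)
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(u)
print(frontier.next_url(0.0), frontier.next_url(0.0), frontier.next_url(1.0))
```

A crawling thread that receives `None` simply waits; with enough distinct hosts in the frontier this rarely happens, which is how the design keeps threads busy while staying polite.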
  37. 37. Architecture of Web Crawler: Crawler Architecture. [Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.]
  38. 38. Architecture of Web Crawler. Content seen? If the fetched page is already in the base/index, don't process it; uses document fingerprints (shingles). Filtering: filter out URLs due to 'politeness' and restrictions on the crawl; fetched robots.txt files are cached to avoid fetching them repeatedly. Duplicate URL elimination: check whether an extracted and filtered URL has already been passed to the frontier (batch crawling); more complicated in continuous crawling (different URL frontier implementation).
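The "content seen?" test with shingles can be sketched as follows. This is a minimal illustration under stated assumptions: 4-word shingles, MD5 as the fingerprint function, and the Jaccard coefficient as the resemblance measure; production crawlers use compact sketches of the shingle set rather than the full set.

```python
import hashlib


def shingles(text, k=4):
    """Set of fingerprints of all k-word windows (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        windows = [" ".join(words)]
    else:
        windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.md5(w.encode("utf-8")).hexdigest() for w in windows}


def resemblance(a, b, k=4):
    """Jaccard coefficient of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


original = "web crawlers download pages and extract links from them"
near_copy = "web crawlers download pages and extract links from those"
print(len(shingles("a rose is a rose is a rose")))
print(resemblance(original, near_copy))
```

A page whose resemblance to an already stored page exceeds a chosen threshold (say 0.9, an arbitrary value here) would be treated as already seen.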
  39. 39. Architecture of Web Crawler: Distributed Crawling. Run multiple crawl threads, under different processes (often at different nodes). Nodes can be geographically distributed. Partition the hosts being crawled among the nodes.
  40. 40. Architecture of Web Crawler: Host Splitter. [Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.]
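One simple way to partition hosts among nodes is to hash the host name, so that all URLs of one host land on the same node and that node alone enforces politeness for it. The sketch below assumes this hash-based scheme; the function name `node_for_url` is hypothetical, and the illustrated host splitter may use a different assignment rule.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a crawler node index based on its host name only."""
    host = urlsplit(url).netloc.lower()
    # A stable hash is used because Python's built-in hash() of strings
    # changes between processes, which would break the partitioning.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


for u in ["http://a.example/x", "http://a.example/y", "http://b.example/"]:
    print(u, "-> node", node_for_url(u, num_nodes=4))
```

A node that extracts a URL belonging to another node forwards it there instead of adding it to its own frontier.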
  41. 41. Architecture of Web Crawler. Implementation (in Perl). Other popular languages: Java, Python, C/C++.
  42. 42. Architecture of Web Crawler. Crawling objectives: high web coverage; high page freshness; high content quality; high download rate. Internal (I) and external (E) factors: amount of hardware (I); network bandwidth (I); rate of web growth (E); rate of web change (E); amount of malicious content, e.g., spam, duplicates (E).
  43. 43. Crawling Strategies: Download prioritization. Given a period of time, only a subset of web pages can be downloaded. "Important" pages should come first, hence the need for prioritization: ordering the queue of URLs to be visited. Strategies (ordering metrics): Breadth-First, Depth-First; Backlink count; Best-First; PageRank; Shark-Search.
  44. 44. Crawling Strategies: Breadth-First, Depth-First. Breadth-First search: implemented with a QUEUE (FIFO); pages with the shortest paths from the seeds come first. Depth-First search: implemented with a STACK (LIFO).
  45. 45. Crawling Strategies: Pseudocode for Breadth-First.
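The pseudocode itself did not survive in this transcript. A breadth-first crawl along the lines described on the previous slide might look like the sketch below; `fetch_links` is a hypothetical callback standing in for "download the page and extract its links", and the toy link graph replaces the real Web.

```python
from collections import deque


def breadth_first_crawl(seeds, fetch_links, max_pages):
    """Visit pages in FIFO order, so pages closest to the seeds come first."""
    frontier = deque(seeds)   # FIFO queue of URLs to visit
    seen = set(seeds)         # URLs already passed to the frontier
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}
print(breadth_first_crawl(["A"], lambda u: toy_web.get(u, []), max_pages=10))
```

Replacing `frontier.popleft()` with `frontier.pop()` turns the queue into a stack and the crawl into a depth-first one.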
  46. 46. Crawling Strategies: Backlink count. Use the link graph information: count the number of crawled pages that point to a page. Links with the highest counts go first.
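A backlink-count ordering can be sketched by remembering, for every discovered URL, which crawled pages link to it and always picking the frontier URL with the largest such set. The callback `fetch_links`, the alphabetical tie-breaking and the linear scan of the frontier are simplifying assumptions of this sketch.

```python
from collections import defaultdict


def backlink_crawl(seeds, fetch_links, max_pages):
    """Crawl the frontier URL with the most links from crawled pages first."""
    backlinks = defaultdict(set)   # url -> crawled pages that link to it
    frontier = set(seeds)
    done = set()
    order = []
    while frontier and len(order) < max_pages:
        # Highest backlink count first; ties are broken alphabetically.
        url = min(frontier, key=lambda u: (-len(backlinks[u]), u))
        frontier.remove(url)
        done.add(url)
        order.append(url)
        for link in fetch_links(url):
            backlinks[link].add(url)
            if link not in done:
                frontier.add(link)
    return order


toy_web = {"A": ["B", "C", "D"], "B": ["D"], "C": ["D"], "D": []}
print(backlink_crawl(["A"], lambda u: toy_web.get(u, []), max_pages=10))
```

On this toy graph page D overtakes C as soon as B has been crawled, because D is then pointed to by two crawled pages while C is pointed to by one.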
  47. 47. Crawling Strategies: Best-First. The best link is selected based on some criterion, e.g., lexical similarity between the topic's keywords and the link's source page. The similarity score sim(topic, p) is assigned to all outgoing links of page p. Cosine similarity is often used: $\mathrm{sim}(q,p) = \frac{\sum_k f_{kq} f_{kp}}{\sqrt{\sum_k f_{kq}^2 \, \sum_k f_{kp}^2}}$, where q is a topic, p is a crawled page, and $f_{kq}$, $f_{kp}$ are the frequencies of term k in q and p.
  48. 48. Crawling Strategies: Pseudocode for Best-First.
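Since the pseudocode is not preserved here, the following is a hedged sketch of a best-first crawl: the frontier is a priority queue, and each link inherits the cosine similarity between the topic and the page it was found on. The callback `fetch` (returning page text and out-links), the whitespace tokenisation and the seed priority of 1.0 are assumptions of this example.

```python
import heapq
import math
from collections import Counter


def cosine(q_text, p_text):
    """Cosine similarity of raw term-frequency vectors of two texts."""
    q, p = Counter(q_text.lower().split()), Counter(p_text.lower().split())
    dot = sum(q[k] * p[k] for k in q)
    norm = math.sqrt(sum(v * v for v in q.values()) * sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0


def best_first_crawl(topic, seeds, fetch, max_pages):
    """fetch(url) -> (page_text, outgoing_links) is a hypothetical callback."""
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        order.append(url)
        score = cosine(topic, text)             # score of the source page
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return order


pages = {
    "seed": ("web links", ["x", "y"]),
    "x": ("cooking recipes", ["x1"]),
    "y": ("web crawler design", ["y1"]),
    "x1": ("more cooking", []),
    "y1": ("focused web crawler", []),
}
print(best_first_crawl("web crawler", ["seed"], lambda u: pages[u], max_pages=5))
```

In the toy run, `y1` is fetched before `x1` although `x1` was discovered earlier, because `y1` was found on a page that is closer to the topic.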
  49. 49. Crawling Strategies: PageRank. The PageRank of a page is the probability that a random surfer (who follows links randomly) is on this page at any given time. A page's score (rank) is defined by the scores of the pages that link to it, where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, and γ is a damping factor. PageRank of pages is periodically recalculated using the data structure holding the crawled pages.
  50. 50. Crawling Strategies: Pseudocode for PageRank.
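The formula and the pseudocode are not preserved in this transcript. A commonly used form that matches the symbols defined on the previous slide is $PR(p) = (1-\gamma) + \gamma \sum_{d \in in(p)} \frac{PR(d)}{|out(d)|}$; the sketch below assumes this form (not necessarily the exact one shown on the slide), a damping factor of 0.85 and a fixed number of iterations.

```python
def pagerank(out_links, gamma=0.85, iterations=50):
    """Iteratively evaluate PR(p) = (1 - gamma) + gamma * sum(PR(d) / |out(d)|).

    out_links maps each page to a list of distinct pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - gamma for p in pages}
        for d, targets in out_links.items():
            if targets:
                share = gamma * rank[d] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank


graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In a crawler, such a computation would be rerun periodically over the graph of pages crawled so far, and the frontier reordered by the resulting scores.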
  51. 51. Crawling Strategies: Shark-Search. More emphasis on web segments where relevant pages were found; segments yielding only a few relevant pages are penalized. A link's score is defined by the link's anchor text, the text surrounding the link (link context) and the score inherited from ancestor pages (pages pointing to the page with this link). Parameters: d, the depth bound; r, the relative importance of the inherited score versus the link neighbourhood score.
  52. 52. Crawling Strategies: Pseudocode for Shark-Search.
  53. 53. Adaptive Crawling: Static vs. adaptive strategies. The strategies presented up to this point are static: they do not adjust in the course of the crawl. Adaptive (intelligent) crawling: InfoSpiders; ant-based crawling.
  54. 54. Adaptive Crawling: InfoSpiders. Independent agents crawling in parallel. [Diagram labels: HTML document; HTML parser; noise word remover; stemmer; compact document representation; document relevance assessment; document assessment; link assessment and selection; learning; reproduction or death; agent representation: keyword vector, term weights, neural net weights]
  55. 55. Adaptive Crawling: InfoSpiders. Independent agents crawling in parallel. Each agent uses a list of keywords (initialized with the topic keywords). A neural network evaluates new links: keywords in the vicinity of a link are used as input; more importance (weight) is given to keywords close to the link, with the maximum given to words in the anchor text; the output is a numerical quality estimate for the link. The link score is combined with the cosine similarity score (between the agent's keywords and the page with this link).
  56. 56. Adaptive Crawling: InfoSpiders. Each agent has an energy level. An agent moves from the current page to a new page if a Boltzmann function returns true, where δ is the difference between the similarity of the new page and that of the current page to the agent's keywords. If the energy level passes some threshold, the agent reproduces: the offspring gets half of the parent's frontier, and the offspring's keywords are mutated (expanded) with the most frequent terms in the parent's current document.
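The Boltzmann function itself is not preserved in this transcript. In the InfoSpiders literature it is usually given as a logistic function of δ, $\Pr(\delta) = \frac{1}{1 + e^{-\delta/T}}$ with a temperature parameter T; the sketch below assumes that form and a temperature of 0.1, both of which should be treated as assumptions rather than the slide's exact definition.

```python
import math
import random


def move_probability(delta, temperature=0.1):
    """Probability of moving to the new page: 1 / (1 + exp(-delta / T))."""
    return 1.0 / (1.0 + math.exp(-delta / temperature))


def boltzmann(delta, temperature=0.1, rng=random):
    """Return True with the probability given by move_probability."""
    return rng.random() < move_probability(delta, temperature)


for delta in (-0.3, 0.0, 0.3):
    print(delta, round(move_probability(delta), 3))
```

With this form an agent always has some chance of moving to a less similar page (negative δ), which lets it escape locally good but globally poor regions; a lower temperature makes the choice more deterministic.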
  57. 57. Adaptive Crawling: Pseudocode for InfoSpiders.
  58. 58. Adaptive Crawling: Pseudocode for InfoSpiders (cont.).
  59. 59. Adaptive Crawling: Ant-based crawling. Motivation: allow crawling agents to communicate with each other. Follows a model of the collective behaviour of social insects: ants leave pheromone along the path they follow, and other ants follow such pheromone trails. A crawler agent follows some path by visiting many URLs. At some moment, a certain amount of pheromone (weight) can be assigned to the sequence of URLs on the followed path. The amount can depend on the similarity of the visited pages to a given topic.
  60. 60. Adaptive Crawling: Ant-based crawling. Ants (crawlers) operate in cycles. During each cycle, agents make a predefined number of moves (page visits): number of moves = constant × cycle number. At the end of each cycle, pheromone intensity values are updated for the followed path, and the agent-ants return to their starting positions.
  61. 61. Adaptive Crawling: Ant-based crawling. The next link is selected with a probability defined by the corresponding pheromone intensity. If there is no pheromone information, an agent-ant moves randomly.
  62. 62. Adaptive Crawling: Ant-based crawling. Probability of selecting a link, where t is the cycle number, $\tau_{ij}(t)$ is the pheromone value between $p_i$ and $p_j$, and (i, l) designates the presence of a link from $p_i$ to $p_l$. During a cycle, each ant stores the list of visited URLs. If $p_j$ was already visited, $P_{ij}(t) = 0$. At the end of the cycle, the list of visited URLs is emptied.
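The probability formula is not preserved in this transcript. Given the symbols defined on the slide, a natural reading is $P_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l:(i,l)} \tau_{il}(t)}$; the sketch below assumes this form and, as a further assumption, normalises only over links to pages not yet visited in the current cycle. The function name and the dictionary-based data layout are illustrative.

```python
import random


def link_probabilities(current, out_links, pheromone, visited):
    """Selection probability for each non-visited page linked from `current`.

    pheromone maps (source, target) pairs to trail intensities tau.
    """
    candidates = [p for p in out_links.get(current, []) if p not in visited]
    if not candidates:
        return {}
    weights = [pheromone.get((current, p), 0.0) for p in candidates]
    total = sum(weights)
    if total == 0:
        # No pheromone information: the agent-ant moves randomly.
        return {p: 1.0 / len(candidates) for p in candidates}
    return {p: w / total for p, w in zip(candidates, weights)}


def choose_next(current, out_links, pheromone, visited, rng=random):
    probs = link_probabilities(current, out_links, pheromone, visited)
    if not probs:
        return None
    return rng.choices(list(probs), weights=list(probs.values()))[0]


links = {"p1": ["p2", "p3", "p4"]}
tau = {("p1", "p2"): 3.0, ("p1", "p3"): 1.0, ("p1", "p4"): 4.0}
print(link_probabilities("p1", links, tau, visited={"p4"}))
```

Page `p4` gets probability 0 because it was already visited in this cycle, and the remaining probability mass is split 3:1 between `p2` and `p3`.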
  63. 63. Adaptive Crawling: Implications. Strategies that evaluate links based on their context (nearby text) are not directly applicable to large-scale crawling. E.g., consider crawling 10^9 pages within one month. Crawl rate: around 400 documents per second, i.e., around 40,000 links per second. Every second, 10,000-30,000 "new" links have to be evaluated (scored) and added to the frontier. That is too many even for evaluating only the link's anchor text.
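The figures above follow from simple arithmetic, reproduced here as a check. The 30-day month and the value of 100 links per page (the rough average quoted earlier in the tutorial) are the only inputs; the function name is illustrative.

```python
def crawl_rates(pages, days, links_per_page):
    """Return (pages per second, links per second) for a crawl of given size."""
    seconds = days * 24 * 3600
    pages_per_second = pages / seconds
    return pages_per_second, pages_per_second * links_per_page


pages_per_second, links_per_second = crawl_rates(10**9, days=30, links_per_page=100)
print(round(pages_per_second), round(links_per_second))
```

This gives roughly 386 pages and 38,580 extracted links per second, which the slide rounds to 400 and 40,000.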
  64. 64. PART III: OPEN CHALLENGES
  65. 65. Outline of Part III: Open Challenges. Crawlers in Web ecosystem; collaborative web crawling; deep Web crawling; crawling multimedia content.
  66. 66. Crawlers in Web ecosystem: Push vs. Pull model. Web pages are accessed via the pull model: HTTP is a pull protocol, that is, a client requests a page from a server. With push, a server would send a page/info to a client. Why pull? Pull is just easier for both parties: no 'agreement' between provider and aggregator is needed, and no specific protocols are required of content providers, since serving content is enough. Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed.
  67. 67. Crawlers in Web ecosystem: Why not Push? Still, the pull model has several disadvantages. What are these?
  68. 68. Crawlers in Web ecosystem: Why not Push? Still, the pull model has several disadvantages. Publishing/updating content is easier with push: no need for redundant requests from crawlers. Providers get better control over the content: no need for crawler politeness.
  69. 69. Crawlers in Web ecosystem: Crawler politeness. Content providers possess some control over crawlers: via special protocols that define access to parts of a site, and via direct banning of agents that hit a site too often.
  70. 70. Crawlers in Web ecosystem: Crawler politeness. Robots.txt says what can(not) be crawled. Sitemaps is a newer protocol with which a site lists its URLs for crawlers, along with other info about them. Example: no agent should visit any URL starting with "yoursite/notcrawldir", except an agent called "goodsearcher". User-agent: * | Disallow: yoursite/notcrawldir | User-agent: goodsearcher | Disallow:
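How a crawler might honour such rules can be sketched with Python's standard `urllib.robotparser` module. Two details here are assumptions of this sketch: the Disallow path is written relative to the site root (`/notcrawldir`), as robots.txt files normally do, and the host `yoursite.example` and the agent name `somebot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded (and cached)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
blocked_url = "http://yoursite.example/notcrawldir/page.html"
print(rules.can_fetch("somebot", blocked_url))        # generic agent
print(rules.can_fetch("goodsearcher", blocked_url))   # explicitly allowed agent
```

An empty `Disallow:` line means "nothing is disallowed", which is why `goodsearcher` may fetch the URL while every other agent may not.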
  71. 71. Collaborative Crawling: Main considerations. Lots of redundant crawling. To get data (often on a specific topic) one needs to crawl broadly: often expertise is lacking when a large crawl is required; often one crawls a lot but uses only a small subset. Too many redundant requests hit content providers. Idea: have one crawler doing a very broad and intensive crawl, and many parties accessing the crawled data via an API, specifying filters to select the required pages. Crawler as a common service.
  72. 72. Collaborative Crawling: Some requirements. Filter language for specifying conditions. Efficient filter processing (millions of filters to process). Efficient fetching (hundreds of pages per second). Support for real-time requests.
  73. 73. Collaborative Crawling: New component. Process a stream of documents against a filter index.
  74. 74. Collaborative Crawling: Filter processing architecture.
  75. 75. Collaborative Crawling: Filter processing architecture.
  76. 76. Collaborative Crawling. Based on 'The architecture and implementation of an extensible web crawler' by Hsieh, Gribble, Levy, 2010 (illustrations on slides 61-62 from Hsieh's slides). E.g., 80legs provides similar crawling services. In a way, this reconsiders the pull/push model of content delivery on the Web.
  77. 77. Deep Web Crawling. [Illustration: visualization of http://amazon.com by aharef.info applet]
  78. 78. Deep Web Crawling: In a nutshell. The problem is in the yellow nodes (designating web form elements).
  79. 79. Content hidden behind HTML forms. Deep Web: the part of the Web not accessible through search engines. My preferred definition: Deep Web is the content behind web search forms on publicly available pages. Pages with forms themselves are typically accessible/searchable (i.e., crawled).
  80. 80. Why is it important? Large source of structured data: forms present a search interface over backend databases. Significant gap in search engine coverage: potentially more content than is currently searchable; more than 10 million distinct HTML forms; likely to increase as more data comes online. The size of the deep Web is unclear: 500x figures are highly disputable; the number of resources is a bit simpler to estimate (~450k databases on the Web in 2004); some part of deep web content is crawled/covered by search engines; content can be both searched and browsed via links categorizing content; business-driven sites (e.g., shopping) typically provide both ways of access.
  81. 81. Why crawlers do not crawl the deep Web. Crawlers can't pass through forms (some values need to be specified), i.e., content is "hidden" behind search forms; this is the reason for another name for the deep Web: hidden Web. To crawl/access the content behind a form, the following is required: identify a search form on a page; fill the form with proper values; submit the form; get the result pages; extract links/data from them.
  82. 82. Approaches to deep Web crawling: Google's Deep Web Crawl (2008). Identify search forms. Pre-compute all interesting form submissions for each HTML form. Each form submission corresponds to a distinct URL. Add the URLs of the form submissions to the search engine index. This allows reuse of the existing search engine infrastructure. No aim for full coverage of a deep web resource. Not all forms are covered (only GET forms).
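Why each GET form submission corresponds to a distinct URL can be shown with a small sketch: the field values are simply URL-encoded into the query string of the form's action URL. The function below enumerates every combination of candidate values, which is a simplification; the approach described above pre-computes only the interesting submissions. The action URL, field names and values are made up for illustration.

```python
from itertools import product
from urllib.parse import urlencode


def submission_urls(action_url, field_values):
    """URLs for all combinations of candidate values of a GET form's fields.

    field_values maps a form field name to a list of candidate values.
    """
    names = sorted(field_values)
    urls = []
    for combo in product(*(field_values[name] for name in names)):
        urls.append(action_url + "?" + urlencode(list(zip(names, combo))))
    return urls


candidates = {"make": ["ford", "fiat"], "year": ["2012"]}
for url in submission_urls("http://cars.example/search", candidates):
    print(url)
```

URLs generated this way can be added to the ordinary crawl frontier, which is what allows the existing crawling and indexing infrastructure to be reused unchanged.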
  83. 83. Deep Web site identification. Task: identify a search form leading to content-rich web pages. Surprisingly, quite a challenging task. One of the problems: detecting whether a form is searchable.
  84. 84. What are deep Web resources? Examples: store locations, used cars, radio stations, patents, recipes. Searchable forms: non-searchable are login forms and those that require user info; for highly interactive forms, e.g., airline reservations, it depends.
  86. 86. Deep Web site identification. Detect whether a form is informational: challenging for a human too, e.g., assume a form is in an unknown language. Detection is done by building/training binary classifiers. Forms identified as searchable can then be classified into domains (e.g., car search, apartment search), based on form structure (e.g., number of fields) and on form field labels. Slow process: done by a specific component in offline mode.
  87. 87. Crawling JavaScript-rich sites. Web pages have become more responsive, interactive, user-friendly, etc., thanks to the emergence of new web technologies such as AJAX. Besides, these technologies led to the wide spread of web applications (RIAs). A challenge for crawlers, as they do not manipulate the client side of a site or take into account asynchronous communication with the server.
  88. 88. Crawling JavaScript-rich sites. Very similar to the deep Web crawling challenge: content is hard to crawl. Direct problem: AJAX/JS-enabled forms are hard to deal with (e.g., to detect and then generate meaningful queries for). Web pages are designed for human beings, not for automatic programs. JS code has to be processed to get the actual content: it changes dynamically, and lots of additional resources are required (the crawler has to be supplemented with a JS interpreter).
  89. 89. Crawling JavaScript-rich sites. Several techniques for AJAX crawling have been proposed since 2007/08; the focus is either on indexing and searching or on testing RIAs. Approach: an AJAX-enabled web page/application is modeled using states, events and transitions. The crawler uses a breadth-first strategy: it triggers the events on a page; if the DOM of the page changes, a new state/transition is added to the transition graph; then it goes back to the initial state to invoke the next event.
  90. 90. Crawling Multimedia Content. The Web is now a multimedia platform: images, video and audio are an integral part of web pages (not just supplementing them). Almost all crawlers, however, treat the Web as a textual repository. One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications, hence there has been no real need to harvest multimedia. But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality, e.g., searching for images with a cat and a man based on the actual image content (not the text around/close to the image). In the case of video: a set of frames plus audio (which can be converted to textual form).
  91. 91. Crawling Multimedia Content: Challenges in crawling multimedia. Bigger load on web sites, since files are bigger. More apparent copyright issues. More resources (e.g., bandwidth, storage space) required from a crawler. More complicated duplicate resolution. Re-visiting policy.
  92. 92. Crawling Multimedia Content: Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu). Focus on web archiving issues. Uses several crawlers: a 'standard' crawler for regular web pages; an API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube); a deep Web crawler able to extract information from pre-defined web sites. Data can be exported in WARC (Web ARChive) files and in RDF.
  93. 93. Future Directions. Collaborative crawling, mixed pull-push model; scalable adaptive strategies; understanding site structure; deep Web crawling; Semantic Web crawling; media content crawling; social network crawling.
  94. 94. References: Crawl Datasets. Use for building your crawls, web graph analysis, web data mining tasks, etc. ClueWeb09 Dataset: http://lemurproject.org/clueweb09.php/; one billion web pages, in ten languages; 5 TB compressed; hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for disks). ClueWeb12: almost 900 million English web pages.
  95. 95. References: Crawl Datasets. Common Crawl Corpus: see http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740; around six billion web pages; over 100 TB uncompressed; available as an Amazon Web Services public dataset (pay for processing).
  96. 96. References: Crawl Datasets. Internet Archive: see http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea; crawl of 2011; 80 TB of WARC files; 2.7 billion pages; includes multimedia data; available by request.
  97. 97. References: Crawl Datasets. LAW Datasets: http://law.dsi.unimi.it/datasets.php; a variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!); thoroughly studied in a number of publications. ICWSM 2011 Spinn3r Dataset: http://www.icwsm.org/data/; 130 million blog posts and 230 million social media publications; 2 TB compressed. Academic Web Link Database Project: http://cybermetrics.wlv.ac.uk/database/; crawls of national universities' web sites.
  98. 98. References: Literature. For beginners: Udacity/CS101 course; http://www.udacity.com/overview/Course/cs101. Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan, Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf. Intermediate: Current Challenges in Web Crawling tutorial at ICWE 2013 by Shestakov; http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling. Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
  99. 99. References: Literature. See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/. Feel free to join the group! Check the 'Deep Web' group too: http://www.mendeley.com/groups/601801/deep-web/
