Current challenges in web crawling

Tutorial at ICWE'13, Aalborg, Denmark, 08.07.2013

Presentation Transcript

• Slide 1/79: Current Challenges in Web Crawling
  ICWE 2013 Tutorial
  Denis Shestakov, Department of Media Technology, School of Science, Aalto University, Finland
  firstname.lastname@aalto.fi
  Version 1.4: 08.07.2013
• Slide 2/79: Speaker's Bio
  Postdoc in the Web Services Group, Aalto University, Finland
  PhD dissertation on the limited coverage of web crawlers
  Over ten years of experience in the area
• Slide 3/79: Speaker's Bio
  http://www.linkedin.com/in/dshestakov
  http://www.mendeley.com/profiles/denis-shestakov/
  http://www.tml.tkk.fi/~denis/
• Slide 4/79: Tutorial Outline
  OVERVIEW
  - Web crawling in a nutshell
  - Web structure & statistics
  - Large-scale crawling
  Coffee Break
  CHALLENGES
  - Collaborative web crawling
  - Crawling the deep Web
  - Crawling the multimedia content
  - Future directions
• Slide 5/79: PART I: OVERVIEW
  Visualization of http://media.tkk.fi/webservices by the aharef.info applet
• Slide 6/79: Outline of Part I
  Overview of Web Crawling
  - Web crawling in a nutshell
  - Applications
  - Industry vs. academia
  - Web ecosystem and crawling
  Web Structure & Statistics
  Large-scale Crawling
  - Basic architecture
  - Implementations
  - Design issues and considerations
• Slide 7/79: Web Crawling in a Nutshell
  Automatic harvesting of web content
  Done by web crawlers (also known as robots, bots, or spiders)
  Follow a link from a set of links (the URL queue), download the page, extract all its links, eliminate those already visited, add the rest to the queue; then repeat
  A set of policies is involved (like 'ignore links to images', etc.)
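The loop above is small enough to sketch directly. Below is a minimal, illustrative Python version using only the standard library; the seed URL, the page limit, and the http-only policy are assumptions of the sketch, and the politeness and robots.txt handling discussed on later slides are omitted.

  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  class LinkExtractor(HTMLParser):
      """Collects the href of every <a> tag seen while parsing."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(seed, max_pages=10):
      frontier = deque([seed])   # the URL queue
      visited = set()            # URLs already fetched
      while frontier and len(visited) < max_pages:
          url = frontier.popleft()
          if url in visited:
              continue
          visited.add(url)
          try:
              page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
          except OSError:
              continue           # unreachable page: skip it
          parser = LinkExtractor()
          parser.feed(page)
          for link in parser.links:
              absolute = urljoin(url, link)   # resolve relative links
              if absolute.startswith("http") and absolute not in visited:
                  frontier.append(absolute)
      return visited

  # crawl("http://media.tkk.fi/webservices")  # the seed used in the slide's example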
• Slide 8/79: Web Crawling in a Nutshell
  Example:
  1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
  2. Extract the URLs inside blue bubbles (designating <a> tags)
  3. Remove already-visited URLs
  4. For each non-visited URL, start at Step 1
• Slide 9/79: Web Crawling in a Nutshell
  In essence: a simple and naive process
  However, a number of imposed 'restrictions' make it much more complicated
  Most complexities are due to the operating environment (the Web)
  For example: do not overload web servers (challenging, as the distribution of web pages over web servers is non-uniform)
  Or: avoid web spam (not only useless, but it consumes resources and often spoils the collected content)
• Slide 10/79: Web Crawling in a Nutshell: Crawler Agents
  The first, in 1993: the Wanderer (written in Perl)
  Over 1100 different crawler signatures (the User-Agent string in the HTTP request header) are listed at http://www.crawltrack.net/crawlerlist.php
  Educated guess on the overall number of different crawlers: at least several thousand
  Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
  Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
• Slide 11/79: Web Crawling in a Nutshell: Crawler Agents
  For advanced needs, you may modify the code of existing projects in your preferred programming language
  Crawlers play a big role on the Web:
  - They bring more traffic to certain web sites than human visitors do
  - They generate a sizeable portion of the traffic to any (public) web site
  - Crawler traffic is important for emerging web sites
• Slide 12/79: Web Crawling in a Nutshell: Classification
  General/universal crawlers
  - Not so many of them; lots of resources required
  - Big web search engines
  Topical/focused crawlers
  - Pages/sites on a certain topic
  - Crawling everything in one specific (e.g., national) web segment is rather general, though
  Batch crawling
  - One or several (static) snapshots
  Incremental/continuous crawling
  - Re-visiting
  - Resources are divided between fetching newly discovered pages and re-downloading previously crawled pages
  - Search engines
• Slide 13/79: Applications of Web Crawling: Web Search Engines
  Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
  One of the three underlying technology stacks
• Slide 14/79: Applications of Web Crawling: Web Search Engines
  One of the three underlying technology stacks
  BTW, what are the other two, and which is the most 'crucial'?
• Slide 15/79: Applications of Web Crawling: Web Search Engines
  What are the other two, and which is the most 'crucial'?
  The query processor (particularly, ranking)
• Slide 16/79: Applications of Web Crawling: Web Archiving
  Digital preservation: a 'librarian' look at the Web
  The biggest: the Internet Archive
  Quite huge collections; batch crawls
  Primarily collections of national web sites: web sites at country-specific TLDs or physically hosted in a country
  There are quite many, and some are huge! See the list of Web Archiving Initiatives at Wikipedia
• Slide 17/79: Applications of Web Crawling: Vertical Search Engines
  Aggregating data from many sources on a certain topic
  E.g., apartment search, car search
• Slide 18/79: Applications of Web Crawling: Web Data Mining
  "To get data to be actually mined"
  Usually done with focused crawlers
  For example, opinion mining
  Or digests of current happenings on the Web (e.g., what music people listen to now)
• Slide 19/79: Applications of Web Crawling: Web Monitoring
  Monitoring sites/pages for changes and updates
• Slide 20/79: Applications of Web Crawling: Detection of Malicious Web Sites
  Typically part of an anti-virus, firewall, or search engine service
  Building a list of such web sites and informing users about the potential threat of visiting them
• Slide 21/79: Applications of Web Crawling: Web Site/Application Testing
  Crawl a web site to check navigation through it, the validity of its links, etc.
  Regression/security/... testing of a rich internet application (RIA) via crawling
  Checking different application states by simulating possible user-interaction events (e.g., mouse click, time-out)
• Slide 22/79: Applications of Web Crawling: Fighting Crime! :)
  Well, copyright violations
  Crawl to find (media) items under copyright, or links to them
  Regularly re-visit 'suspicious' web sites, forums, etc.
  Tasks like finding terrorist chat rooms also go here
• Slide 23/79: Applications of Web Crawling: Web Scraping
  Extracting particular pieces of information from a group of typically similar pages
  Used when no API to the data is available
  Interestingly, scraping may be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API
• Slide 24/79: Applications of Web Crawling: Web Mirroring
  Copying web sites
  Often hosting the copies on different servers to ensure constant accessibility
• Slide 25/79: Industry vs. Academia
  In the web crawling domain, there is a huge lag between industrial and academic web crawlers
  - Both research-wise and development-wise
  - The algorithms, techniques, and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
  Industrial crawlers operate at web scale (= dozens of billions of pages)
  - Only a few (three?) academic crawlers have dealt with more than one billion pages
  - Academic scale is rather hundreds of millions
• Slide 26/79: Industry vs. Academia
  Re-crawling
  - Batch crawls in academia
  - Regular re-crawls by industrial crawlers
  Evaluation of crawled data
  - And hence corrections/improvements to the crawlers
  - Direct evaluation by the users of search engines
  - To some extent, artificial evaluation of academic crawls
• Slide 27/79: Industry vs. Academia
  Industrial (search engines') crawlers are much more appreciated
  - Eventually they attract visitors (= revenue/prestige/influence/...)
  - It makes perfect sense to trick them
  Academic crawlers just consume resources (e.g., network bandwidth)
  - They don't bring anything
  - No point in playing tricks on them (assuming the site administrator bothers to differentiate them from search engines' bots)
• Slide 28/79: Web Ecosystem and Crawling: Pull vs. Push Model
  Web content providers (site owners) vs. web aggregators (crawler operators)
  The aggregator pulls content; content is not pushed to aggregators
• Slide 29/79: Web Ecosystem and Crawling: Why not Push?
  Pull is just easier for both parties
  No 'agreement' needed between provider and aggregator
  No specific protocols required of content providers: serving content is enough
  Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed
• Slide 30/79: Web Ecosystem and Crawling: Why not Push?
  Still, the pull model has several disadvantages
  What are they?
• Slide 31/79: Web Ecosystem and Crawling: Why not Push?
  Still, the pull model has several disadvantages
  Push would avoid redundant requests from crawlers and would give providers more control over their content
• Slide 32/79: Web Ecosystem and Crawling: Crawler Politeness
  Content providers possess some control over crawlers
  - Via special protocols defining access to parts of a site
  - Via directly banning agents that hit a site too often
• Slide 33/79: Web Ecosystem and Crawling: Crawler Politeness
  Robots.txt says what can(not) be crawled
  Sitemaps is a newer protocol for listing a site's URLs and other crawler-relevant info
  Example: no agent should visit any URL starting with "yoursite/notcrawldir", except the agent called "goodsearcher":
  User-agent: *
  Disallow: /notcrawldir
  User-agent: goodsearcher
  Disallow:
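Python's standard library can evaluate such rules before fetching; a minimal sketch follows (the site URL is a placeholder, and the rules are assumed to be the ones from the example above).

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://yoursite.example/robots.txt")  # hypothetical site
  rp.read()                                         # fetch and parse the rules

  # An arbitrary agent is blocked from the restricted directory...
  print(rp.can_fetch("somebot", "http://yoursite.example/notcrawldir/a.html"))       # False
  # ...while the exempted agent may crawl it.
  print(rp.can_fetch("goodsearcher", "http://yoursite.example/notcrawldir/a.html"))  # True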
• Slide 34/79: Web Structure & Statistics: Some Numbers
  The number of pages per host is not uniform: most hosts contain only a few pages, while others contain millions
  Roughly 100 links on a page
  Must try to keep all crawling threads busy
  According to Google statistics (over 4 billion pages, 2010), fetching a page transfers 320KB (textual content plus all embeddings)
  A page carries 10-100KB of textual (HTML) content on average
  One trillion URLs were known to Google/Yahoo in 2008
• Slide 35/79: Web Structure & Statistics: Some Numbers
  20 million web pages in 1995 (indexed by AltaVista)
  One trillion (10^12) URLs known to Google/Yahoo in 2008
  - The 'independent' search engine Majestic12 (P2P crawling) confirms one billion items
  - This doesn't mean one trillion indexed pages; supposedly, the index holds dozens of times fewer pages
  Cool crawler fact: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
  - Throughput: 1000-1500 pages per second
  - Over 30 billion discovered URLs
• Slide 36/79: Web Structure & Statistics: Bow-Tie Model of the Web
  Illustration taken from http://dx.doi.org/doi:10.1038/35012155
• Slide 37/79: Basic Crawler Architecture: the Crawler Crawls the Web
  Illustration taken from CMSC 476/676 course slides by Charles Nicholas
• Slide 38/79: Basic Crawler Architecture: Typically in a Distributed Fashion
  Illustration taken from CMSC 476/676 course slides by Charles Nicholas
• Slide 39/79: Basic Crawler Architecture: URL Frontier
  May include multiple pages from the same host
  Must avoid trying to fetch them all at the same time
  Must try to keep all crawling threads busy
  Prioritization also helps
• Slide 40/79: Basic Crawler Architecture: Crawler Architecture
  Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
• Slide 41/79: Basic Crawler Architecture: DNS
  Given a URL, retrieve the IP address of its host
  DNS is a distributed service, so lookup latencies can be high (seconds), yet it is a critical component
  Common implementations of DNS lookup (e.g., nslookup) are synchronous: one request at a time
  Remedies: asynchronous DNS resolving, pre-caching, batch DNS resolving
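In Python, batch resolving can be approximated by running the standard synchronous lookup in a thread pool; a small sketch (the host list and pool size are illustrative):

  import socket
  from concurrent.futures import ThreadPoolExecutor

  def resolve(host):
      """One blocking lookup; failures are reported rather than raised."""
      try:
          return host, socket.gethostbyname(host)
      except socket.gaierror:
          return host, None

  hosts = ["example.com", "aalto.fi", "archive.org"]   # hypothetical batch
  with ThreadPoolExecutor(max_workers=50) as pool:
      dns_cache = dict(pool.map(resolve, hosts))       # lookups overlap in time
  print(dns_cache)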
• Slide 42/79: Basic Crawler Architecture
  Content seen?
  - If the fetched page is already in the base/index, don't process it
  - Document fingerprints (shingles)
  Filtering
  - Filter out URLs due to 'politeness' or restrictions on the crawl
  - Fetched robots.txt files are cached to avoid requesting them repeatedly
  Duplicate URL elimination
  - Check whether an extracted and filtered URL has already been passed to the frontier (batch crawling)
  - More complicated in continuous crawling (a different URL frontier implementation)
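One way to realize the 'content seen?' test with shingles, sketched in Python; the 4-word shingle size and the truncated MD5 fingerprints are illustrative choices, not the scheme of any particular crawler:

  import hashlib

  def shingle_fingerprints(text, k=4):
      """Hash every k-word window; pages sharing most hashes are near-duplicates."""
      words = text.lower().split()
      shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
      return {hashlib.md5(s.encode()).hexdigest()[:16] for s in shingles}

  def resemblance(a, b):
      """Jaccard overlap of two fingerprint sets (1.0 = identical content)."""
      fa, fb = shingle_fingerprints(a), shingle_fingerprints(b)
      return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

  print(resemblance("the quick brown fox jumps over the lazy dog",
                    "the quick brown fox jumped over the lazy dog"))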
• Slide 43/79: Basic Crawler Architecture: Distributed Crawling
  Run multiple crawl threads, under different processes (often on different nodes)
  Nodes can be geographically distributed
  The hosts being crawled are partitioned across the nodes
• Slide 44/79: Basic Crawler Architecture: Host Splitter
  Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
• Slide 45/79: Implementations
  Popular languages: Perl, Java, Python, C/C++
  Libraries for HTTP fetching, HTML parsing, and asynchronous DNS resolving
  Open-source, in Java: Heritrix, Nutch
• Slide 46/79: Implementations
  Simple code example in Perl
• Slide 47/79: Large-scale Crawling: Objectives
  High web coverage, high page freshness, high content quality, high download rate
  Internal (I) and external (E) factors:
  - Amount of hardware (I)
  - Network bandwidth (I)
  - Rate of web growth (E)
  - Rate of web change (E)
  - Amount of malicious content, i.e. spam and duplicates (E)
• Slide 48/79: Large-scale Crawling: Architecture of a Sequential Crawler
  Seeds: the list of starting URLs
  The order of page visits is determined by the frontier data structure
  Stop condition (e.g., X pages fetched)
  Illustration taken from Ch. 8, Web Crawling, by Filippo Menczer in Bing Liu's Web Data Mining (Springer, 2007)
• Slide 49/79: Large-scale Crawling: Graph Traversal
  Breadth-first search
  - Implemented with a QUEUE (FIFO)
  - Reaches pages along shortest paths first
  Depth-first search
  - Implemented with a STACK (LIFO)
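The two traversals differ only in which end of the frontier is popped, which a deque makes explicit; a tiny sketch with placeholder URLs:

  from collections import deque

  frontier = deque(["seed-1", "seed-2"])   # hypothetical URL frontier
  frontier.extend(["a", "b", "c"])         # links discovered on a page

  bfs_next = frontier.popleft()  # FIFO queue: breadth-first, shortest paths first
  dfs_next = frontier.pop()      # LIFO stack: depth-first, newest link first
  print(bfs_next, dfs_next)      # seed-1 c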
• Slide 50/79: Large-scale Crawling: Some Implementation Notes
  Get only the first part of each page (10-100KB)
  Detect redirection loops
  Handle all possible errors (e.g., a server not responding), timeouts, etc.
  Deal with lots of invalid HTML
  Take care of dynamic pages
  - Some are 'spider traps' (think of the 'Next month' link on a calendar)
  - E.g., limit the number of pages per host
• Slide 51/79: Large-scale Crawling: Delays in Crawling
  Resolving the host name to an IP address
  Connecting a socket to the server and sending the request
  Receiving the requested page in response
  Overlap these delays by fetching many pages concurrently
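A sketch combining the last two slides: capped, time-limited downloads issued concurrently from a thread pool. The byte cap, worker count, and URLs are illustrative assumptions.

  from concurrent.futures import ThreadPoolExecutor
  from urllib.request import urlopen

  MAX_BYTES = 100 * 1024   # keep only the first ~100KB of each page

  def fetch(url):
      """One capped, time-limited download; failures return None instead of raising."""
      try:
          with urlopen(url, timeout=10) as resp:
              return url, resp.read(MAX_BYTES)
      except OSError:
          return url, None

  urls = ["http://example.com/", "http://example.org/"]   # hypothetical batch
  with ThreadPoolExecutor(max_workers=100) as pool:       # many requests in flight
      for url, body in pool.map(fetch, urls):
          print(url, "failed" if body is None else "%d bytes" % len(body))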
• Slide 52/79: Large-scale Crawling: Architecture of a Concurrent Crawler
  Illustration taken from Ch. 8, Web Crawling, by Filippo Menczer in Bing Liu's Web Data Mining (Springer, 2007)
• Slide 53/79: Large-scale Crawling Design Points: Frontier Data Structure
  Most links on a page refer to the same site/server
  - Note: remember virtual hosting
  Problem with a single FIFO queue: too many requests to the same server
  A common policy is to delay the next request to a server by, say, 10x the time it took to download the last page from it
  The 'Mercator' scheme adds further per-host queues behind the frontier queue
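A minimal Python sketch of such per-host queues under the 10x-delay policy from the slide; the data structures and function names are illustrative, not Mercator's actual design:

  import heapq
  import time
  from collections import defaultdict, deque
  from urllib.parse import urlsplit

  DELAY_FACTOR = 10                 # wait 10x the last download time per host
  host_queues = defaultdict(deque)  # one FIFO 'back queue' per host
  ready = []                        # heap of (next_allowed_time, host)

  def add_url(url):
      host = urlsplit(url).hostname
      if not host_queues[host]:     # newly seen host: contactable immediately
          heapq.heappush(ready, (time.monotonic(), host))
      host_queues[host].append(url)

  def pop_url():
      """Return (host, url) for the next host that may politely be contacted."""
      while ready:
          allowed_at, host = heapq.heappop(ready)
          if host_queues[host]:
              time.sleep(max(0.0, allowed_at - time.monotonic()))
              return host, host_queues[host].popleft()
      return None

  def mark_done(host, download_seconds):
      """After a fetch, push the host's next allowed contact time into the heap."""
      if host_queues[host]:
          heapq.heappush(ready,
                         (time.monotonic() + DELAY_FACTOR * download_seconds, host))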
• Slide 54/79: Large-scale Crawling Design Points: URL Seen Test
  Needed so as not to add multiple instances of a URL to the frontier
  For batch crawling, two operations are required: insertion and membership testing
  For continuous crawling, one more: deletion
  URLs are stored compressed (e.g., as 10-byte hash values)
  In-memory implementations: hash table, Bloom filter
  Search engines keep all URLs in memory in the crawling cluster (a hash table partitioned across nodes; partitioning can be based on the host part of the URL)
• Slide 55/79: Large-scale Crawling Design Points: URL Seen Test
  If keeping everything in memory is not possible, a disk-based hash table with caching is used
  This limits the crawling rate to tens of pages per second, as disk lookups are slow
  To scale, sequential reads/writes are faster and thus used
  'Mercator/IRLbot' scheme: merge (read and write) the sorted hashes of visited URLs on disk with the hashes of just-extracted URLs
  The delay due to batch merging is manageable
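Of the in-memory options above, the Bloom filter is the most compact; a small sketch (the 1MB bit array and four hash functions are illustrative parameters):

  import hashlib

  class BloomFilter:
      """In-memory URL-seen test: no false negatives, tunable false-positive rate."""
      def __init__(self, bits=8 * 1024 * 1024, hashes=4):
          self.bits, self.hashes = bits, hashes
          self.array = bytearray(bits // 8)

      def _positions(self, url):
          digest = hashlib.sha1(url.encode()).digest()   # 20 bytes; 4 used per hash
          for i in range(self.hashes):
              yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

      def add(self, url):
          for p in self._positions(url):
              self.array[p // 8] |= 1 << (p % 8)

      def __contains__(self, url):
          return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(url))

  seen = BloomFilter()
  seen.add("http://example.com/")
  print("http://example.com/" in seen)   # True
  print("http://example.org/" in seen)   # False (with high probability)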
• Slide 56/79: PART II: CHALLENGES
• Slide 57/79: Outline of Part II
  Challenges in Web Crawling
  - Collaborative crawling
  - Deep Web crawling: crawling content behind search forms; crawling JavaScript-rich web sites
  - Crawling multimedia
  - Other challenges in crawling
  Future Directions
  References
• Slide 58/79: Collaborative Crawling: Main Considerations
  Lots of redundant crawling
  - To get data (often on a specific topic), one needs to crawl broadly
  - Often there is a lack of expertise when a large crawl is required
  - Often, a lot is crawled but only a small subset is used
  Too many redundant requests to content providers
  Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
  - Parties specify filters to select the pages they need
  The crawler as a common service
• Slide 59/79: Collaborative Crawling: Some Requirements
  A filter language for specifying conditions
  Efficient filter processing (millions of filters to process)
  Efficient fetching (hundreds of pages per second)
  Support for real-time requests
• Slide 60/79: Collaborative Crawling: New Component
  Processes the stream of fetched documents against a filter index
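As a toy illustration of that component, the sketch below matches a document stream against subscriber filters; the keyword-set filters stand in for the real filter language, and all names are hypothetical:

  # Hypothetical filter index: subscriber -> set of required keywords.
  filters = {
      "apartment-search": {"apartment", "rent"},
      "car-search": {"car", "dealer"},
  }

  def match(document_text):
      """Return the subscribers whose filter the document satisfies."""
      words = set(document_text.lower().split())
      return [name for name, required in filters.items() if required <= words]

  # As pages stream out of the shared crawl, route each to the interested parties.
  print(match("Two-bedroom apartment for rent in Aalborg"))  # ['apartment-search']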
• Slide 61/79: Collaborative Crawling: Filter Processing Architecture
• Slide 62/79: Collaborative Crawling: Filter Processing Architecture
• Slide 63/79: Collaborative Crawling
  Based on 'The architecture and implementation of an extensible web crawler' by Hsieh, Gribble, and Levy, 2010 (the illustrations on slides 61-62 are from Hsieh's slides)
  E.g., 80legs provides similar crawling services
  In a way, this reconsiders the pull/push model of content delivery on the Web
• Slide 64/79: Deep Web Crawling
  Visualization of http://amazon.com by the aharef.info applet
• Slide 65/79: Deep Web Crawling: In a Nutshell
  The problem lies in the yellow nodes (designating web form elements)
• Slide 66/79: Deep Web Crawling
  See the slides on deep Web crawling at http://goo.gl/Oohoo
• Slide 67/79: Crawling Multimedia Content
  The Web is now a multimedia platform: images, video, and audio are an integral part of web pages (not just supplements to them)
  Almost all crawlers, however, treat the Web as a textual repository
  One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications; hence, no real need to harvest multimedia
  But state-of-the-art multimedia retrieval and computer vision techniques already provide adequate search quality
  - E.g., searching for images with a cat and a man based on the actual image content (not the text around/close to the image)
  - In the case of video: a set of frames plus audio (which can be converted to textual form)
• Slide 68/79: Crawling Multimedia Content: Challenges
  Bigger load on web sites, since the files are bigger
  More apparent copyright issues
  More resources (e.g., bandwidth, storage space) required from the crawler
  More complicated duplicate resolution
  Re-visiting policy
• Slide 69/79: Crawling Multimedia Content: Approaches
  Utilize metadata: fetch and analyse a small metadata file to decide on the full download
  Intelligent crawling: better ranking of URLs in the frontier (based on the specified domain of the crawl)
  Move from the pull to the push model
  API-directed crawling
  - Access data via predefined APIs
  - Requires annotation/discovery of such APIs
  Technically: use an additional component for the multimedia crawl
  - With its own URL queue
  - The main crawler component provides it with URLs to multimedia
  - In return, it sends feedback to the main crawler to better score links in the frontier
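The metadata-first idea can be approximated even without a separate metadata file by checking HTTP headers before committing bandwidth; a hedged sketch (the size budget and the accepted media types are assumptions):

  from urllib.request import Request, urlopen

  MAX_MEDIA_BYTES = 50 * 1024 * 1024   # illustrative per-file budget

  def should_download(url):
      """Inspect type and size via a HEAD request before the full download."""
      try:
          resp = urlopen(Request(url, method="HEAD"), timeout=10)
      except OSError:
          return False
      ctype = resp.headers.get("Content-Type", "")
      clen = int(resp.headers.get("Content-Length") or 0)
      return ctype.startswith(("image/", "video/", "audio/")) and clen <= MAX_MEDIA_BYTES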
• Slide 70/79: Crawling Multimedia Content
  The scalable multimedia web observatory of the ARCOMEM project (http://www.arcomem.eu)
  Focus on web archiving issues
  Uses several crawlers:
  - A 'standard' crawler for regular web pages
  - An API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube)
  - A deep Web crawler able to extract information from pre-defined web sites
  Data can be exported as WARC (Web ARChive) files and as RDF
• Slide 71/79: Other Crawling Challenges
  Ordering policy
  - Resources are limited, while the number of pages to visit is essentially infinite
  - The decision should be made based on the URL itself
  - PageRank-like metrics can be used
  - More complicated in the case of incremental crawls
  Focused crawling
  - Avoid links leading to content outside the topic of interest
  - The content of a page can be taken into account when deciding whether a particular link is worth following
  - Setting a good seed is a challenge
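An ordering policy based only on the URL can be expressed as a priority frontier; in the sketch below the scoring heuristic (prefer shallow paths, demote dynamic-looking URLs) is purely illustrative:

  import heapq
  from urllib.parse import urlsplit

  frontier = []   # min-heap of (priority, url); lower value = crawled sooner

  def score(url):
      """URL-only heuristic: shallow paths first, query strings demoted."""
      depth = urlsplit(url).path.count("/")
      penalty = 5 if "?" in url else 0
      return depth + penalty

  def push(url):
      heapq.heappush(frontier, (score(url), url))

  for u in ["http://example.com/", "http://example.com/a/b/c?id=7",
            "http://example.com/news/"]:
      push(u)
  print([heapq.heappop(frontier)[1] for _ in range(len(frontier))])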
• Slide 72/79: Other Crawling Challenges
  Re-visiting policy
  Generating good seed URLs
  Avoiding redundant content
  - Avoid visiting duplicate pages (different URLs leading to identical or near-identical content)
  - Near-duplicates can be very tricky (think of a news item propagating across the Web)
  - Avoid crawler traps
  - Avoid useless content (i.e., web spam)
• Slide 73/79: Future Directions
  Collaborative crawling and a mixed pull-push model
  Understanding site structure
  Deep Web crawling
  Media content crawling
  Social network crawling
• Slide 74/79: References: Crawl Datasets
  Useful for building your own crawls, web graph analysis, web data mining tasks, etc.
  ClueWeb09 Dataset:
  - http://lemurproject.org/clueweb09.php/
  - One billion web pages, in ten languages
  - 5TB compressed
  - Hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for the disks)
  ClueWeb12:
  - Almost 900 million English web pages
• Slide 75/79: References: Crawl Datasets
  Common Crawl Corpus:
  - See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740
  - Around six billion web pages
  - Over 100TB uncompressed
  - Available as an Amazon Web Services public dataset (pay for processing)
• Slide 76/79: References: Crawl Datasets
  Internet Archive:
  - See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-resea
  - Crawl of 2011
  - 80TB of WARC files
  - 2.7 billion pages
  - Includes multimedia data
  - Available by request
• Slide 77/79: References: Crawl Datasets
  LAW Datasets:
  - http://law.dsi.unimi.it/datasets.php
  - A variety of web graph datasets (nodes, arcs, etc.), including basic properties of recent Facebook graphs (!)
  - Thoroughly studied in a number of publications
  ICWSM 2011 Spinn3r Dataset:
  - http://www.icwsm.org/data/
  - 130 million blog posts and 230 million social media publications
  - 2TB compressed
  Academic Web Link Database Project:
  - http://cybermetrics.wlv.ac.uk/database/
  - Crawls of national universities' web sites
• Slide 78/79: References: Literature
  For beginners: the Udacity CS101 course; http://www.udacity.com/overview/Course/cs101
  Intermediate: Chapter 20 of Introduction to Information Retrieval by Manning, Raghavan, and Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
  Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017
• Slide 79/79: References: Literature
  See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/
  Feel free to join the group!
  Check the 'Deep Web' group too: http://www.mendeley.com/groups/601801/deep-web/