Web Crawling [2006]




Presentation for web mining course, 2006.




Usage Rights

CC Attribution License

    Web Crawling [2006] Presentation Transcript

    • Web Crawling Updated: August 2006 Carlos Castillo http://www.chato.cl/ http://www.tejedoresdelweb.com/ 1
    • What is a Web crawler? Recursive downloader of pages – starting from “seed” pages 2
    • History June 1993, Matthew Gray (undergrad/MIT) in www-talk mailing list: “I've written a Perl script that wanders the WWW collecting URLs, keeping track of where it's been and new hosts that it finds. Eventually, after hacking up the code to return some slightly more useful information, I will produce a searchable index of this”. 3
    • History (cont.) June 1994, Brian Pinkerton (Ph.D. student/UW) in comp.infosystems.announce: “The WebCrawler index is now available for searching! The index is broad: it contains information from as many different servers as possible. It's a great tool for locating several different starting points for exploring by hand. The current index is based on the contents of documents located on nearly 4000 servers, world-wide”. 4
    • What for? General web search – coverage vs quality problem Vertical web search – newsbot / newscrawler – spambot – image-bot – bibliography-bot – *-bot 5
    • What for? (cont.) Focused crawling – Input: description of topic ● Driving query and/or examples – Can be done ... ● Off-line (vertical portals / vortals) ● On-line (on-demand crawling) Web Characterization / Analysis – On a large collection – On a small site: e.g. link validation Mirroring / Archiving 6
    • Crawler Taxonomy 7
    • Crawling algorithms 8
    • The Crawling Problem Several objectives – Prioritize the best pages – Use the network efficiently – Keep the collection fresh – Avoid overloading servers 9
    • [Figure: Web, search engine, and user; time scale: days/weeks/months] 10
    • Crawling policies (1) Selection policy – Visit more important pages (2) Re-visit policy – Keep repository fresh (3) Politeness policy – Avoid disrupting sites 11
    • 1. Selection policy Off-line limits – Max. number of hosts – Max. exploration depth – Max. number of pages – Per-host, per-domain limits – File format limits 12
    • Selection policy (cont.) On-line limits – Quality first! Techniques – BFS (good!) – BFS + something else (better!) Advanced stuff may or may not work – Avoid local “pockets” – Mix explore and exploit 13
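The BFS baseline and the off-line limits from the previous slide can be sketched as follows. This is a minimal illustration, not the presenter's crawler: the `LINKS` graph, the page names, and the `bfs_crawl` function are hypothetical stand-ins for real downloads and link extraction.

```python
from collections import deque

# Toy link graph standing in for the Web (hypothetical page names).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def bfs_crawl(seeds, max_depth, max_pages):
    """Return pages in BFS visit order, honoring depth and page limits."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        page, depth = frontier.popleft()
        visited.append(page)  # a real crawler would download the page here
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited

order = bfs_crawl(["seed"], max_depth=2, max_pages=10)
print(order)
```

A "BFS + something else" policy would replace the FIFO queue with a priority queue keyed on some quality estimate; that estimate is not specified in the slides.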
    • Evaluating On-line selection 14
    • Evaluating ... (cont.) 15
    • Focused crawling Particular instance of on-line sel. policy Exploits topical locality The big question: – How to infer quality BEFORE downloading? 16
    • 2. Re-visit policy Events [Fetterly et al. 2004] – 65% pages do not change in 10 weeks – 30% pages have only minor changes – 5% pages really change What should we optimize? 17
    • Normal operation: polling [Diagram: the search engine sends a request to the Web server; the server sends a response (headers, content)] 18
    • Cost functions Freshness, F(page) – 0 if local copy is DIFFERENT from current – 1 if local copy is EQUAL Age, A(page) – 0 if local copy is EQUAL – x if local copy has been DIFFERENT for the last x seconds Embarrassment – P(expired page appears in search results) 19
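The freshness and age definitions can be written down directly. The sketch below assumes a simplified model that is not in the slides: each page has a known list of change timestamps (in seconds), and the local copy reflects every change up to the time of the last sync. The function names are illustrative.

```python
def first_unseen_change(change_times, synced_at):
    """Earliest change that happened after the last sync, or None."""
    later = [t for t in change_times if t > synced_at]
    return min(later) if later else None

def freshness(change_times, synced_at, now):
    """1 if the local copy equals the current page at time `now`, else 0."""
    t = first_unseen_change(change_times, synced_at)
    return 0 if t is not None and t <= now else 1

def age(change_times, synced_at, now):
    """Seconds the local copy has been different from the current page."""
    t = first_unseen_change(change_times, synced_at)
    return 0 if t is None or t > now else now - t

changes = [100, 250]  # the page changed at t=100 and t=250
print(freshness(changes, synced_at=120, now=200), age(changes, 120, 200))
print(freshness(changes, synced_at=120, now=300), age(changes, 120, 300))
```

In practice the crawler cannot observe the change times directly, which is why the estimators mentioned later are needed.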
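    • Freshness and Age [Figure: freshness vs. time and age vs. time, with sync and update events marked] 20
    • Interrupt-based operation [Diagram: the Web server sends a notify to the search engine; the search engine sends a request; the server sends a response] 21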
    • Strategies Uniform policy Proportional policy – Correct, but avoid items that are updated too often There are estimators – Based on time unchanged / visit – Based on last-modification date ● Not always available 22
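As a rough illustration of the two policies, the sketch below splits a fixed budget of visits among pages. It assumes that the change rate of each page is already known (in reality it has to be estimated), and the page names, rates, and function names are hypothetical.

```python
def uniform_policy(change_rates, budget):
    """Give every page the same number of visits."""
    n = len(change_rates)
    return {page: budget / n for page in change_rates}

def proportional_policy(change_rates, budget):
    """Give each page visits in proportion to its change rate."""
    total = sum(change_rates.values())
    return {page: budget * rate / total for page, rate in change_rates.items()}

# Assumed change rates (changes per week) for three hypothetical pages.
rates = {"news": 6.0, "blog": 3.0, "about": 1.0}
print(uniform_policy(rates, budget=30))
print(proportional_policy(rates, budget=30))
```

The slide's caveat about items that are updated too often could be modeled by capping the rate before allocating; the cap value is not given in the source.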
    • Cooperation issues Objective: keep collection fresh Problems – Spamming: can't trust the web servers ● Can't rely on quality assertions about pages. – Privacy: can't trust the search engines ● Can't give them full access to everything. Technologies – Compression, Differences of content, Fingerprints – HTTP extensions, Web Services 23
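One of the listed technologies, fingerprints, can be sketched with a cryptographic hash: the crawler stores a short digest of each page and compares it on the next visit. SHA-1 and the function names here are assumptions for illustration; the slides do not name a specific hash.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Short digest that identifies the exact content of a page."""
    return hashlib.sha1(content).hexdigest()

def has_changed(stored_fingerprint: str, new_content: bytes) -> bool:
    """True if the newly fetched content differs from the stored copy."""
    return fingerprint(new_content) != stored_fingerprint

stored = fingerprint(b"<html>version 1</html>")
print(has_changed(stored, b"<html>version 1</html>"))
print(has_changed(stored, b"<html>version 2</html>"))
```

An exact hash only detects byte-level changes; detecting near-duplicates requires a different kind of fingerprint.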
    • Cooperation schemes (1 of 2) Between Web server and search engine: – Serve meta-data / Notify changes – Serve differences / Send differences – Serve if-modified / Send pages – Pipelining server / Send batches 24
    • Cooperation schemes (2 of 2) – Web service waiting for crawlers: FILTERING INTERFACE (Web server and search engine sync) – Small program running at the server: REMOTE AGENT (Web server and search engine sync) 25
    • Cost-benefit (polling) Columns: Network, Processing, Benefit – Serve meta-data: Normal – Serve differences: Normal – Serve only if-modified: Normal – Serve batches of pages: Normal – Filtering interface: High 26
    • Cost-benefit (interrupt) Columns: Network, Processing, Benefit – Send meta-data: Normal – Send differences: Normal – Send changed pages: Normal – Send batch update: Normal – Remote agent: High 27
    • Will websites cooperate? Large websites – Will cooperate if there is a benefit for them Websites that are customers of a search engine – Google sitemaps General websites – We can use to some extent the HTTP features such as pipelining or other future extensions 28
    • 3. Politeness Policy “Crawler Laws” 1) A robot must identify itself as such 2) A robot must not enter forbidden areas 3) A robot must not overload a server 29
    • Robot identification “Running a crawler (...) generates a fair amount of email and phone calls (...) there are always those who do not know what a crawler is, because this is the first one they have seen” [Brin and Page 1999] User-agent field – Include the URL of a web page Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) msnbot/1.0 (+http://search.msn.com/msnbot.htm) Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) 30
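A crawler identifies itself by setting the User-agent header on every request. The sketch below only builds the request and does not send it; the bot name `ExampleBot` and its URL are hypothetical placeholders in the style of the examples above.

```python
import urllib.request

# Hypothetical bot name and contact page, following the "+URL" convention.
USER_AGENT = "ExampleBot/0.1 (+http://www.example.com/bot.html)"

def build_request(url):
    """Create a request that announces the crawler in the User-agent field."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("http://www.example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```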
    • Robot exclusion protocol Server-wide exclusion – robots.txt file User-agent: * Disallow: /data/private Disallow: /cgi-bin Page-wise exclusion – META tags <meta name="robots" content="noindex,nofollow,nocache"> 31
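The robots.txt rules from this slide can be checked with Python's standard `urllib.robotparser` module. In this sketch the rules are parsed from a string instead of being fetched, and the agent name `ExampleBot` and the `www.example.com` URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The server-wide exclusion rules shown on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/private
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the parsed robots.txt lets `agent` fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/data/private/report.html"))
print(allowed("http://www.example.com/cgi-bin/search"))
```

Page-wise META tags are not covered by this module; they have to be read from the downloaded HTML.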
    • Avoid server overload Waiting time between accesses – 30 seconds? – 10 seconds? – 15 seconds? – 10xt seconds (t=download time of last page) robots.txt solution: Crawl-delay: 45 32
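One way to combine these options is sketched below: use the site's Crawl-delay when robots.txt provides one, otherwise wait 10 times the last download time, with a fixed floor. This combination, the 10-second floor, and the name `wait_time` are assumptions for illustration; the slide leaves the choice open.

```python
from urllib.robotparser import RobotFileParser

def wait_time(last_download_seconds, crawl_delay=None, factor=10, minimum=10.0):
    """Seconds to wait before the next access to the same server."""
    if crawl_delay is not None:
        return float(crawl_delay)  # the site stated its own preference
    return max(minimum, factor * last_download_seconds)

# Read the Crawl-delay directive from a robots.txt given as text.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 45", "Disallow: /cgi-bin"])
delay = parser.crawl_delay("ExampleBot")  # hypothetical agent name

print(wait_time(2.0))                     # slow page, no Crawl-delay
print(wait_time(0.2))                     # fast page, floor applies
print(wait_time(2.0, crawl_delay=delay))  # site-specified delay
```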
    • Architecture 33
    • General architecture 34
    • Storage 35
    • Two-phase scheduling 36
    • Short-term scheduling 37
    • Short-term scheduling P(i) page sizes B bandwidth B(i) bandwidth used by sites – B(i) = P(i) / T* T* optimal time – T* = sum(P(i)) / B 38
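A small worked example of the two formulas, T* = sum(P(i)) / B and B(i) = P(i) / T*. The sketch assumes that P(i) is the total size still to be downloaded from site i; the site names, sizes, and bandwidth are made-up values.

```python
def optimal_schedule(pending_sizes, bandwidth):
    """Return (T*, {site: B(i)}) for the ideal, fully used link."""
    total = sum(pending_sizes.values())
    t_star = total / bandwidth  # T* = sum(P(i)) / B
    per_site = {site: size / t_star for site, size in pending_sizes.items()}
    return t_star, per_site     # B(i) = P(i) / T*

# Hypothetical sizes in KB and a total bandwidth of 100 KB/s.
sizes = {"site-a": 600, "site-b": 300, "site-c": 100}
t_star, per_site = optimal_schedule(sizes, bandwidth=100)
print(t_star)    # seconds needed if the link is always full
print(per_site)  # KB/s assigned to each site
```

In this ideal case the per-site bandwidths add up to B and all sites finish at the same time T*.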
    • Optimal Scenario 39
    • The optimum is not possible Politeness policy In general B(i) << B Moreover, B(i) varies Quality of service varies 40
    • Realistic scenario 41
    • Practical issues 42
    • Hidden web Or “deep web” (databases) – Can be huge! – Require form-based interactions 43
    • Problems Network – costs – Variable QoS – Latency breaks short-term scheduling – Connect OK, write fails silently (firewall) DNS is a bottleneck – DNS servers break under pressure – Temporary DNS failures – Wrong DNS records (sometimes malicious) 44
    • More problems HTTP – “accept” headers not honored – Range errors ● e.g.: 0-100k in a 20k file – No headers – 'Found' when you mean 'Error' ● “soft” redirect – Dates ● If-modified-since, last-modified may be wrong 45
    • Even more problems HTML does not follow std. URLs – Sessionids – Repeated components Contents: – Many blogs and forums – Repeated templates – Duplicates, near-duplicates 46
    • Conclusions Openness. The Web's greatest advantage creates most of the problems: – Size – Variable quality – Co-existence of implementations There is an interesting general problem – Discover valuable elements based on partial information 47
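The session-id problem is commonly handled by normalizing URLs before checking whether a page was already seen. The sketch below strips a few query parameters that often carry session ids; the parameter list and the function name `normalize` are assumptions, and repeated path components are not handled.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of query parameters that carry session ids.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def normalize(url):
    """Lowercase the host, drop session-id parameters and the fragment."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(query), ""))

a = normalize("http://Example.com/forum/view?id=7&PHPSESSID=abc123")
b = normalize("http://example.com/forum/view?id=7&PHPSESSID=zzz999")
print(a)
print(a == b)
```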