Introduction to "robots.txt

1. Web Robots
   Ishan Mishra, www.IshanTech.org
2. Outline
   - Robot applications
   - How it works
   - Cycle avoidance
3. Applications
   - Behavior of web robots: they wander from web site to web site (recursively),
     1. fetching content,
     2. following hyperlinks, and
     3. processing the data they find.
   - Colorful names: crawlers, spiders, worms, bots
4. Where to Start: The "Root Set"
   [Figure: a graph of linked web pages, A through U, illustrating how the
   choice of a root set of starting pages determines which pages a crawl
   can reach]
5. Cycle Avoidance
   [Figure: (a) the robot fetches page A and follows a link to fetch B;
   (b) the robot follows a link and fetches page C; (c) the robot follows
   a link and is back at A]
6. Loops
   Cycles are bad for crawlers for three reasons:
   - They waste the robot's time and space.
   - They can overwhelm the web site.
   - They produce duplicate content.
7. Data Structures for Robots
   - Trees and hash tables
   - Lossy presence bit maps
   - Checkpoints: save the list of visited URLs to disk, in case the robot
     crashes (see the sketch after this list)
   - Partitioning: robot farms
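   A minimal sketch in Python of the visited-URL bookkeeping described above,
   assuming a plain hash set plus a pickle-based checkpoint file; the class
   and method names are illustrative, not from the slides:

      import pickle

      class VisitedSet:
          """Hash set of canonical URLs the robot has already fetched."""

          def __init__(self):
              self.urls = set()

          def add(self, url):
              self.urls.add(url)

          def __contains__(self, url):
              return url in self.urls

          def checkpoint(self, path):
              # Persist the visited list so a crashed robot can resume later.
              with open(path, "wb") as f:
                  pickle.dump(self.urls, f)

          @classmethod
          def restore(cls, path):
              # Reload the list saved by checkpoint().
              vs = cls()
              with open(path, "rb") as f:
                  vs.urls = pickle.load(f)
              return vs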
8. Canonicalizing URLs
   Most web robots try to eliminate the obvious aliases by "canonicalizing"
   URLs into a standard form, by:
   - adding ":80" to the hostname, if the port isn't specified;
   - converting all %xx escaped characters into their character equivalents;
   - removing # fragments.
   A sketch of such a normalizer follows.
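   A minimal sketch of that normalization, assuming Python's standard
   urllib.parse; a production canonicalizer must be more careful (for
   example, decoding %2F changes the path structure):

      from urllib.parse import urlsplit, urlunsplit, unquote

      def canonicalize(url):
          parts = urlsplit(url)
          host = (parts.hostname or "").lower()  # hostnames are case-insensitive
          port = parts.port or 80                # add ":80" if no port is given
          path = unquote(parts.path) or "/"      # convert %xx escapes
          return urlunsplit((parts.scheme or "http", "%s:%d" % (host, port),
                             path, parts.query, ""))  # drop the #fragment

      print(canonicalize("HTTP://Example.com/%7Efred/index.html#top"))
      # -> http://example.com:80/~fred/index.html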
9. Symbolic Link Cycles
   [Figure: two filesystems serving / with index.html and subdir; in (a)
   subdir is a directory containing index.html and logo.gif, while in (b)
   subdir is an upward symbolic link, so /subdir/index.html leads back to
   /index.html in an endless cycle]
10. Dynamic Virtual Web Spaces
    It is possible to publish a URL that looks like a normal file but really
    is a gateway application. This application can generate HTML on the fly
    that contains links to imaginary URLs on the same server; when these
    imaginary URLs are requested, new imaginary URLs are generated. Such a
    malicious web server takes the poor robot on an Alice-in-Wonderland
    journey through an infinite virtual space, even if the server doesn't
    really contain any files. This trap can be hard for the robot to detect,
    because the HTML and the URLs may look different every time. An example
    is a CGI-based calendaring program.
11. Malicious Dynamic Web Space Example
    [Figure: an example of such a malicious dynamic web space, a gateway
    that keeps generating links to imaginary URLs]
12. Techniques for Avoiding Loops
    - Canonicalizing URLs
    - Breadth-first crawling
    - Throttling: limit the number of pages the robot can fetch from a web
      site in a period of time (see the sketch after this list)
    - Limiting URL size: avoids the symbolic-link cycle problem, but many
      sites use long URLs to maintain user state
    - URL/site blacklists (vs. "excluding robots", below)
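    A minimal throttling sketch, assuming Python; the per-site budget of 10
    pages per 60 seconds is an illustrative policy, not one from the slides:

       import time
       from collections import defaultdict
       from urllib.parse import urlsplit

       MAX_FETCHES = 10   # hypothetical budget: pages per site per window
       WINDOW = 60.0      # seconds

       fetch_log = defaultdict(list)   # host -> timestamps of recent fetches

       def allowed_to_fetch(url):
           host = urlsplit(url).hostname
           now = time.time()
           # Keep only the fetches inside the current window.
           fetch_log[host] = [t for t in fetch_log[host] if now - t < WINDOW]
           if len(fetch_log[host]) >= MAX_FETCHES:
               return False   # over budget: back off from this site for a while
           fetch_log[host].append(now)
           return True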
13. Techniques for Avoiding Loops (cont.)
    - Pattern detection
      - e.g., "subdir/subdir/subdir..."
      - e.g., "subdir/images/subdir/images/subdir/..."
    - Content fingerprinting: a checksum concept; the odds of two different
      pages having the same checksum are small. Message digest functions
      such as MD5 are popular for this purpose (see the sketch after this
      list).
    - Human monitoring: design your robot with diagnostics and logging, so
      human beings can easily monitor the robot's progress and be warned
      quickly if something unusual is happening.
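    A minimal content-fingerprinting sketch, assuming Python's standard
    hashlib; the function name is illustrative:

       import hashlib

       seen_digests = set()

       def is_duplicate_content(page_bytes):
           # Two different pages almost never share an MD5 digest, so a
           # repeated digest is treated as already-seen content.
           digest = hashlib.md5(page_bytes).hexdigest()
           if digest in seen_digests:
               return True
           seen_digests.add(digest)
           return False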
14. Robotic HTTP
    - Robots are no different from any other HTTP client program.
    - Many robots try to implement the minimum amount of HTTP needed to
      request the content they seek.
    - It is recommended that robot implementers send some basic header
      information to notify the site of the robot's capabilities, the
      robot's identity, and where it originated.
15. Identifying Request Headers
    - User-Agent: tells the server the robot's name
    - From: gives the email address of the robot's user/administrator
    - Accept: tells the server what media types are okay to send
      (e.g., only fetch text and sound)
    - Referer: tells the server how a robot found links to this site's
      content
    A sketch of a request carrying these headers follows.
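    A minimal sketch of sending these identifying headers, assuming Python's
    standard urllib.request; the bot name, email address, and URLs are
    placeholders:

       import urllib.request

       req = urllib.request.Request(
           "http://www.example.com/index.html",
           headers={
               "User-Agent": "ExampleBot/1.0",        # the robot's name
               "From": "bot-admin@example.com",       # the operator's email
               "Accept": "text/html, text/plain",     # media types we process
               "Referer": "http://www.example.com/",  # page where we found the link
           },
       )
       with urllib.request.urlopen(req) as resp:
           body = resp.read()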
16. Virtual Docroots Cause Trouble If No Host Header Is Sent
    The robot tries to request index.html from www.csie.ncnu.edu.tw, but
    does not include a Host header. The server is configured to serve both
    sites, and serves www.ncnu.edu.tw by default.

    Request message:
      GET /index.html HTTP/1.0
      User-agent: ShopBot 1.0

    Response message:
      HTTP/1.0 200 OK
      [...]
      <HTML>
      <TITLE>National Chi Nan University</TITLE>
      [...]
17. What Else a Robot Should Support
    - Virtual hosting: not including the Host header can lead to robots
      associating the wrong content with a particular URL.
    - Conditional requests: minimize the amount of content retrieved by
      making conditional HTTP requests (like cache revalidation); see the
      sketch after this list.
    - Response handling
      - Status codes: 200 OK, 404 Not Found, 304 Not Modified
      - Entities: <meta http-equiv="refresh" content="1; URL=index.html">
    - User-agent targeting: webmasters should keep in mind that many robots
      will visit their site. Many sites optimize content for particular user
      agents (e.g., IE or Netscape), which leads to problems like
      "your browser does not support frames."
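    A minimal sketch combining the first two points, assuming Python's
    standard http.client: it sends an explicit Host header (for virtual
    hosting) and an If-Modified-Since header (a conditional request), then
    handles a 304. The hostname and date are placeholders:

       import http.client

       conn = http.client.HTTPConnection("www.example.com", 80)
       conn.request("GET", "/index.html", headers={
           "Host": "www.example.com",   # disambiguates virtually hosted sites
           "User-Agent": "ExampleBot/1.0",
           "If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT",
       })
       resp = conn.getresponse()
       if resp.status == 304:
           pass                  # Not Modified: our cached copy is still fresh
       elif resp.status == 200:
           body = resp.read()    # fresh content
       conn.close()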
18. Misbehaving Robots
    - Runaway robots: robots that issue HTTP requests as fast as they can.
    - Stale URLs: robots that visit old lists of URLs.
    - Long, wrong URLs: may reduce a web server's performance, clutter the
      server's access logs, or even crash the server.
    - Nosy robots: some robots may get URLs that point to private data and
      make that data easily accessible through search engines.
    - Dynamic gateway access: robots don't always know what they are
      accessing.
19. Excluding Robots
    [Figure: the robot fetches robots.txt from www.ncnu.edu.tw and parses it
    to determine whether it is allowed to access the acetylene-torches.html
    file. It is, so it proceeds with the request.]
20. robots.txt Format
    # Allow googlebot and csiebot to crawl the public parts of our site,
    # but no other robots are allowed to crawl anything of our site.
    User-Agent: googlebot
    User-Agent: csiebot
    Disallow: /private

    User-Agent: *
    Disallow: /
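    A minimal sketch of checking these rules, assuming Python's standard
    urllib.robotparser; the example.com URLs are placeholders:

       import urllib.robotparser

       rules = """\
       User-Agent: googlebot
       User-Agent: csiebot
       Disallow: /private

       User-Agent: *
       Disallow: /
       """.splitlines()

       rp = urllib.robotparser.RobotFileParser()
       rp.parse(rules)   # use set_url()/read() to fetch a live robots.txt instead

       print(rp.can_fetch("googlebot", "http://www.example.com/index.html"))  # True
       print(rp.can_fetch("googlebot", "http://www.example.com/private/x"))   # False
       print(rp.can_fetch("OtherBot", "http://www.example.com/index.html"))   # False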
21. Robots Exclusion Standard Versions
    Version  Title and description                                      Date
    0.0      A Standard for Robot Exclusion: Martijn Koster's original  June 1994
             robots.txt mechanism, with the Disallow directive
    1.0      A Method for Web Robots Control: Martijn Koster's IETF     Nov. 1996
             draft, with additional support for Allow
    2.0      An Extended Standard for Robot Exclusion: Sean Conner's    Nov. 1996
             extension including regex and timing information; not
             widely supported
22. robots.txt Path-Matching Examples
    Rule path         URL path           Match?  Comments
    /tmp              /tmp               yes     Rule path == URL path
    /tmp              /tmpfile.html      yes     Rule path is a prefix of URL path
    /tmp              /tmp/a.html        yes     Rule path is a prefix of URL path
    /tmp/             /tmp               no      /tmp/ is not a prefix of /tmp
    (empty)           README.TXT         yes     Empty rule path matches everything
    /~fred/hi.html    /%7Efred/hi.html   yes     %7E is treated the same as ~
    /%7Efred/hi.html  /~fred/hi.html     yes     %7E is treated the same as ~
    /%7efred/hi.html  /%7Efred/hi.html   yes     Case isn't significant in escapes
    /~fred/hi.html    /~fred%2Fhi.html   no      %2F is a slash, but a slash is a
                                                 special case that must match exactly
    A sketch of this matching logic follows.
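    A minimal sketch of the matching rule above, assuming Python; it decodes
    %xx escapes case-insensitively but keeps %2F literal, as the last row of
    the table requires:

       from urllib.parse import unquote

       def _norm(path):
           # Decode every %xx escape except %2F (an escaped slash), which
           # must match a literal "%2F" rather than a path separator.
           pieces = path.replace("%2f", "%2F").split("%2F")
           return "%2F".join(unquote(p) for p in pieces)

       def rule_matches(rule_path, url_path):
           # An empty rule path matches everything; otherwise prefix match.
           return _norm(url_path).startswith(_norm(rule_path))

       print(rule_matches("/tmp", "/tmp/a.html"))                 # True
       print(rule_matches("/tmp/", "/tmp"))                       # False
       print(rule_matches("/%7efred/hi.html", "/~fred/hi.html"))  # True
       print(rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"))  # False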
23. HTML Robot-Control META Tags
    e.g., <META NAME="ROBOTS" CONTENT=directive-list>
    Directive list:
    - NOINDEX: do not process (index) this document's content
    - NOFOLLOW: do not crawl any outgoing links from this page
    - INDEX
    - FOLLOW
    - NOARCHIVE: do not cache a local copy of the page
    - ALL (equivalent to INDEX, FOLLOW)
    - NONE (equivalent to NOINDEX, NOFOLLOW)
    A parsing sketch follows this list.
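    A minimal sketch of extracting these directives, assuming Python's
    standard html.parser; the class name is illustrative:

       from html.parser import HTMLParser

       class RobotsMetaParser(HTMLParser):
           """Collects directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

           def __init__(self):
               super().__init__()
               self.directives = set()

           def handle_starttag(self, tag, attrs):
               a = dict(attrs)   # attribute names arrive lowercased
               if tag == "meta" and a.get("name", "").lower() == "robots":
                   for d in a.get("content", "").split(","):
                       self.directives.add(d.strip().upper())

       p = RobotsMetaParser()
       p.feed('<meta name="robots" content="noindex,nofollow">')
       print(p.directives)   # {'NOINDEX', 'NOFOLLOW'} (set order may vary)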
24. Additional META Tag Directives
    name=           content=       Description
    DESCRIPTION     <text>         Allows an author to define a short text
                                   summary of the web page. Many search
                                   engines look at META DESCRIPTION tags,
                                   allowing page authors to specify short
                                   abstracts describing their pages.
                                   <meta name="description" content="Welcome
                                   to Mary's Antiques web site">
    KEYWORDS        <comma list>   Associates a comma-separated list of words
                                   that describe the web page, to assist in
                                   keyword searches.
                                   <meta name="keywords" content="antiques,
                                   mary,furniture,restoration">
    REVISIT-AFTER*  <no. days>     Instructs the robot or search engine that
                                   the page should be revisited, presumably
                                   because it is subject to change, after the
                                   specified number of days.
                                   <meta name="revisit-after" content="10 days">

    * This directive is not likely to have wide support.
25. Guidelines for Web Robot Operators (Robot Etiquette)
    [Slides 25-29: the guideline text was not captured in this transcript.]
30. Modern Search Engine Architecture
    [Figure: web search users post queries through a web search gateway to
    the query engine, which looks up results in a full-text index database;
    the search engine's crawler/indexer builds that database by crawling web
    servers. The left half is query processing; the right half is crawling
    and indexing.]
31. Full-Text Index
    [Figure: a full-text index, mapping each word to the documents that
    contain it]
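    A toy inverted-index sketch in Python; the documents and their contents
    are made up for illustration:

       from collections import defaultdict

       docs = {
           "A.html": "the best drills for the money",
           "B.html": "big drills on sale today",
       }

       index = defaultdict(set)   # word -> documents containing it
       for doc_id, text in docs.items():
           for word in text.lower().split():
               index[word].add(doc_id)

       print(sorted(index["drills"]))   # ['A.html', 'B.html']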
32. Posting the Query
    The user fills out an HTML search form (with a GET action HTTP method) in
    the browser and hits Submit; the gateway returns the results (here, the
    query "drills" matches the file "BD.html").

    Request message:
      GET /search.html?query=drills HTTP/1.1
      Host: www.csie.ncnu.edu.tw
      Accept: *
      User-agent: ShopBot

    Response message:
      HTTP/1.1 200 OK
      Content-type: text/html
      Content-length: 1037

      <HTML>
      <HEAD><TITLE>Search Results</TITLE>
      [...]
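    A minimal sketch of building that query URL, assuming Python's standard
    urllib.parse; the gateway path and parameter name mirror the example
    above:

       from urllib.parse import urlencode

       base = "http://www.csie.ncnu.edu.tw/search.html"
       url = base + "?" + urlencode({"query": "drills"})
       print(url)   # http://www.csie.ncnu.edu.tw/search.html?query=drills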
33. References (HW#4)
    - Paper reading: "Searching the Web."
    - Paper reading: "Hyperlink Analysis for the Web," IEEE Internet
      Computing, 2001.
    - http://www.searchtools.com: Search Tools for Web Sites and Intranets,
      resources for search tools and robots.
    - http://www.robotstxt.org/wc/robots.html: The Web Robots Pages,
      resources for robot developers, including the registry of Internet
      Robots.
    - http://www.searchengineworld.com: Search Engine World, a resource for
      search engines and robots.
    - http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm:
      RobotRules Perl source.
    - http://www.conman.org/people/spc/robots2.html: An Extended Standard
      for Robot Exclusion.
    - Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing
      and Indexing Documents and Images, Morgan Kaufmann.
