Web Robots


 ISHAN MISHRA
www.IshanTech.org



Outline
   Robot applications
   How it works
   Cycle Avoidance




Applications
   Behavior of web robots
       Wander from web site to web site (recursively):
       1. Fetching content,
       2. Following hyperlinks,
       3. Processing the data they find.

   Colorful names
       Crawlers,
       Spiders,
       Worms,
       Bots


Where to Start: The “Root Set”


[Figure: link graph of web pages A through U, illustrating the choice of a root set]

Cycle Avoidance


[Figure: three panels showing a robot crawling the linked pages A, B, C, D, and E]
(a) Robot fetches page A, follows a link, and fetches B.
(b) Robot follows a link and fetches page C.
(c) Robot follows a link and is back at A.

Loops
   Cycles are bad for crawlers for three
    reasons:
       They waste the robot’s time and space.
       They can overwhelm the web site.
       They produce duplicate content.




Data structure for robot
   Trees and hash tables
   Lossy presence bit maps
   Checkpoints
       Save the list of visited URLs to disk, in case the
        robot crashes
   Partitioning
       Robot farms
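
The lossy presence bit map listed above can be pictured as a fixed-size bit array in which each visited URL sets one bit chosen by a hash. The following is a minimal sketch, not a production design: the class name `PresenceBitmap`, the default size, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


class PresenceBitmap:
    """Lossy record of visited URLs: one bit per hash slot (hypothetical helper)."""

    def __init__(self, num_bits=1 << 20):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _slot(self, url):
        # Map the URL to a bit position; different URLs may share a slot.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        slot = self._slot(url)
        self.bits[slot // 8] |= 1 << (slot % 8)

    def __contains__(self, url):
        slot = self._slot(url)
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))


seen = PresenceBitmap()
seen.add("http://www.example.com/index.html")
print("http://www.example.com/index.html" in seen)
```

The map is "lossy" because two URLs can hash to the same bit, in which case the robot wrongly believes it has already visited the second one. A larger bit array makes that less likely at the cost of memory.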


Canonicalizing URLs
       Most web robots try to eliminate the
        obvious aliases by “canonicalizing” URLs
        into a standard form, by:
         Adding “:80” to the hostname, if the port
          isn’t specified.
         Converting all %xx escaped characters into
          their character equivalents.
         Removing # tags (fragments).
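
The three canonicalization steps above can be sketched with Python's standard `urllib.parse`. This is a minimal illustration that assumes plain `http` URLs (so the default port is 80); the function name `canonicalize` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def canonicalize(url):
    """Return a standard form of an http URL (sketch; assumes the http scheme)."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # urlsplit lower-cases the hostname
    port = parts.port or 80              # step 1: add ":80" when no port is given
    path = unquote(parts.path) or "/"    # step 2: convert %xx escapes to characters
    # Step 3: drop the #fragment by passing an empty fragment component.
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))


print(canonicalize("http://www.foo.com/%7Efred/bar.html#top"))
```

Note that decoding every escape, as the slide describes, also turns `%2F` into `/`, which can change the meaning of a path. A more careful robot might leave reserved characters escaped.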


Symbolic link cycles

[Figure: two filesystem trees rooted at /, each containing index.html and subdir]
(a) subdir is a directory containing index.html and logo.gif
(b) subdir is an upward symbolic link

Dynamic Virtual Web Spaces
   It is possible to publish a URL that looks like a normal
    file but is really a gateway application.
   This application can generate HTML on the fly that
    contains links to imaginary URLs on the same server.
    When these imaginary URLs are requested, new imaginary
    URLs are generated.

   Such a malicious web server takes the poor robot on
    an Alice-in-Wonderland journey through an infinite virtual
    space, even if the web server doesn’t really contain any
    files. The robot may find this trap hard to detect,
    because the HTML and URLs may look different every
    time.

   For example, a CGI-based calendaring program.

Malicious dynamic web space example

Techniques for avoiding loops
   Canonicalizing URLs
   Breadth-first crawling
   Throttling
       Limit the number of pages the robot can fetch from a
        web site in a period of time.
   Limit URL size
       Avoids the symbolic link cycle problem.
       Problem: many sites use URLs to maintain user state.
   URL/site blacklist
       vs. “Excluding Robots” (robots.txt)
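
Several of the techniques above fit in one small crawl loop: a queue gives breadth-first order, a visited set breaks cycles, and simple caps stand in for throttling and the URL size limit. The sketch below is an illustration under stated assumptions: `get_links` is a hypothetical callback that returns the links found on a page, and the per-site cap counts pages per crawl rather than per period of time.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_breadth_first(roots, get_links, max_pages_per_site=100, max_url_length=1024):
    """Visit pages in breadth-first order, skipping cycles, long URLs, and busy sites."""
    visited = set()
    pages_per_site = {}
    queue = deque(roots)
    order = []
    while queue:
        url = queue.popleft()
        if url in visited or len(url) > max_url_length:
            continue  # cycle avoidance and URL size limit
        site = urlsplit(url).netloc
        if pages_per_site.get(site, 0) >= max_pages_per_site:
            continue  # crude throttle: cap on pages fetched per site
        visited.add(url)
        pages_per_site[site] = pages_per_site.get(site, 0) + 1
        order.append(url)
        queue.extend(get_links(url))
    return order


# A three-page cycle: A links to B, B links to C, C links back to A.
graph = {
    "http://x.example/A": ["http://x.example/B"],
    "http://x.example/B": ["http://x.example/C"],
    "http://x.example/C": ["http://x.example/A"],
}
print(crawl_breadth_first(["http://x.example/A"], lambda url: graph.get(url, [])))
```

Because the link back to A is already in the visited set, the loop ends after three pages instead of circling forever.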

Techniques for avoiding loops (cont.)
   Pattern detection
       e.g., “subdir/subdir/subdir…”
       e.g., “subdir/images/subdir/images/subdir/…”

   Content fingerprinting
       Compute a checksum of each page’s content; the odds of two
        different pages having the same checksum are small.
       Message digest functions such as MD5 are popular for this
        purpose.

   Human monitoring
       Design your robot with diagnostics and logging, so
        human beings can easily monitor the robot’s progress and be
        warned quickly if something unusual is happening.
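
Pattern detection and content fingerprinting can both be sketched in a few lines. The function and class names below are hypothetical, and the thresholds (a run of one or two path segments repeated three times) are assumptions chosen to match the two example patterns above.

```python
import hashlib


def has_repeating_segments(path, min_repeats=3, max_period=2):
    """True if a run of 1..max_period path segments repeats back to back."""
    segments = [s for s in path.split("/") if s]
    for period in range(1, max_period + 1):
        for start in range(len(segments)):
            repeats, i = 1, start
            # Compare each run of `period` segments with the run right after it.
            while segments[i:i + period] == segments[i + period:i + 2 * period]:
                repeats += 1
                i += period
            if repeats >= min_repeats:
                return True
    return False


class DuplicateDetector:
    """Remembers an MD5 digest of every page body seen so far."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, body):
        digest = hashlib.md5(body).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        return False


print(has_repeating_segments("/subdir/images/subdir/images/subdir/images/a.gif"))
```

Fingerprinting only catches byte-identical pages. A page that embeds the current time or a counter yields a different digest on every fetch, so this check complements the other techniques rather than replacing them.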
Robotic HTTP
   Robots are no different from any other HTTP client program.
   Many robots try to implement the minimum
    amount of HTTP needed to request the content
    they seek.

   It is recommended that robot implementers
    send some basic header information to notify
    the site of the capabilities of the robot, the robot’s
    identity, and where it originated.

Identifying Request Headers
   User-Agent
       Tells the server the robot’s name.
   From
       Tells the server the email address of the robot’s user/admin.
   Accept
       Tells the server what media types are okay to send
        (e.g., only fetch text and sound).
   Referer
       Tells the server how a robot found links to this site’s
        content.
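
As an illustration of the four headers above, the sketch below builds (but does not send) a request with Python's standard `urllib.request`. The robot name `ExampleBot/1.0`, the contact address, and the function name are placeholders invented for this example.

```python
from urllib.request import Request


def build_robot_request(url, referer=None):
    """Build a request that identifies the robot (all header values are placeholders)."""
    headers = {
        "User-Agent": "ExampleBot/1.0",      # the robot's name
        "From": "bot-admin@example.com",     # contact address of the robot's admin
        "Accept": "text/html, text/plain",   # media types the robot can process
    }
    if referer:
        headers["Referer"] = referer         # page on which the link was found
    return Request(url, headers=headers)


req = build_robot_request(
    "http://www.csie.ncnu.edu.tw/index.html",
    referer="http://www.ncnu.edu.tw/",
)
print(req.header_items())
```

When such a request is actually opened with `urllib.request.urlopen`, the library adds the `Host` header itself, which matters for virtually hosted sites.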


Virtual docroots cause trouble if no Host header is sent

[Figure: a web robot client sends a request to a server that hosts both www.ncnu.edu.tw and www.csie.ncnu.edu.tw]

Robot tries to request index.html from www.csie.ncnu.edu.tw, but does
not include a Host header.

Server is configured to serve both sites, but serves
www.ncnu.edu.tw by default.

Request message
GET /index.html HTTP/1.0
User-agent: ShopBot 1.0

Response message
HTTP/1.0 200 OK
[…]
<HTML>
<TITLE>National Chi Nan University</TITLE>
[…]

What else a robot should support
   Support Virtual Hosting
        Not including the Host header can lead to robots identifying the wrong content with
         a particular URL.

   Conditional Requests
        Minimize the amount of content retrieved by using conditional HTTP
         requests (like cache revalidation).

   Response Handling
        Status codes: 200 OK, 404 Not Found, 304 Not Modified
        Entities: <meta http-equiv="refresh" content="1; URL=index.html">

   User-Agent Targeting
        Webmasters should keep in mind that many robots will visit their site.
         Many sites optimize content for various user agents (Internet Explorer or Netscape).
        Problem: a robot may be served only “your browser does not support frames.”
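
Conditional requests and status handling can be sketched without touching the network. In this illustration, `cached` is a hypothetical dictionary holding the validators saved from an earlier response, and the action strings are made up for the example.

```python
def revalidation_headers(cached):
    """Build conditional request headers from validators saved earlier."""
    headers = {}
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    return headers


def next_action(status):
    """Decide what the robot does with a response status code."""
    if status == 200:
        return "process new body"
    if status == 304:
        return "reuse cached copy"   # nothing changed, no body was transferred
    if status == 404:
        return "drop URL"
    return "log and skip"


cached = {"last_modified": "Sat, 01 Jan 2000 00:00:00 GMT", "etag": '"abc123"'}
print(revalidation_headers(cached))
print(next_action(304))
```

A 304 response carries no body, which is where the saving in retrieved content comes from.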


Misbehaving Robots
   Runaway robots
       Robots issue HTTP requests as fast as they can.
   Stale URLs
       Robots visit old lists of URLs.
   Long, wrong URLs
       May reduce the web server’s performance, clutter the server’s access
        logs, or even crash the server.
   Nosy robots
       Some robots may get URLs that point to private data and make
        that data easily accessible through search engines.
   Dynamic gateway access
       Robots don’t always know what they are accessing.


Excluding Robots


[Figure: a robot fetches robots.txt from www.ncnu.edu.tw before requesting a page]

Robot parses the robots.txt file and
determines if it is allowed to access
the acetylene-torches.html file.

It is, so it proceeds with the request.

robots.txt format
   # Allow googlebot and csiebot to crawl the public parts
   # of our site, but no other robots are allowed to
   # crawl anything on our site.
   User-Agent: googlebot
   User-Agent: csiebot
   Disallow: /private

   User-Agent: *
   Disallow: /
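
A robot can evaluate such a file with Python's standard `urllib.robotparser`. The sketch below assumes the catch-all record ends with `Disallow: /`, which is what the comment's stated intent (no other robots may crawl anything) requires; an empty `Disallow:` would instead allow everything. The robot name `otherbot` and the page paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow googlebot and csiebot to crawl the public parts of the site.
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for agent, url in [
    ("csiebot", "http://www.ncnu.edu.tw/index.html"),
    ("csiebot", "http://www.ncnu.edu.tw/private/grades.html"),
    ("otherbot", "http://www.ncnu.edu.tw/index.html"),
]:
    print(agent, url, rules.can_fetch(agent, url))
```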
Robots Exclusion Standard versions

Version  Title and description                                      Date
0.0      A Standard for Robot Exclusion: Martijn Koster’s           June 1994
         original robots.txt mechanism with Disallow directive
1.0      A Method for Web Robots Control: Martijn Koster’s          Nov. 1996
         IETF draft with additional support for Allow
2.0      An Extended Standard for Robot Exclusion: Sean Conner’s    Nov. 1996
         extension including regex and timing information;
         not widely supported

Robots.txt path matching examples

Rule path          URL path           Match?   Comments
/tmp               /tmp               ✓        Rule path == URL path
/tmp               /tmpfile.html      ✓        Rule path is a prefix of URL path
/tmp               /tmp/a.html        ✓        Rule path is a prefix of URL path
/tmp/              /tmp               ✗        /tmp/ is not a prefix of /tmp
(empty)            /README.TXT        ✓        Empty rule path matches everything
/~fred/hi.html     /%7Efred/hi.html   ✓        %7E is treated the same as ~
/%7Efred/hi.html   /~fred/hi.html     ✓        %7E is treated the same as ~
/%7efred/hi.html   /%7Efred/hi.html   ✓        Case isn’t significant in escapes
/~fred/hi.html     /~fred%2Fhi.html   ✗        %2F is slash, but slash is a special case
                                               that must match exactly
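
The matching rules in the table can be expressed as a prefix test on normalized paths. The following is a sketch rather than a reference implementation: the function names are hypothetical, and the test cases are modeled on the table rows with a leading slash on every URL path.

```python
import re
from urllib.parse import unquote


def _normalize(path):
    """Decode every %xx escape except %2F, which must stay escaped."""
    parts = re.split(r"(?i)%2f", path)          # split on %2F or %2f
    return "%2F".join(unquote(part) for part in parts)


def rule_matches(rule_path, url_path):
    """True if the rule path is a prefix of the URL path after normalization."""
    return _normalize(url_path).startswith(_normalize(rule_path))


examples = [
    ("/tmp", "/tmpfile.html"),
    ("/tmp/", "/tmp"),
    ("/%7efred/hi.html", "/%7Efred/hi.html"),
    ("/~fred/hi.html", "/~fred%2Fhi.html"),
]
for rule, url in examples:
    print(rule, url, rule_matches(rule, url))
```

Decoding both sides before comparing is what makes `%7E`, `%7e`, and `~` equivalent, while keeping `%2F` escaped ensures an encoded slash never matches a literal one.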
HTML Robot-control Meta Tags
   e.g.
        <META NAME="ROBOTS" CONTENT=directive-list>

   Directive-list
        NOINDEX
             Do not process this document’s content
        NOFOLLOW
             Do not crawl any outgoing links from this page

        INDEX
        FOLLOW
        NOARCHIVE
             Do not cache a local copy of the page
        ALL (equivalent to INDEX, FOLLOW)
        NONE (equivalent to NOINDEX, NOFOLLOW)
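
A robot could read these directives with Python's standard `html.parser`. In the sketch below, the class and function names are hypothetical, and the assumption that a page without a ROBOTS tag permits everything (as if it said ALL) is ours.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().upper())


def robot_permissions(html):
    """Translate the directive list into what the robot may do with the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": not (found & {"NOINDEX", "NONE"}),
        "follow": not (found & {"NOFOLLOW", "NONE"}),
        "archive": "NOARCHIVE" not in found,
    }


print(robot_permissions('<META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE">'))
```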


Additional META tag directives

name=             content=       Description
DESCRIPTION       <text>         Allows an author to define a short text summary of the web
                                 page. Many search engines look at META DESCRIPTION
                                 tags, allowing page authors to specify appropriate short
                                 abstracts to describe their web pages.
                                 <meta name="description"
                                     content="Welcome to Mary’s Antiques web site">
KEYWORDS          <comma list>   Associates a comma-separated list of words that describe the
                                 web page, to assist in keyword searches.
                                 <meta name="keywords"
                                     content="antiques,mary,furniture,restoration">
REVISIT-AFTER*    <no. days>     Instructs the robot or search engine that the page should be
                                 revisited, presumably because it is subject to change, after the
                                 specified number of days.
                                 <meta name="revisit-after" content="10 days">

* This directive is not likely to have wide support.

Guidelines for web robot
operators (Robot Etiquette)




Modern Search Engine Architecture

[Figure: web search users, a web search gateway (query engine), a full-text index database, and a search engine crawler/indexer that crawls and indexes web servers]

Full-Text Index




Posting the Query
User fills out an HTML search form (with a GET action HTTP method)
on a site in the browser and hits Submit.

[Figure: the client sends the request to the search gateway at www.csie.ncnu.edu.tw; query: “drills”; results: file “BD.html”]

Request message
GET /search.html?query=drills HTTP/1.1
Host: www.csie.ncnu.edu.tw
Accept: */*
User-agent: ShopBot

Response message
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
[…]
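
The exchange above can be sketched from both sides: the client encodes the form field into the URL, and the gateway pulls the query term back out and looks it up. The tiny `INDEX` dictionary standing in for the full-text index database and both function names are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical stand-in for the full-text index: word -> documents containing it.
INDEX = {"drills": ["BD.html"]}


def build_search_url(base, terms):
    """Client side: encode the form field into a GET URL."""
    return f"{base}?{urlencode({'query': terms})}"


def handle_search(request_target, index):
    """Gateway side: extract the query and look each word up in the index."""
    query = parse_qs(urlsplit(request_target).query).get("query", [""])[0]
    results = []
    for word in query.lower().split():
        for document in index.get(word, []):
            if document not in results:
                results.append(document)
    return results


print(build_search_url("http://www.csie.ncnu.edu.tw/search.html", "drills"))
print(handle_search("/search.html?query=drills", INDEX))
```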
Reference (HW#4)
 Paper reading: “Searching the Web”
 Paper reading: “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
 http://www.searchtools.com
   Search Tools for Web Sites and Intranets: resources for search tools and
   robots.
 http://www.robotstxt.org/wc/robots.html
   The Web Robots Pages: resources for robot developers, including the
   registry of Internet Robots.
 http://www.searchengineworld.com
   Search Engine World: resource for search engines and robots.
 http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm
   RobotRules Perl source.
 http://www.conman.org/people/spc/robots2.html
   An Extended Standard for Robot Exclusion.
 Managing Gigabytes: Compressing and Indexing Documents and Images
   Witten, I., Moffat, A., and Bell, T., Morgan Kaufmann.

Brandon Minnick, MBA
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 

Recently uploaded (20)

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Artificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic WarfareArtificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic Warfare
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 

Introduction to "robots.txt"

  • 1. Web Robots. ISHAN MISHRA, www.IshanTech.org
  • 2. Outline: robot applications; how it works; cycle avoidance.
  • 3. Applications. Behavior of web robots: they wander from web site to web site (recursively), (1) fetching content, (2) following hyperlinks, and (3) processing the data they find. Colorful names: crawlers, spiders, worms, bots.
  • 4. Where to Start: The “Root Set” [Figure: a graph of linked pages labeled A through U.]
  • 5. Cycle Avoidance [Figure, three panels over pages A to E: (a) Robot fetches page A, follows link, fetches B; (b) Robot follows link and fetches page C; (c) Robot follows link and is back to A.]
  • 6. Loops. Cycles are bad for crawlers for three reasons: they waste the robot’s time and space; they can overwhelm the web site; they produce duplicate content.
  • 7. Data structures for robots. Trees and hash tables. Lossy presence bit maps. Checkpoints: save the list of visited URLs to disk, in case the robot crashes. Partitioning: robot farms.
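A lossy presence bit map can be sketched as follows. This is only an illustration, not the slides' own implementation: the class name `PresenceBitmap`, the default size, and the choice of MD5 to pick a bit are all assumptions. Each URL is hashed to one bit; two URLs that land on the same bit are indistinguishable, which is what makes the structure lossy.

```python
import hashlib


class PresenceBitmap:
    """Hypothetical lossy 'have I seen this URL?' structure: one bit per hash slot."""

    def __init__(self, n_bits: int = 1 << 20):
        self.n_bits = n_bits
        self.bits = bytearray((n_bits + 7) // 8)

    def _index(self, url: str) -> int:
        # Map the URL to a bit position using a stable digest.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, url: str) -> None:
        i = self._index(url)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, url: str) -> bool:
        i = self._index(url)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


seen = PresenceBitmap()
seen.add("http://www.example.com/index.html")
```

A membership test can return a false "already visited" when two URLs collide, so some pages may be skipped; it never reports a visited URL as unvisited.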
  • 8. Canonicalizing URLs. Most web robots try to eliminate the obvious aliases by “canonicalizing” URLs into a standard form, by: adding “:80” to the hostname if the port isn’t specified; converting all %xx escaped characters into their character equivalents; removing # tags (fragments).
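The three canonicalization steps can be sketched with Python's standard `urllib.parse`. The function name `canonicalize` and the `DEFAULT_PORTS` table are assumptions made for this example; lowercasing the hostname is an extra step not listed on the slide.

```python
from urllib.parse import unquote, urlsplit, urlunsplit

# Assumption: only these two schemes are handled; anything else falls back to 80.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Step 1: add the default port when none is given.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    # Step 2: convert %xx escapes in the path to their character equivalents.
    path = unquote(parts.path) or "/"
    # Step 3: drop the #fragment by passing an empty fragment.
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))
```

Decoding every escape follows the slide literally; a production robot would probably leave reserved characters such as %2F encoded, since decoding them changes the path structure. Any user:password part of the URL is dropped by this sketch.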
  • 9. Symbolic link cycles [Figure: two file trees containing index.html, subdir, and logo.gif. (a) subdir is a directory; (b) subdir is an upward symbolic link.]
  • 10. Dynamic Virtual Web Spaces. It is possible to publish a URL that looks like a normal file but really is a gateway application. This application can generate HTML on the fly that contains links to imaginary URLs on the same server. When these imaginary URLs are requested, new imaginary URLs are generated. Such a malicious web server takes the poor robot on an Alice-in-Wonderland journey through an infinite virtual space, even if the web server doesn’t really contain any files. The trap can be hard for the robot to detect, because the HTML and URLs may look different every time. Example: a CGI-based calendaring program.
  • 11. Malicious dynamic web space example
  • 12. Techniques for avoiding loops. Canonicalizing URLs. Breadth-first crawling. Throttling: limit the number of pages the robot can fetch from a web site in a period of time. Limit URL size: avoids the symbolic link cycle problem; problem: many sites use URLs to maintain user state. URL/site blacklist (vs. “Excluding Robots”).
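Breadth-first crawling, a per-site page cap, and a URL length limit can be combined in one loop. The sketch below makes several assumptions: `crawl`, `get_links`, `max_per_host`, and `max_url_len` are hypothetical names, `get_links` stands in for "fetch the page and extract its links", and the cap counts pages per host for the whole run rather than per period of time.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(root_set, get_links, max_per_host=100, max_url_len=1024):
    """Visit URLs breadth-first from root_set (a list); return them in visit order."""
    seen = set(root_set)       # URLs already queued, so cycles are not followed twice
    queue = deque(root_set)    # FIFO queue gives breadth-first order
    per_host = {}              # pages fetched so far from each host
    order = []
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue           # throttle: this site has had its share
        per_host[host] = per_host.get(host, 0) + 1
        order.append(url)
        for link in get_links(url):
            # Very long URLs are a common symptom of a cycle or a trap.
            if link not in seen and len(link) <= max_url_len:
                seen.add(link)
                queue.append(link)
    return order


# A tiny in-memory link graph containing a cycle (a -> b -> c -> a).
graph = {
    "http://a/": ["http://a/b", "http://a/c"],
    "http://a/b": ["http://a/c"],
    "http://a/c": ["http://a/"],
}
visited = crawl(["http://a/"], lambda u: graph.get(u, []))
```

Because the queue is FIFO, pages close to the root set are visited first, so a robot that wanders into a trap has already covered much of the rest of the site.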
  • 13. Techniques for avoiding loops. Pattern detection: e.g., “subdir/subdir/subdir…” or “subdir/images/subdir/images/subdir/…”. Content fingerprinting: compute a checksum of the page content; the odds of two different pages having the same checksum are small. Message digest functions such as MD5 are popular for this purpose. Human monitoring: design your robot with diagnostics and logging, so human beings can easily monitor the robot’s progress and be warned quickly if something unusual is happening.
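Pattern detection and content fingerprinting might look like the following. Both function names and the `min_repeats` threshold are assumptions for illustration; the fingerprint uses MD5 as the slide suggests.

```python
import hashlib


def has_repeating_pattern(path: str, min_repeats: int = 3) -> bool:
    """True if some run of path segments repeats min_repeats times in a row."""
    segs = [s for s in path.split("/") if s]
    n = len(segs)
    for period in range(1, n // min_repeats + 1):
        for start in range(0, n - period * min_repeats + 1):
            unit = segs[start:start + period]
            if all(
                segs[start + k * period:start + (k + 1) * period] == unit
                for k in range(min_repeats)
            ):
                return True
    return False


def fingerprint(body: bytes) -> str:
    """MD5 digest of the page body, used to spot content already seen."""
    return hashlib.md5(body).hexdigest()
```

A robot would typically keep the fingerprints of fetched pages in a set and skip link extraction for a page whose fingerprint is already present. Pages that embed a timestamp or counter defeat this check, since their bytes differ on every fetch.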
  • 14. Robotic HTTP. Robots are no different from any other HTTP client program. Many robots try to implement the minimum amount of HTTP needed to request the content they seek. It is recommended that robot implementers send some basic header information to notify the site of the robot’s capabilities, the robot’s identity, and where it originated.
  • 15. Identifying Request Headers. User-Agent: tells the server the robot’s name. From: gives the email address of the robot’s user/administrator. Accept: tells the server what media types are okay to send (e.g., only fetch text and sound). Referer: tells the server how the robot found links to this site’s content.
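As a sketch, the four identifying headers can be attached to a request with Python's standard `urllib.request.Request`. The robot name, contact address, and accepted media types below are placeholder assumptions, and nothing is sent over the network here.

```python
from urllib.request import Request


def robot_request(url, referer=None):
    """Build (but do not send) a request carrying the identifying headers."""
    headers = {
        "User-Agent": "ExampleBot/1.0",       # hypothetical robot name
        "From": "robot-admin@example.com",    # hypothetical contact address
        "Accept": "text/html, text/plain",    # media types the robot can process
    }
    if referer:
        # The page on which the robot found the link it is now following.
        headers["Referer"] = referer
    return Request(url, headers=headers)


req = robot_request("http://www.example.com/index.html",
                    referer="http://www.example.com/")
```

Passing `req` to `urllib.request.urlopen` would send it; urllib also adds the Host header at that point.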
  • 16. Virtual docroots cause trouble if no Host header is sent. The robot tries to request index.html from www.csie.ncnu.edu.tw, but does not include a Host header. The server is configured to serve both sites (www.ncnu.edu.tw and www.csie.ncnu.edu.tw), but serves www.ncnu.edu.tw by default. Request message: GET /index.html HTTP/1.0; User-agent: ShopBot 1.0. Response message: HTTP/1.0 200 OK […] <HTML> <TITLE>National Chi Nan University</TITLE> […]
  • 17. What else a robot should support. Virtual hosting: not sending a Host header can lead robots to associate the wrong content with a particular URL. Conditional requests: minimize the amount of content retrieved by using conditional HTTP requests (like cache revalidation). Response handling: status codes (200 OK, 404 Not Found, 304 Not Modified); entities, e.g., <meta http-equiv=“refresh” content=“1; URL=index.html”>. User-Agent targeting: webmasters should keep in mind that many robots will visit their site. Many sites optimize content for various user agents (e.g., Internet Explorer or Netscape). Problem: “your browser does not support frames.”
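A conditional request can be sketched by adding an If-Modified-Since header carrying the time of the robot's last fetch. The helper names and the small status table below are assumptions for this example; the request is built but not sent.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_fetch_epoch):
    """Ask for the page only if it changed since last_fetch_epoch (seconds since 1970)."""
    stamp = formatdate(last_fetch_epoch, usegmt=True)  # HTTP date format, in GMT
    return Request(url, headers={"If-Modified-Since": stamp})


def interpret_status(status):
    """Map the status codes named on the slide to what the robot should do."""
    if status == 200:
        return "reindex"      # fresh content was returned
    if status == 304:
        return "unchanged"    # keep the copy already indexed
    if status == 404:
        return "remove"       # page is gone; drop it from the index
    return "other"


req = conditional_request("http://www.example.com/index.html", 0)
```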
  • 18. Misbehaving Robots. Runaway robots: robots that issue HTTP requests as fast as they can. Stale URLs: robots visit old lists of URLs. Long, wrong URLs: may reduce the web server’s performance, clutter the server’s access logs, or even crash the server. Nosy robots: some robots may get URLs that point to private data and make that data easily accessible through search engines. Dynamic gateway access: robots don’t always know what they are accessing.
  • 19. Excluding Robots [Figure: a robot and the server www.ncnu.edu.tw.] The robot parses the robots.txt file and determines whether it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request.
  • 20. robots.txt format
        # Allow googlebot and csiebot to crawl the public parts of our site,
        # but no other robots are allowed to crawl anything on our site.
        User-Agent: googlebot
        User-Agent: csiebot
        Disallow: /private

        User-Agent: *
        Disallow: /
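One way to check such a file is Python's standard `urllib.robotparser`. In this sketch the file is supplied as a string rather than fetched, `build_parser` is a hypothetical helper, and the URLs are placeholders. The catch-all record uses `Disallow: /`, because an empty `Disallow:` would permit everything.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# googlebot and csiebot may crawl everything except /private;
# all other robots are excluded.
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
allowed = rules.can_fetch("csiebot", "http://www.example.com/index.html")
```

A robot would call `can_fetch` with its own User-Agent name before every request to the site.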
  • 21. Robots Exclusion Standard versions
        Version | Title and description | Date
        0.0 | A Standard for Robot Exclusion: Martijn Koster’s original robots.txt mechanism with Disallow directive | June 1994
        1.0 | A Method for Web Robots Control: Martijn Koster’s IETF draft with additional support for Allow | Nov. 1996
        2.0 | An Extended Standard for Robot Exclusion: Sean Conner’s extension including regex and timing information; not widely supported | Nov. 1996
  • 22. Robots.txt path matching examples
        Rule path | URL path | Match? | Comments
        /tmp | /tmp | yes | Rule path == URL path
        /tmp | /tmpfile.html | yes | Rule path is a prefix of URL path
        /tmp | /tmp/a.html | yes | Rule path is a prefix of URL path
        /tmp/ | /tmp | no | /tmp/ is not a prefix of /tmp
        (empty) | README.TXT | yes | Empty rule path matches everything
        /~fred/hi.html | %7Efred/hi.html | yes | %7E is treated the same as ~
        /%7Efred/hi.html | /~fred/hi.html | yes | %7E is treated the same as ~
        /%7efred/hi.html | /%7Efred/hi.html | yes | Case isn’t significant in escapes
        /~fred/hi.html | ~fred%2Fhi.html | no | %2F is slash, but slash is a special case that must match exactly
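The matching rules in this table amount to a prefix test after escapes are normalized. A minimal sketch, with hypothetical function names, decodes every %xx escape except %2F (the encoded slash), which is kept encoded so that it never matches a literal slash. The checks below use paths with a leading slash on both sides.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def _normalize(path: str) -> str:
    def decode(match):
        escape = match.group(0).upper()
        if escape == "%2F":
            return escape                  # keep the encoded slash as-is
        return chr(int(escape[1:], 16))    # e.g. %7E or %7e becomes ~
    return _ESCAPE.sub(decode, path)


def rule_matches(rule_path: str, url_path: str) -> bool:
    """True if the rule path is a prefix of the URL path after normalization."""
    return _normalize(url_path).startswith(_normalize(rule_path))
```

An empty rule path is a prefix of every string, which is why it matches everything.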
  • 23. HTML Robot-control Meta Tags. Form: <META NAME=“ROBOTS” CONTENT=directive-list>. Directives: NOINDEX (do not process this document’s content); NOFOLLOW (do not crawl any outgoing links from this page); INDEX; FOLLOW; NOARCHIVE (do not cache a local copy of the page); ALL (equivalent to INDEX, FOLLOW); NONE (equivalent to NOINDEX, NOFOLLOW).
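A robot can read these directives with Python's standard `html.parser`. The class and function names below are assumptions for this sketch, as is the default of treating a page with no ROBOTS tag as indexable and followable.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # the parser lowercases tag and attribute names
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().upper() for d in content.split(",") if d.strip()
            )


def robot_permissions(html: str):
    """Return (may_index, may_follow) for a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"NOINDEX", "NONE"} & found)
    may_follow = not ({"NOFOLLOW", "NONE"} & found)
    return may_index, may_follow
```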
  • 24. Additional META tag directives
        name= | content= | Description
        DESCRIPTION | <text> | Allows an author to define a short text summary of the web page. Many search engines look at META DESCRIPTION tags, allowing page authors to specify appropriate short abstracts to describe their web pages. Example: <meta name=“description” content=“Welcome to Mary’s Antiques web site”>
        KEYWORDS | <comma list> | Associates a comma-separated list of words that describe the web page, to assist in keyword searches. Example: <meta name=“keywords” content=“antiques,mary,furniture,restoration”>
        REVISIT-AFTER* | <no. days> | Instructs the robot or search engine that the page should be revisited, presumably because it is subject to change, after the specified number of days. Example: <meta name=“revisit-after” content=“10 days”>
        * This directive is not likely to have wide support.
  • 25. Guidelines for web robot operators (Robot Etiquette)
  • 26. Guidelines for web robot operators (cont.)
  • 27. Guidelines for web robot operators (cont.)
  • 28. Guidelines for web robot operators (cont.)
  • 29. Guidelines for web robot operators (cont.)
  • 30. Modern Search Engine Architecture [Figure; labels: web search users, web search gateway, query engine, full-text index database, search engine crawler/indexer, web servers. The two halves are captioned “Query engine” and “Crawling and indexing”.]
  • 32. Posting the Query. The user fills out an HTML search form (with a GET action HTTP method) on the site in a browser and hits Submit. Query: “drills”; results: file “BD.html”. Request message: GET /search.html?query=drills HTTP/1.1; Host: www.csie.ncnu.edu.tw; Accept: *; User-agent: ShopBot. Response message (from the search gateway): HTTP/1.1 200 OK; Content-type: text/html; Content-length: 1037; <HTML> <HEAD><TITLE>Search Results</TITLE> […]
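Submitting a GET form amounts to encoding the form fields into the query string of the request URL. A small sketch, where `search_url` is a hypothetical helper and the field name `query` and path `/search.html` are taken from the request message on this slide:

```python
from urllib.parse import urlencode, urlunsplit


def search_url(host: str, terms: str) -> str:
    """Build the URL a browser would request when the search form is submitted."""
    query = urlencode({"query": terms})   # escapes spaces and reserved characters
    return urlunsplit(("http", host, "/search.html", query, ""))


url = search_url("www.csie.ncnu.edu.tw", "drills")
```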
  • 33. Reference (HW#4)
        Paper reading: “Searching the Web”
        Paper reading: “Hyperlink analysis for the Web,” IEEE Internet Computing, 2001.
        http://www.searchtools.com Search Tools for Web Sites and Intranets: resources for search tools and robots.
        http://www.robotstxt.org/wc/robots.html The Web Robots Pages: resources for robot developers, including the registry of Internet Robots.
        http://www.searchengineworld.com Search Engine World: resource for search engines and robots.
        http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm RobotRules Perl source.
        http://www.conman.org/people/spc/robots2.html An Extended Standard for Robot Exclusion.
        Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann.