The robots.txt file informs search engine bots how to crawl and index a website. It is a plain text file placed in the root directory of the site, reachable at www.example.com/robots.txt. It allows website owners to block certain pages, such as login pages, search results and CSS files, from being indexed while still letting good bots crawl the rest of the site. Tools like Google Search Console and SEObook provide robots.txt generators and analyzers to help users create their robots.txt files and check them for errors.
Robots.txt File
What is robots.txt?

• The robots.txt is a simple text file on your website that informs search engine bots how to crawl and index the site or individual web pages.
• By default, search engine bots crawl everything they can reach unless they are forbidden from doing so. They always check the robots.txt file before crawling the website.
• Declaring a robots.txt tells visiting bots that they are not allowed to index sensitive data, but it doesn't mean that they can't. Legitimate/good bots follow what is instructed to them, but malware robots don't care about it, so don't try to use it as security for your website.

How to build a robots.txt file (Terms, Structure & Placement)?

The terms used in robots.txt, and how they fit together, are illustrated in the example below.
The robots.txt file is usually placed in the root folder of your website, so that the URL of the file resembles www.example.com/robots.txt in the web browser. Remember to use all lowercase letters for the filename.
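As a minimal sketch of those terms (the /login/ folder and help.html page are only illustrative placeholders, not paths from this article):

# Apply the rules below to every crawler
User-agent: *
# Do not crawl anything under the login area...
Disallow: /login/
# ...except this one public page inside it
Allow: /login/help.html

User-agent names the bot a group of rules applies to (* means all bots), Disallow lists paths that should not be crawled, and Allow makes an exception inside a disallowed path.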
You can define different restrictions for different bots by applying bot-specific rules, but be aware that the more complicated you make the file, the harder it becomes to spot its traps. Always specify bot-specific rules before the common rules, so that bots read the file to the end looking for rules specific to their names and otherwise follow the common rules (a sketch of this ordering follows at the end of this section). You can check ours and many other sites' robots.txt files to get a feel for how these are generally implemented:

http://www.searchenabler.com/robots.txt
http://www.google.com/robots.txt
http://searchengineland.com/robots.txt

Example scenarios for robots.txt

If you have a close look at the Search Enabler robots.txt, you will notice that we have blocked the following pages from search indexing. You can analyze which pages and links should be blocked on your own website; on a general note, we advise hiding pages such as your site's internal search results, user logins, profiles, logs and styling CSS sheets.

1. Disallow: /?s=
This is a dynamic search results page, and there is no point in indexing it; doing so only creates duplicate content problems.

2. Disallow: /blog/2010/
These are blog posts categorized by year. They are blocked because they lead to duplication errors, with different URLs pointing to the same web page.

3. Disallow: /login/
This is a login page meant only for users of the SearchEnabler tool, so it is blocked from being crawled.

How does robots.txt affect search results?

By using the robots.txt file, you can keep pages such as user profiles and other temp folders out of the index, so your SEO effort is not diluted by junk or pages that are useless in search results. In general, your results will be more precise and better valued.
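A minimal sketch of the bot-specific-rules-first layout mentioned at the top of this section, reusing the three paths blocked above (how a given search engine resolves competing groups can vary):

# Bot-specific group first: Googlebot follows only this group
User-agent: googlebot
Disallow: /login/

# Common group last: all other bots fall back to these rules
User-agent: *
Disallow: /?s=
Disallow: /blog/2010/
Disallow: /login/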
Default Robots.txt

A default robots.txt file basically tells every crawler that it is allowed to crawl any directory of the website to its heart's content:

User-agent: *
Disallow:

(which translates as "disallow nothing")

The often asked question here is why use it at all. Well, it is not required, but it is recommended for the simple reason that search bots will request it anyway (this means you'll see 404 errors in your log files from bots requesting your non-existent robots.txt page). Besides, having a default robots.txt ensures there won't be any misunderstandings between your site and a crawler.

Robots.txt Blocking Specific Folders / Content

The most common usage of robots.txt is to ban crawlers from visiting private folders or content that gives them no additional information. This is done primarily in order to save the crawler's time: bots crawl on a budget, and if you ensure that they don't waste time on unnecessary content, they will crawl your site deeper and quicker. Samples of robots.txt files blocking specific content (note: only a few of the most basic cases are highlighted):

User-agent: *
Disallow: /database/
(blocks all crawlers from the /database/ folder)

User-agent: *
Disallow: /*?
(blocks all crawlers from all URLs containing ?)

User-agent: *
Disallow: /navy/
Allow: /navy/about.html
(blocks all crawlers from the /navy/ folder but allows access to one page in that folder)

Note from John Mueller in the comments below: the "Allow:" statement is not part of the robots.txt standard (it is, however, supported by many search engines, including Google).
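If several of these restrictions are needed at once, they can also be combined into a single group rather than repeating User-agent: * for each case; a sketch using the same example paths:

User-agent: *
# Keep crawlers out of the private folder
Disallow: /database/
# Block every URL containing a query string
Disallow: /*?
# Block the /navy/ folder but leave one page reachable
Disallow: /navy/
Allow: /navy/about.html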
Robots.txt Allowing Access to Specific Crawlers

Some people choose to save bandwidth and allow access only to those crawlers they care about (e.g. Google, Yahoo and MSN). In this case, the robots.txt file should block everyone first and then list those robots, each followed by the command itself:

User-agent: *
Disallow: /

User-agent: googlebot
Disallow:

User-agent: slurp
Disallow:

User-agent: msnbot
Disallow:

(the first part blocks all crawlers from everything, while the following three blocks list the three crawlers that are allowed to access the whole site)

Need Advanced Robots.txt Usage?

I tend to recommend that people refrain from doing anything too tricky in their robots.txt file unless they are 100% knowledgeable on the topic. A messed-up robots.txt file can derail a project launch. Many people spend weeks and months trying to figure out why their site is ignored by crawlers until they realize (often with some external help) that they have misused their robots.txt file. A better solution for controlling crawler activity may be to get away with on-page solutions (robots meta tags); a minimal example follows below. Aaron did a great job summing up the difference in his guide (bottom of the page).
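For comparison, a minimal sketch of a robots meta tag (the exact directives you need depend on the page): it goes into the <head> of an individual page rather than into robots.txt, and controls indexing for that page only:

<head>
  <!-- Ask compliant crawlers not to index this page while still following its links -->
  <meta name="robots" content="noindex, follow">
</head>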
Best Robots.txt Tools: Generators and Analyzers

While I do not encourage anyone to rely too much on robots.txt tools (you should either do your best to understand the syntax yourself or turn to an experienced consultant to avoid any issues), the robots.txt generators and checkers I am listing below will hopefully be of additional help.

Robots.txt generators

Common procedure:
1. choose default / global commands (e.g. allow/disallow all robots);
2. choose files or directories blocked for all robots;
3. choose user-agent specific commands:
   1. choose an action;
   2. choose a specific robot to be blocked.
(A sample of such generated output is shown after the tool list below.)

As a general rule of thumb, I don't recommend using robots.txt generators, for a simple reason: don't create any advanced (i.e. non-default) robots.txt file until you are 100% sure you understand what you are blocking with it. But still, I am listing the two most trustworthy generators to check:

• Google Webmaster Tools: the Robots.txt generator allows you to create simple robots.txt files. What I like most about this tool is that it automatically adds all global commands to each specific user agent's commands (thus helping to avoid one of the most common mistakes).
• The SEObook Robots.txt generator unfortunately misses the above feature, but it is really easy (and fun) to use.
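To illustrate the common procedure above, the output of such a generator might look roughly like this (the blocked directories and the bot name are arbitrary examples, not produced by either tool):

# Steps 1-2: global commands and directories blocked for all robots
User-agent: *
Disallow: /temp/

# Step 3: user-agent specific commands; the global block is repeated in this
# group so msnbot still obeys it (the mistake the Google tool helps avoid)
User-agent: msnbot
Disallow: /temp/
Disallow: /drafts/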
Robots.txt checkers:

• Google Webmaster Tools: the Robots.txt analyzer "translates" what your robots.txt dictates to Googlebot.
• The Robots.txt Syntax Checker finds some common errors within your file by checking for whitespace-separated lists, standards that are not widely supported, wildcard usage, etc.
• A Validator for Robots.txt Files also checks for syntax errors and confirms correct directory paths.
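As an example of the kind of mistake such checkers catch (a hypothetical snippet, not taken from either tool's documentation), several paths listed on one line are not valid; each path needs its own Disallow line:

# Invalid: a whitespace-separated list of paths on a single Disallow line
User-agent: *
Disallow: /temp/ /logs/

# Correct: one path per Disallow line
User-agent: *
Disallow: /temp/
Disallow: /logs/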