Developing apps for humans & robots


This slide deck covers robots.txt and how web robots crawl websites.

  1. 1. Developing Web Applications for Humans and Robots --- Nagaraju Sangam
  2. 2. Humans: Humans have feelings, habits, languages (character encoding, left-to-right vs. right-to-left scripts), cultures, time zones, user roles (admin, end user), and impairments (visual, hearing, motor, cognitive).
  3. 3. Humans: •Provide alt and title for images •Keep alt empty for unimportant images •Use role for sections •Use for to link each label to its field •Give frames titles •Allow keyboard navigation
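The image and frame points on this slide can be checked mechanically. The following is a minimal sketch using Python's standard `html.parser`; the class name, function name, and sample markup are illustrative assumptions, not part of the original deck. It flags images with no `alt` attribute at all (an empty `alt=""` is accepted, as the slide recommends for unimportant images) and frames without a `title`.

```python
from html.parser import HTMLParser


class AccessibilityChecker(HTMLParser):
    """Collects images without an alt attribute and frames without a title."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attr_names = {name for name, _ in attrs}
        # alt="" is fine for decorative images; a missing alt attribute is not.
        if tag == "img" and "alt" not in attr_names:
            self.issues.append("img missing alt")
        if tag in ("frame", "iframe") and "title" not in attr_names:
            self.issues.append(f"{tag} missing title")


def check_accessibility(html):
    """Return a list of issue strings found in the given HTML text."""
    checker = AccessibilityChecker()
    checker.feed(html)
    checker.close()
    return checker.issues


# Hypothetical sample markup for illustration only.
sample = (
    '<img src="logo.png" alt="Company logo">'
    '<img src="spacer.gif" alt="">'
    '<img src="chart.png">'
    '<iframe src="map.html"></iframe>'
)
print(check_accessibility(sample))
```

A check like this only covers the presence of attributes; whether the alt text is actually meaningful still needs a human reviewer.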
  4. 4. Web Robots: Web Robots : Programs that traverse the Web automatically.  Web Wanderers  Crawlers  Spiders Good Robots : indexing/crawling Eg: •Googlebot •Bingbot •Msnbot Bad Robots: Spam : Tries to read confidential info from the pages, access private folders… Email ids, Phone numbers etc.
  5. 5. Problems with Good Robots: Crawls everything… Scripts CSS Resources Images Multiple versions of the pages Un-related pages Private folders etc…
  6. 6. Problems with Good Robots: Solution •Add a robots.txt file to the root folder of your site. •You should be able to browse the file at http://yourdomain/robots.txt •Put the rules below in robots.txt; they ask all bots not to crawl any part of your site:
        User-agent: *
        Disallow: /
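To see how a well-behaved robot would read the disallow-all rule, here is a small sketch using Python's standard `urllib.robotparser`. The rules are parsed from a string rather than fetched, and the bot names and page paths are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# The disallow-all rules from the slide, parsed from a string (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Both calls print False: every robot is asked to stay away from every path.
print(parser.can_fetch("Googlebot", "http://yourdomain/index.html"))
print(parser.can_fetch("AnyOtherBot", "http://yourdomain/private/report.html"))
```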
  7. 7. Problems with Good Robots: Solution (robots.txt with separate rules per robot)
        User-agent: Googlebot
        Disallow: /scripts
        Disallow: /styles
        Disallow: /*.PDF$
        User-agent: Bingbot
        Disallow: /scripts
        Disallow: /styles
        Disallow: /*.PDF$
        User-agent: Yandex
        Disallow: /scripts
        Disallow: /styles
        Disallow: /*.PDF$
        User-agent: *
        Disallow: /
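The per-robot groups on this slide can be exercised the same way. This sketch assumes Python's standard `urllib.robotparser`, which matches rule paths by simple prefix, so the wildcard rule `/*.PDF$` is left out here; only the Googlebot group and the catch-all group are shown, and the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the slide's file: one named group plus the catch-all.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /scripts
Disallow: /styles

User-agent: *
Disallow: /
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the named robot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://yourdomain" + path)


print(allowed("Googlebot", "/index.html"))      # named group applies, path not listed
print(allowed("Googlebot", "/scripts/app.js"))  # blocked by Disallow: /scripts
print(allowed("UnknownBot", "/index.html"))     # falls through to the catch-all group
```

A robot that has its own group follows only that group; the `User-agent: *` group applies to robots that are not named elsewhere.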
  8. 8. Dealing with Bad Robots: robots.txt is not a security feature. It does not prevent bad robots from crawling your content; it is only a guideline, and it is up to each robot whether to follow it. To block bad robots, set up rules in your firewall.
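Firewall syntax varies by product, so the following is only an application-level sketch of the kind of rule the slide has in mind: reject a request when the client address falls in a blocked network or the user agent contains a blocked string. The blocklist entries, names, and the `is_blocked` helper are all hypothetical; the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blocklists; real entries would come from your own logs or policy.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["badbot", "emailharvester"]


def is_blocked(client_ip, user_agent):
    """Return True if the request should be rejected."""
    ip = ipaddress.ip_address(client_ip)
    if any(ip in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS)


print(is_blocked("203.0.113.7", "Mozilla/5.0"))
print(is_blocked("198.51.100.4", "BadBot/1.0"))
print(is_blocked("198.51.100.4", "Mozilla/5.0"))
```

User-agent strings are easy to fake, so address-based rules tend to be the more dependable half of such a check.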
  9. 9. Typos in robots.txt: The file name must be lowercase (robots.txt) and the paths in its rules are case sensitive, so typos are easy to make. It is advisable to use a tool to generate the file.
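Case sensitivity is easiest to see with the `/*.PDF$` rule from slide 7. The sketch below is a simplified matcher written for illustration, not any search engine's actual implementation: it treats `*` as "any run of characters" and a trailing `$` as "end of path", which is the wildcard syntax the slide's rule relies on. The function name `rule_matches` is an assumption.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    Supports '*' (any run of characters) and a trailing '$' (end of path).
    Matching starts at the beginning of the path and is case sensitive.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/*.PDF$", "/docs/report.PDF"))  # matches
print(rule_matches("/*.PDF$", "/docs/report.pdf"))  # no match: different case
print(rule_matches("/scripts", "/Scripts/app.js"))  # no match: different case
```

A rule written as `/*.PDF$` therefore leaves lowercase `.pdf` files crawlable, which is exactly the kind of typo the slide warns about.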
  10. 10. Samples:   
  11. 11. Online tools to create robots.txt 
  12. 12. Meta tags for Robots: Rules for robots can also be set at the HTML page level. Meta tags: <META name="robots" content="NOINDEX, NOFOLLOW"> <meta name="googlebot" content="noindex" /> <meta name="googlebot-news" content="nosnippet"> HTTP header: X-Robots-Tag: noindex If a site has both robots.txt and meta tags, search engines look at robots.txt first and then at the meta tags in the page, so a page blocked in robots.txt is never fetched and its meta tags are never read. Meta tag attribute values are case insensitive; paths in robots.txt are case sensitive.
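A crawler's view of these page-level rules can be sketched with Python's standard `html.parser`. The code below is a simplification under stated assumptions: it merges the generic `robots` tag with the tag for one named crawler, lowercases everything (values are case insensitive), and treats an `X-Robots-Tag` header value as a plain comma-separated list. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


def _split(value):
    """Split a comma-separated directive list into a lowercase set."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from meta tags aimed at 'robots' or one named crawler."""

    def __init__(self, crawler):
        super().__init__()
        self.names = {"robots", crawler.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() in self.names:
            self.directives.update(_split(attr.get("content", "")))


def robots_directives(html, crawler="googlebot", x_robots_tag=""):
    """Return the directives that apply to the crawler for this page."""
    parser = RobotsMetaParser(crawler)
    parser.feed(html)
    parser.close()
    return parser.directives | _split(x_robots_tag)


page = (
    "<head>"
    '<META name="robots" content="NOINDEX, NOFOLLOW">'
    '<meta name="googlebot-news" content="nosnippet">'
    "</head>"
)
print(sorted(robots_directives(page)))
print(sorted(robots_directives(page, crawler="googlebot-news")))
```

The `nosnippet` directive shows up only for `googlebot-news`, because that tag is addressed to that crawler alone.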
  13. 13. Meta tag values for search engines:
  14. 14. Other HTML tags used by web robots: <TITLE> <META NAME="DESCRIPTION" CONTENT="Nagaraju Sangam"> <META NAME="AUTHOR" CONTENT="Nagaraju Sangam"> <META HTTP-EQUIV="CONTENT-LANGUAGE" CONTENT="en-US,fr"> <META HTTP-EQUIV="EXPIRES" CONTENT="Thu, 30 May 2013 12:00:00 GMT"> <META NAME="KEYWORDS" CONTENT="music,news,entertainment">
  15. 15. Title & Description in search results: Title: Comes from the <title> tag in the head section of the page. If no title is found, the search engine uses a heuristic to generate one. Description: Comes from the description meta tag in the head section of the page. If no description is found, the search engine generates one heuristically, and the result may not represent the page well. <meta name="description" content="description goes here..."> It is a best practice to add a title and a description to every page of the site. The title should be unique for each page.
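The best practice on this slide (every page has a title and a description, and titles are unique) lends itself to a simple audit. The following sketch uses Python's standard `html.parser`; the function names and the three sample pages are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Extracts the <title> text and the description meta tag from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = {name: (value or "") for name, value in attrs}
            if attr.get("name", "").lower() == "description":
                self.description = attr.get("content", "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def head_info(html):
    """Return (title, description) for one page; either may be empty."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip(), parser.description


def audit(pages):
    """pages maps a URL path to its HTML; returns a sorted list of (path, problem)."""
    info = {path: head_info(html) for path, html in pages.items()}
    title_counts = Counter(title for title, _ in info.values() if title)
    problems = []
    for path, (title, description) in info.items():
        if not title:
            problems.append((path, "missing title"))
        elif title_counts[title] > 1:
            problems.append((path, "duplicate title"))
        if not description:
            problems.append((path, "missing description"))
    return sorted(problems)


# Hypothetical pages for illustration only.
pages = {
    "/": '<head><title>Home</title><meta name="description" content="Welcome"></head>',
    "/about": "<head><title>Home</title></head>",
    "/contact": '<head><meta name="description" content="Contact us"></head>',
}
for path, problem in audit(pages):
    print(path, problem)
```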
  17. 17. References: Google is the best place to search; use the terms below: •Web SEO •Web Accessibility
