What I learned from analysing thousands of robots.txt files
samgipson
# brightonSEO 2020
Of all things... why robots.txt?

2nd July 2019: a Google Webmasters blog post.

Robots Exclusion Checker
How many top-performing sites still use unsupported or incorrect rules?

What are the most common mistakes within robots.txt?
Robots.txt: The history

Based on the Robots Exclusion Protocol (REP).

Millions of sites use a robots.txt file, despite it not being an official internet standard.
Control the content crawlers can and can’t access. It’s hugely powerful. Mistakes can cost you big.
Did you guess the year? 1994!

FACT: In 2019 Google submitted a revised REP draft to try to make it an official standard.
Robots.txt: The basics
User-agent: *
Disallow: /checkout/

The anatomy:
{field}: User-agent, Disallow
{value}: *, googlebot
{directive or rule}: Disallow: /checkout/
{path}: /checkout/

A user-agent line plus its rules form a {group}:

User-agent: *
Disallow: /checkout/

User-agent: googlebot
Disallow: /checkout/
Disallow: /basket/

{group A} {group B}
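The anatomy above can be made concrete with a short parser sketch. This is illustrative only (the function name `parse_robots` and the tuple layout are our own, not Google's parser); it splits a file into groups of user-agents and their directives:

```python
# Minimal, illustrative robots.txt group parser (an assumption/sketch,
# not Google's implementation). Each group is (user_agents, rules).
def parse_robots(text):
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:  # a rule has been seen, so a new group starts here
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)  # consecutive user-agent lines share a group
        elif field in ("allow", "disallow"):
            rules.append((field, value))
    if agents or rules:
        groups.append((agents, rules))
    return groups

example = """User-agent: *
Disallow: /checkout/

User-agent: googlebot
Disallow: /checkout/
Disallow: /basket/
"""
print(parse_robots(example))
```

Running it on the two-group example above yields one group for `*` and one for `googlebot`.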
FACT: robots.txt controls crawling, not indexation.
Here’s where it got confusing... Google used to support unofficial directives.
HTML <head>:
<meta name="robots" content="noindex, nofollow">

HTTP header:
X-Robots-Tag: googlebot: noindex, nofollow

robots.txt:
User-agent: googlebot
Noindex: /checkout/
Nofollow: /checkout/
Webmasters / SEOs realised that noindex: worked.

Google Webmasters blog post
My analysis

STEP ONE: Identified top traffic-driving sites across a range of sectors.
Automotive
Computing
Cooking/Recipes
Electronics
Fashion
Gambling
Hardware
Health/Medical
Insurance
Jobs
News
Real Estate
Telecoms
Travel
STEP TWO: Extracted the robots.txt files for 40,000 unique domains.
STEP THREE: Checked for:

Noindex:
Nofollow:
Crawl-delay:

<field>
<value>
<directive>
<path>
Results: Unsupported rules

Out of the 40,000 sites analysed, 0.5% used unsupported rules.
Nofollow: 1 of 40,000 domains analysed (Gambling)

Crawl-delay: 2,600 of 40,000 domains analysed (Real Estate, Hardware/DIY, Fashion)

Noindex: 220 of 40,000 domains analysed (Retail, Finance, Jobs, Health)
Brands using outdated rules
Results: Basic Mistakes
Issue 1: <field> name spelt incorrectly

FACT: <field> name is case insensitive.

This is ok:
User-Agent
user-agent
USER-AGENT
UsEr-AgEnt
This ISN’T:
useragent
user agent
er-agent
ser-agent
user-agnet

<field> name errors: 30 of 40,000 domains analysed (Telecoms)
Issue 2: Incorrect user-agent <value>

FACT: User-agent <value> is case insensitive.

This is ok:
Googlebot
googlebot
GOOGLEBOT
Bingbot
bingbot
This is a grey area:
Googlebotrandomtext
Google bot
goglebot
Google
Issue 3: Incorrect directives

FACT: <directives> are case insensitive.

This is ok:
allow:
ALLOW:
Allow:
disallow:
DISALLOW:
Disallow:
This ISN’T:
dissalow:
dissallow:
disallo:
disalow:
allw:

<directive> errors: 18 of 40,000 domains analysed (all sectors)
Issue 4: Invalid <path> format

FACT: URL <path> should start with a /

This is ok:
Disallow: /checkout/
Disallow: /*?delivery_type
Disallow: *?delivery_type
This ISN’T:
Disallow: .js
Disallow: .css
Disallow: WebResource.axd
Disallow: ScriptResource.axd
Disallow: js/
Disallow: http://site.com/page
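Misspelt fields, dropped rules, and malformed paths like these are easy to flag automatically. Below is a rough lint sketch; the helper `lint_robots` and its field lists are our own assumptions, not an official tool:

```python
# Rough robots.txt linter sketch (illustrative, our own helper names).
SUPPORTED = {"user-agent", "allow", "disallow", "sitemap"}
UNSUPPORTED = {"noindex", "nofollow", "crawl-delay"}  # rules Google dropped

def lint_robots(text):
    issues = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            issues.append((n, "missing ':' separator"))
        elif field in UNSUPPORTED:
            issues.append((n, f"unsupported rule '{field}:'"))
        elif field not in SUPPORTED:
            issues.append((n, f"unknown field '{field}'"))
        elif field in ("allow", "disallow") and value and not value.startswith(("/", "*")):
            issues.append((n, f"path '{value}' should start with / or *"))
    return issues

sample = "User-agnet: *\nDisallow: js/\nCrawl-delay: 10\nDisallow: /checkout/"
for issue in lint_robots(sample):
    print(issue)
```

The sample triggers all three categories of mistake found in the analysis: a misspelt field name, a path without a leading slash, and an unsupported rule.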
Incorrect <path>: 231 of 40,000 domains analysed, with an equal spread across sectors.

Brands using incorrect <path>
FACT: URL <path> IS case sensitive.
Additional takeaways
FACT: A specific user-agent overrules a catchall.
User-agent: *
Disallow: /checkout/
Disallow: /*?delivery_type

User-agent: googlebot
Disallow: /another-folder/

Here googlebot follows only its own group, so to keep the catchall rules it must repeat them:

User-agent: *
Disallow: /checkout/
Disallow: /*?delivery_type

User-agent: googlebot
Disallow: /checkout/
Disallow: /*?delivery_type
Disallow: /another-folder/
FACT: The order of <directives> doesn’t matter for most bots.

FACT: Specificity (length) of the matching rule wins.
# brightonSEO
https://example.com/page
disallow: /
allow: /p
samgipson
# brightonSEO
https://example.com/page
disallow: /
allow: /p
samgipson
# brightonSEO
https://example.com/page
disallow: /
allow: /p
samgipson
# brightonSEO
Conflict?
Least restrictive WINS
samgipson
# brightonSEO
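The longest-match precedence can be sketched in a few lines of Python. This is an illustrative model of the rule described above (prefix matching only, no wildcard support, helper name our own), not Google's matcher:

```python
# Sketch of longest-match rule precedence: the most specific (longest)
# matching rule wins; on a length tie, the least restrictive (allow) wins.
def is_allowed(path, rules):
    # rules: list of (directive, rule_path) tuples, e.g. ("allow", "/p")
    best_len, best_allow = -1, True  # no matching rule => crawling allowed
    for directive, rule_path in rules:
        if path.startswith(rule_path):
            length = len(rule_path)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, best_allow = length, directive == "allow"
    return best_allow

rules = [("disallow", "/"), ("allow", "/p")]
print(is_allowed("/page", rules))   # "/p" (len 2) beats "/" (len 1) -> True
print(is_allowed("/about", rules))  # only "/" matches -> False
```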
FACT: You can group user-agents together.
User-agent: googlebot
Disallow: /checkout/
Disallow: /*?delivery_type

User-agent: bingbot
Disallow: /checkout/
Disallow: /*?delivery_type

...is equivalent to:

User-agent: googlebot
User-agent: bingbot
Disallow: /checkout/
Disallow: /*?delivery_type
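Python's standard library can demonstrate the grouped form. Note that `urllib.robotparser` implements the classic REP without Google-style `*` wildcards, so this sketch sticks to a plain path prefix:

```python
import urllib.robotparser

# Two user-agents sharing one group of rules.
ROBOTS = """\
User-agent: googlebot
User-agent: bingbot
Disallow: /checkout/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Both grouped agents are blocked from /checkout/ ...
print(rp.can_fetch("googlebot", "https://example.com/checkout/"))  # False
print(rp.can_fetch("bingbot", "https://example.com/checkout/"))    # False
# ... while other paths remain crawlable.
print(rp.can_fetch("bingbot", "https://example.com/products/"))    # True
```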
Summary
Google are pushing for REP to become an internet standard.

We should all be pushing for a best-practice robots.txt.

Avoid Google having to make allowances for inaccuracies. Who knows… they may suddenly stop.

Get the basics right. Many big brands aren’t.

Test. Dig deeper. Nail it.
Further reading/resources

Chrome Extension: Robots Exclusion Checker (samgipson.com/robots/)
ContentKing: Robots.txt for SEO (contentkingapp.com/academy/robotstxt/)
Ayima: Robots.txt Parser (ayima.com/robots/)
Builtvisible: An SEO Guide to Robots.txt (builtvisible.com/wildcards-in-robots-txt/)
Google’s Webmaster Robots.txt Testing Tool (google.com/webmasters/tools/robots-testing-tool)
Original Robots.txt Draft (1996) (robotstxt.org/norobots-rfc.txt)
Google’s C++ robots.txt parser and matcher (github.com/google/robotstxt)
Google’s Robot Exclusion Protocol Draft (2019) (ietf.org/archive/id/draft-rep-wg-topic-00.txt)
Thank you.

@samgipson
samgipson
samgipson.com