Why Crawling Sites is Hard and What You Can Do About It

In this presentation, given at the Jane and Robot Developer Summit on October 8, 2009, I discuss some of the common crawlability problems websites face. I invite developers and SEOs to think like a search engineer and integrate best practices into their everyday work.

Why It's Hard to Crawl Your Website, and What You Can Do to Help
Nick Gerner -- Software Engineer and Web Crawler, SEOmoz.org

Building a Crawler

Search Goals For Web Crawling: Requirements v1.0
  • Collect data to answer queries like:
      • "Canon SLR review"
      • "Vacation in the Caribbean"
      • "mod_rewrite apache"
  • Send searchers to relevant sites/pages

Crawling The Web

  while (morePages) {
      FetchNextPage
      ParseXHTML
      AddNewLinksToCrawl
  }

Crawling The Web (with the hard questions)

  while (morePages) {        // duplicate content?
      FetchNextPage          // 200 OK? 3xx? 5xx? 404? politeness?
      ParseXHTML             // hah!
      AddNewLinksToCrawl     // but which ones?
  }

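The loop above, with at least minimal answers to those questions, can be sketched in Python. This is a toy, not a production crawler: `fetch` is an injected stand-in for a real HTTP client, which is also where politeness delays and robots.txt checks would live; all names here are illustrative, not from the talk.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`.

    `fetch(url)` must return (status_code, html_text); injecting it keeps
    politeness, retries, and robots.txt handling out of the core loop.
    """
    frontier = deque([seed])
    seen = {seed}          # avoids re-queuing duplicate URLs
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        status, html = fetch(url)
        if status != 200:  # the slide's "200 OK? 3xx? 5xx? 404?" question
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)  # the "ParseXHTML -- hah!" step
        for link in parser.links:       # "AddNewLinksToCrawl -- but which ones?"
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

Even this sketch shows why the one-liner is deceptive: status handling, URL canonicalization, and the seen-set are where the real engineering lives.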
Search Goals For Web Crawling: Requirements v2.0
  • Avoid duplicate content
  • Avoid spam
  • Send searchers to the highest-quality sites
  • Don't crash websites with the crawl
  • Keep content up to date
  • ...iterate for requirements v3.0, v4.0...

What Can You Do To Help?

Build For Users, Not Robots!
  "The formula is simple: design for people, be smart about robots, and you will achieve long-lasting success." --Jane and Robot Philosophy

Build For Users, Not Robots
  • Robots do not:
      • buy products
      • subscribe to services
      • listen to advice
      • click ads
  • Search engines like websites that have good usability
  • Search engines do NOT like websites that lie to or confuse them
  • Investing in usability once pays double!

Problems You Can Help to Solve

Duplicate Content
  "A majority of the web is duplicate content." --Joachim Kupke, Google Engineer on Duplicate Content
  [slide graphic: the same "Foo" page repeated over and over]

Duplicate Content
  • A majority of SEOmoz consulting is duplicate content!
  • Many sources:
      • Search results
      • Sorting lists
      • Session IDs
      • View state
      • Campaign tracking codes
      • ...
  • Example: http://bit.ly/ZPkIE

Duplicate Content Solutions
  • Don't have dups!
  • robots.txt stuff out ← don't!
  • Redirect (301)
  • rel=canonical (at least for Google today)
  • sitemap.xml ( www.sitemaps.org )
  • See http://bit.ly/NDYYp
      • Jane and Robot "URL Referrer Tracking"

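For the rel=canonical option, the tag goes in the `<head>` of every duplicate variant and points at the one true URL. A minimal example (the URLs are illustrative, not from the talk):

```html
<!-- in the <head> of a duplicate variant such as
     http://www.example.com/widgets?sessionid=123&sort=price -->
<link rel="canonical" href="http://www.example.com/widgets" />
```

The search engine then consolidates the duplicates' signals onto the canonical URL instead of splitting them across every session-ID and sort-order variant.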
www vs non-www
  [slide graphic: www.searchengineland.com and searchengineland.com, each shown as a pile of duplicate pages]

www vs non-www
  • No one cares whether you're www or not
  • ...but choose one!
  • Stick to it throughout your site
  • 301 EVERYTHING to the canonical form
  • My slight preference is for www:
      • many people will link to the www version anyway
      • if you've screwed it up, that leaves you less risk

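One common way to 301 everything to the canonical host on Apache is a mod_rewrite rule. A sketch for the non-www-to-www direction, assuming mod_rewrite is enabled and `example.com` stands in for your domain:

```apache
# .htaccess or vhost config: 301 every non-www request to www
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Swap the condition and target to prefer the bare domain instead; the point is that one form 301s to the other site-wide.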
URL Structure
  • Descriptive and human-readable
  • A single name for a single piece of content
  • Think about usability:
      • Clickable
      • Linkable
      • Memorable

  http://my.url.says/a/lot?about=me

URL Structure
  • Good:
      • www.example.com/widgets/multi-purpose/blue-sprocket
  • Not great:
      • www.example.com/;jsessionid=0683s?pid=33&sid=123&cid=831
  • Worse:
      • www.example.com/jsessionid/067x83s/pid/33/sid/123/cid/831
  • Terrible:
      • http://www.serde.net/(X(1)A(o0Byn2nEyQEkAAAAZjUxMmI0ZGMtNDA4MS00ZDY0LTk5ZTAtNTA2MWY2M2Q5MTcxHreLE5sE1J_bXKfCHoBGWNFZTJ81)S(np4fn330ilbrliusxoe5k445))/GetThreadsRss.aspx?ForumID=5

Site Volume and Indexation
  Holy crap, that's a lot of crawling! Time to prioritize what's important!

Site Volume
  • Organize content under a hierarchy:
      • Important stuff goes at the top
      • Good link navigation to important but deep content
  • Think about information architecture
  • Use a sitemap ( www.sitemaps.org ) to tell search engines:
      • What's important
      • What's updated

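A minimal sitemap.xml following the www.sitemaps.org protocol, using `<lastmod>` to signal what's updated and `<priority>` to signal what's important (the URL and values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/widgets/blue-sprocket</loc>
    <lastmod>2009-10-01</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```

List only canonical URLs here; a sitemap full of duplicate variants just re-creates the duplicate-content problem in a new file.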
Bad Architecture
  Uh... what is important here?

Bad Architecture
  • Complex navigation
  • Important content is deeply hidden
  • Hard to distinguish important content from unimportant

Good Architecture
  Oh, I can understand that!

Good Architecture
  • Intuitive information layout
  • Short path to relevant content

HTTP Status Codes
  [slide graphic: YourSite.com answering "200 Not Found" -- uh... that's not right]

HTTP Status Codes
  • 200 OK is great for:
      • Unique content that you want to deliver to users
      • Non-canonical content carrying rel=canonical
  • 301 is great for non-canonical URLs
  • 302 is temporary!
  • 304 for conditional GET is great
  • 404 for "does not exist" (NOT 200!)
  • 5xx for temporarily down (NOT 200 or 404!)
  • Familiarize yourself with RFC 2616

Robots Exclusion Protocol & robots.txt

Robots Exclusion Protocol & robots.txt
  • Keep it succinct (not 10 MB!!)
  • Use it sparingly and simply:
      • You don't know your future URL structure
      • You don't know future search engines
      • This is a sledgehammer
  • Forget robotstxt.org; use this tool as the source of truth: http://www.google.com/webmasters -> your site -> Site configuration -> Crawler access -> Test robots.txt

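A robots.txt in that succinct spirit can be just a few lines; the paths below are illustrative, not a recommendation for any particular site:

```
# robots.txt -- keep it short and test it in Webmaster Tools
User-agent: *
Disallow: /search
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
```

Note that `Disallow` is a prefix match on the URL path, which is exactly why broad rules here act like the sledgehammer the slide warns about.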
Robots Exclusion Protocol & robots.txt: Gotchas
  • URLs are (sometimes) case-sensitive!
  • Easy to mess up special-casing for different crawlers
  • Avoid duplicate blocks
  • Some guy's kid got pissed and de-indexed the corp site

Robots Exclusion Protocol & robots.txt: Good Stuff to Block
  • Stuff under development
      • But why is it public at all?
  • Search result pages
      • Unless you're clever
      • But don't be clever in general!
  • Private information (e.g. PII)
  • Access-restricted sections
      • Crawlers can't get to them anyway
      • Better yet, 301 these to a better place

Robots Exclusion Protocol & meta robots
  • NOINDEX: don't serve this result
      • Different from a robots.txt Disallow
  • NOFOLLOW: rel=nofollow on all links
  • NOODP: don't use DMOZ for my snippet
      • Just use a meta description if you can
  • http://searchengineland.com/meta-robots-tag-101-blocking-spiders-cached-pages-more-10665
  • http://googlewebmastercentral.blogspot.com/2007/03/using-robots-meta-tag.html

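The meta robots directives above go in the page's `<head>`; multiple directives can be combined in one `content` value. A small example (the description text is illustrative):

```html
<head>
  <!-- don't index this page, and don't follow its links -->
  <meta name="robots" content="noindex, nofollow" />
  <!-- prefer your own snippet over the DMOZ description -->
  <meta name="robots" content="noodp" />
  <meta name="description" content="Your own hand-written snippet." />
</head>
```

Unlike a robots.txt Disallow, the crawler must be able to fetch the page to see NOINDEX at all, so don't combine the two on the same URL.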
Getting Fancy (AJAX, Flash, etc.)
  • Progressive enhancement
  • Graceful degradation
  • Accessibility (e.g. for the blind, for the deaf)
  • KISS
  • Try your site out:
      • With JavaScript disabled (Web Developer toolbar)
      • In telnet

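Trying your site in telnet means speaking minimal HTTP by hand and reading back what a crawler would see. A sample session, not a script (hostname illustrative; a well-configured server should answer a missing page with 404, not 200):

```
$ telnet www.example.com 80
GET /no-such-page HTTP/1.1
Host: www.example.com

HTTP/1.1 404 Not Found
```

If the page renders nothing useful at this level -- no content, wrong status code -- a crawler is likely seeing the same.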
Help!
  • Google Webmaster Tools and forums:
      • http://www.google.com/webmasters
  • www.JaneAndRobot.com/
  • SEOmoz.org:
      • Blog (free)
      • Beginner's Guide to SEO (free)
      • site:seomoz.org Site Architecture (free and paid)

Questions?
