
Filip Podstavec - Get inside the head of a crawler

In 2014, Filip managed to exceed our audience's expectations with a well-researched and energizing lecture. Since then, he's managed to build a successful tech startup and has worked for clients across the globe. We are very proud to present you one of the brightest minds on the Czech marketing scene!

Published in: Data & Analytics

  1. 1. Get INSIDE THE SEARCH ENGINE CRAWLER HEAD Filip Podstavec
  2. 2. Dr.
  3. 3. Supermodel gynecologist
  4. 4. PSYCHOLOGISTS
  5. 5. https://orig14.deviantart.net/9386/f/2015/100/e/8/mixels_cc__1_dizzy_robot_ by_supercoco142-d8p7kph.png
  6. 6. Goal: FASTEST
  7. 7. FASTEST MOST ACCESSIBLE Goal:
  8. 8. FASTEST MOST ACCESSIBLE RICHEST Goal:
  9. 9. Why?
  10. 10. Higher crawl rate Benefits
  11. 11. Higher crawl rate More indexed pages Benefits
  12. 12. Higher crawl rate More indexed pages Make website faster Benefits
  13. 13. Higher crawl rate More indexed pages Make website faster Cleaner architecture Benefits
  14. 14. Higher crawl rate More indexed pages Make website faster Cleaner architecture Fixed error pages Benefits
  15. 15. January 2016 April 2016 July 2016 October 2016 January 2017 Organic traffic boost Start of crawl budget optimization
  16. 16. Higher crawl rate ≠ direct ranking signal
  17. 17. How?
  18. 18. Make diagnosis
  19. 19. Medical History 12.6.2010 | Filip Podstavec | sick | height: 183cm | 120/60 17.9.2010 | Filip Podstavec | sick (again) | height: 184cm | 120/60 06.4.2014 | Filip Podstavec | diarrhea | height: 184cm | 120/60 Filip Podstavec | aneurysm | 0
  20. 20. Medical History 12.6.2010 | Filip Podstavec | sick | height: 183cm | 120/60 17.9.2010 | Filip Podstavec | sick (again) | height: 184cm | 120/60 06.4.2014 | Filip Podstavec | diarrhea | height: 184cm | 120/60 Filip Podstavec | aneurysm | 0
  21. 21. 155.62.100.122 - - [25/Aug/2017:08:22:55 -0400] "GET /category/mktfest/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  22. 22. 155.62.100.122 [25/Aug/2017:08:22:55 -0400] GET /category/mktfest/ HTTP/1.1 200 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  23. 23. IP (WHO) Timestamp (WHEN) Method and requested URL (WHAT) Status code (SUCCESSFULLY OR NOT) User-agent (DETAILS)
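The five fields called out above can be pulled from a raw access-log line with one regular expression. A minimal Python sketch (the pattern and names are my own, assuming the combined log format shown on the previous slides):

```python
import re

# Hypothetical sketch: parse one combined-format access log line into the
# WHO / WHEN / WHAT / STATUS / DETAILS fields listed on the slide above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}).*?"(?P<agent>[^"]*)"$'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('155.62.100.122 - - [25/Aug/2017:08:22:55 -0400] '
        '"GET /category/mktfest/ HTTP/1.1" 200 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
fields = parse_line(line)
```

Note that every captured field, including `status`, comes back as a string; cast as needed.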
  24. 24. “The biggest source of website URLs”
  25. 25. Web Logs
  26. 26. Web Logs
  27. 27. Web Logs
  28. 28. The most accurate source about crawl budget
  29. 29. What is the crawl budget?
  30. 30. Crawl budget is...
  31. 31. How does Googlebot allocate the crawl budget for your website?
  32. 32. How much time it allocates: Authority * Amount of acquired new info
  33. 33. How do you use that time: Speed
  34. 34. How do you care about future crawling: Internal linking
  35. 35. “People choose the paths that grant them the greatest rewards for the least amount of effort.” Dr. House / David Shore
  36. 36. “Crawlers choose the paths that grant them the greatest rewards for the least amount of effort.” Filip Podstavec
  37. 37. “Crawlers choose the paths that grant them the greatest rewards for the least amount of effort.” Filip Podstavec
  38. 38. Do I have logs? Yes / No / Yes, but I don’t know about that
  39. 39. mktfest_com.log (146.55 GB) podstavec_cz.log (11.07 GB)
  40. 40. Tools for log analysis
  41. 41. Static Real-time vs.
  42. 42. Static
  43. 43. < 50MB (Small Files)
  44. 44. 50MB - 10GB (Medium Files) Google BigQuery OpenRefine Screaming Frog Log Analyzer
  45. 45. > 10 GB (Large Files) Google BigQuery
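Before reaching for BigQuery, a multi-gigabyte log can often be shrunk dramatically by streaming out only the bot lines you care about. A hypothetical Python sketch (the function name is my own illustration):

```python
# Hypothetical sketch: stream a huge access log line by line and keep only
# the Googlebot hits, so the full file never has to fit in memory or a tool.
def filter_googlebot(in_path, out_path):
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if 'Googlebot' in line:
                dst.write(line)
```

Matching on the user-agent string alone still lets fake Googlebots through; the IP verification covered later in the deck closes that gap.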
  46. 46. Real-time
  47. 47. Real-time logs
  48. 48. = 20 minutes = $20 / month + a server
  49. 49. Bit.ly/mktfestfilip
  50. 50. What to check in logs: (diagnosis)
  51. 51. #1 Which search engine robots crawl my website, and how much?
  52. 52. Bot requests in last 7 days Googlebot Bingbot Slurp SeznamBot Other
  53. 53. Bot requests in last 7 days #1 Bot requests Googlebot Bingbot Slurp SeznamBot Other Googlebot Bingbot Slurp SeznamBot Other
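A tally like the chart above can be produced by matching user-agent substrings. A Python sketch (the signature list, sample agents, and function name are my own illustration):

```python
from collections import Counter

# Hypothetical sketch: tally requests per search-engine bot by matching a
# substring of the user-agent field (assumed already parsed out of the logs).
BOT_SIGNATURES = [
    ('Googlebot', 'Googlebot'),
    ('bingbot', 'Bingbot'),
    ('Slurp', 'Slurp'),
    ('SeznamBot', 'SeznamBot'),
]

def bot_requests(user_agents):
    counts = Counter()
    for agent in user_agents:
        for signature, name in BOT_SIGNATURES:
            if signature in agent:
                counts[name] += 1
                break
        else:
            # No known bot signature matched this user-agent.
            counts['Other'] += 1
    return counts

agents = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'curl/7.54.0',
]
counts = bot_requests(agents)
```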
  54. 54. DESKTOP : MOBILE : IMAGE Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Googlebot-Image/1.0 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  55. 55. #2 Googlebot agents Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Googlebot-Image/1.0 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  56. 56. #3 Which pages do they visit with the highest frequency?
  57. 57. Most visited URLs by Googlebot
  58. 58. Most visited URLs by Googlebot #3 Most visited URLs
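The "most visited URLs" report is a simple frequency count over the URL field of Googlebot's log lines. A hypothetical Python sketch with invented sample paths:

```python
from collections import Counter

# Hypothetical sketch: top-N most requested paths from Googlebot log lines.
# The sample URLs below are invented for illustration.
def most_visited(urls, n=3):
    return Counter(urls).most_common(n)

urls = ['/robots.txt', '/', '/robots.txt', '/category/mktfest/', '/robots.txt', '/']
top = most_visited(urls)
```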
  59. 59. Most crawled URL?
  60. 60. .txt
  61. 61. #4 Which status codes does Googlebot crawl, and how many of each?
  62. 62. 200 404 301 302 500
  63. 63. #4 Pie chart of Googlebot status codes 200 404 301 302 500 200 404 301 302 500
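The status-code breakdown, plus the share of crawl budget burned on errors (status 400 and above), can be computed the same way. A Python sketch with made-up sample data:

```python
from collections import Counter

# Hypothetical sketch: distribution of status codes in Googlebot hits, plus
# the share of responses that were errors (status >= 400). Sample data only.
def status_report(statuses):
    counts = Counter(statuses)
    error_share = sum(n for code, n in counts.items() if code >= 400) / len(statuses)
    return counts, error_share

codes = [200, 200, 200, 301, 404, 200, 500, 302, 200, 404]
counts, error_share = status_report(codes)
```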
  64. 64. 200 404 301 302 500
  65. 65. Where/when/why do they request error pages?
  66. 66. Googlebot hits with status code higher than 399
  67. 67. #5 Googlebot errors
  68. 68. #6 User errors
  69. 69. Does somebody crawl my website?
  70. 70. 61.8% of your traffic is bots!
  71. 71. #7 IP Requests #8 Googlebot IPs
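To tell real Googlebot IPs from impostors that merely fake the user-agent string, Google documents a two-step reverse-DNS plus forward-DNS check. A Python sketch of that check (the function names are mine; only the verification steps come from Google's guidance):

```python
import socket

def looks_like_google_host(host):
    # Genuine Googlebot reverse-DNS names end in googlebot.com or google.com.
    return host.endswith(('.googlebot.com', '.google.com'))

def is_real_googlebot(ip):
    try:
        # Step 1: reverse DNS lookup on the requesting IP.
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        # Step 2: forward-resolve that host and confirm it maps back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

The suffix check alone is not enough: a hostname like `googlebot.evil.example.com` must not pass, which is why the comparison anchors on the trailing domain and the forward lookup must round-trip to the same IP.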
  72. 72. What can you do? (content point of view)
  73. 73. Content
  74. 74. is
  75. 75. What can you do? (technical point of view)
  76. 76. 503Service unavailable
  77. 77. “Be in touch with somebody responsible for infrastructure!” Filip Podstavec
  78. 78. Communicate with your sysadmin
  79. 79. https://twitter.com/dimensionmedia/status/877513185238151168
  80. 80. DISALLOW
  81. 81. “Not all of your URLs should be crawled or indexed”
  82. 82. Links
  83. 83. URL1 URL2 URL3 URL4 URL5
  84. 84. <a href="a/"> /a/ /a/a/ /a/a/a/ /a/a/a/a/
  85. 85. Sanitize the pagination
  86. 86. Rel="prev" <link rel="prev" href="http://www.example.com/page1" /> Rel="next" <link rel="next" href="http://www.example.com/page3" /> Noindex,follow <meta name="robots" content="noindex, follow">
  87. 87. Sanitize the filter combinations
  88. 88. http://edition.cnn.com/videos/tech/2016/08/26/black-hole-breakthrough-lee-pkg.cnn
  89. 89. 10 filters 25 variants
  90. 90. 95 367 431 640 625 variants (25^10)
  91. 91. 3.8 Years
  92. 92. How to fix that? Create rules like:
  93. 93. Disallow filters without search volume (price, etc.) Example: Eshop.tld/notebooks/price-0-1000/ Robots.txt block: Disallow: */price
  94. 94. Disallow combinations of more than one filter from the same segment Example: Eshop.tld/notebook/acer,apple,lenovo/ Robots.txt block: Disallow: /*,*,*
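Rules like the two examples above rely on Google-style `*` wildcards, which the standard library's `urllib.robotparser` may not handle, so testing them calls for a small hand-rolled matcher. A hypothetical sketch (names and logic are my own, not from the talk):

```python
import re

# Hypothetical sketch: check URL paths against Google-style robots.txt
# Disallow patterns, where '*' matches any run of characters and the rule
# anchors at the start of the path.
def is_disallowed(path, patterns):
    for pattern in patterns:
        regex = '^' + re.escape(pattern).replace(r'\*', '.*')
        if re.match(regex, path):
            return True
    return False

rules = ['*/price', '/*,*,*']   # the two example rules from the slides
```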
  95. 95. Delete filters that users don't use
  96. 96. Use pseudo checkbox Example: <input><label><a>Select link</a></label> Example URL: https://notebooky.heureka.cz/
  97. 97. Avoid thin content crawling
  98. 98. Disallow: /directory/ <meta name="robots" content="noindex">
  99. 99. Sometimes speed matters
  100. 100. Google PageSpeed Insights
  101. 101. Google Lighthouse
  102. 102. https://varvy.com/ 
  103. 103. Keep your sitemap clean
  104. 104. Are you happy?
  105. 105. Thank you! Filip Podstavec THE MAIN CONSTRUCTOR OF MARKETING MINER FILIP@MARKETINGMINER.COM
