
How does Google crawl the web? - Botify at SMX Paris 2018

Learn brand new insights about how Google crawls the web! Dive into the takeaways of this Botify study based on the analysis of 413 million pages crawled by Botify and 6 billion Googlebot requests.


  1. 1. Dimitri Brunel Search Data Strategist @botify.com - SEO manager in Marketing Agencies & Pure Players (previously) - Currently part of the Botify Search Data Strategist team How does Google crawl the web ? Shedding light on common misconceptions and theories about Googlebot and SEO Alpha Keita Search Data Strategist @botify.com - Onboarding & API training Manager @Botify - Currently part of the Botify Search Data Strategist team
  2. 2. Load Times impact Crawl Ratio
  3. 3. Do load times impact all websites the same way? How much do load times impact Google’s crawl?
  4. 4. SMALL WEBSITES BIG WEBSITES
  5. 5. Add a scientific approach to the empirical one for detailed, data-backed insights. Scale up the dataset from a single website to a full set of websites from different industries. Confirm or invalidate assumptions about Google’s behavior and discover new ones. Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience. Improved SEO Methodology Real Insights More Precise Analysis Share a Belief
  6. 6. Crawl Ratio: percentage of compliant pages (indexable pages) crawled by Google in 30 days. Crawl Frequency: average number of times a website’s URL was crawled by Google in 30 days. Compliant URL: a URL that meets the following requirements: canonical tag to self or not set, HTTP 200 status code, text/HTML content, indexable status (no noindex meta tag).
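The “compliant URL” definition above is essentially a boolean check over crawl data. Below is a minimal sketch of that check, assuming a hypothetical `page` dict with illustrative field names; it is not Botify’s implementation.

```python
# Minimal sketch of the "compliant URL" rule above. The `page` dict and its
# field names are hypothetical; adapt them to your own crawl export.

def is_compliant(page: dict) -> bool:
    """A page is compliant when it is technically indexable."""
    canonical_ok = page.get("canonical") in (None, page["url"])         # canonical to self or not set
    status_ok = page.get("status_code") == 200                          # HTTP 200 status code
    content_ok = page.get("content_type", "").startswith("text/html")   # text/HTML content
    indexable = not page.get("noindex", False)                          # no noindex meta tag
    return canonical_ok and status_ok and content_ok and indexable

example = {"url": "https://www.example.com/p1", "canonical": None,
           "status_code": 200, "content_type": "text/html", "noindex": False}
print(is_compliant(example))  # True
```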
  7. 7. CRAWL RATIO 49%: percentage of compliant pages in the website’s structure crawled by Google in 30 days. ACTIVE PAGES RATIO 23%: percentage of pages that have generated at least one organic visit in 30 days. CRAWL FREQUENCY 2.3: average number of times a website’s URL was crawled by Google in 30 days.
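To make the three metrics concrete, here is a small sketch computing them from hypothetical inputs (compliant URLs from a crawl, Googlebot hits from 30 days of logs, and organic landing pages); the exact denominators are assumptions, not Botify’s formulas.

```python
# Illustrative computation of crawl ratio, active pages ratio, and crawl frequency.
# Inputs are hypothetical; a real analysis would join crawl data with log files.

def crawl_metrics(compliant_urls, googlebot_hits, organic_landings):
    compliant = set(compliant_urls)
    crawled = set(googlebot_hits)
    crawl_ratio = len(compliant & crawled) / len(compliant)                  # compliant pages crawled
    active_ratio = len(compliant & set(organic_landings)) / len(compliant)   # pages with >= 1 organic visit
    crawl_frequency = len(googlebot_hits) / len(crawled)                     # avg. hits per crawled URL
    return crawl_ratio, active_ratio, crawl_frequency

ratio, active, freq = crawl_metrics(
    compliant_urls=["/a", "/b", "/c", "/d"],
    googlebot_hits=["/a", "/a", "/b", "/x"],      # one URL crawled twice, one outside the structure
    organic_landings=["/a"],
)
print(f"crawl ratio={ratio:.0%}, active pages ratio={active:.0%}, crawl frequency={freq:.1f}")
```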
  8. 8. A website’s size is one of the most important factors impacting Google’s crawl.
  9. 9. The larger the website, the greater the impact (Orphan Pages, Load Time, % of Words vs. Template): some KPIs, like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but have a huge impact on big websites. Huge impact regardless of size (PageRank, Depth, Content Size): other KPIs, like PageRank dilution, depth, or, surprisingly, content size, have a big impact on Google’s crawl regardless of website size.
  10. 10. The data showed that bad HTTP Codes only had a small impact on Google’s crawl.
  11. 11. * Anonymized data
  12. 12. Data from Botify Analytics. Data from log files from the same websites. Data calculated with 30 days of logs. Websites that fall into one of the following industries: Retail, Publisher, Classified.
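On the log-file side, the raw material is simply the set of Googlebot requests found in the web server access logs. A hypothetical parsing sketch (combined log format assumed; this is not Botify’s log pipeline, and a production version should verify Googlebot via reverse DNS since user agents can be spoofed):

```python
import re
from collections import Counter

# Hypothetical sketch: count Googlebot requests per URL from an Apache/Nginx
# combined-format access log.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

def googlebot_hits(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LOG_LINE.search(line)
            if match and "Googlebot" in match.group("ua"):
                hits[match.group("path")] += 1
    return hits

# Example usage (path is illustrative):
# print(googlebot_hits("access.log").most_common(10))
```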
  13. 13. 270 websites. 413 million pages crawled and analyzed by Botify.* 6.2 billion pages crawled by Google and analyzed by Botify. *15% of the data crawled and analyzed each month by Botify.
  14. 14. Industries Analyzed Dataset by Website Size (in pages)
  15. 15. WEBSITE 1. Industry 2. Size STRUCTURAL KPIS 1. PageRank 2. Load Times 3. Depth 4. Outlinks 5. Content Size TYPE OF PAGES 1. Not Compliant Pages 2. Bad HTTP Codes 3. 304 HTTP Codes 4. Orphan Pages
  16. 16. WEBSITE 1. Industry ✘ 2. Size ✓ STRUCTURAL KPIS 1. PageRank ✓ 2. Load Times ✓ 3. Depth ✓ 4. Outlinks ✘ 5. Content Size ✓ TYPE OF PAGES 1. Not Compliant Pages ✓ 2. Bad HTTP Codes ✘ 3. Orphan Pages ✓
  17. 17. Website-related elements
  18. 18. Expected Results Similar crawl rate Different crawl frequency depending on industry CLASSIFIED PUBLISHER RETAILER
  19. 19. CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY ● Googlebot impartially crawls on the web. ● Googlebot crawls impartially regardless of industry. ● Publishers tend to have more active pages (in %). From our past Experience From the analysis of the Dataset Confirmation
  20. 20. CRAWL FREQUENCY BY INDUSTRY From the analysis of the Dataset New learnings!
  21. 21. Expected Results Decreasing crawl ratio Adaptive crawl frequency < 10K PAGES > 10K PAGES > 100K PAGES > 1 MILLION PAGES
  22. 22. CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE From our past Experience From the analysis of the Dataset Confirmation ● More pages means more difficulties for Googlebot. ● More pages means fewer active pages in the SERPs (in %). ● Small websites are better crawled by Google but still not crawled entirely. ● Big websites have a harder time effectively using Crawl Budget.
  23. 23. CRAWL FREQUENCY BY WEBSITE SIZE From the analysis of the Dataset Confirmation Big websites tend to have more long tail pages that will be less frequently crawled by Google. Good news: this can be influenced with crawl budget optimization.
  24. 24. Page-type-related elements
  25. 25. #3 - Not Compliant Pages. At last, a composite indicator: noindex status, HTTP codes other than 200, canonical tag not set to self, and/or content that is neither text nor HTML. Risk: badly indexable pages from a technical POV; sends a bad crawl signal to web spiders. Expected Results: Weakest Crawl Ratio Weak Indexation Lower Crawl Frequency
  26. 26. COMPLIANT PAGES CRAWLED BY BOTIFY VS. NOT COMPLIANT PAGES CRAWLED BY BOTIFY From our past Experience From the analysis of the Dataset Confirmation ● The proportion of not compliant pages (37%) is still too high compared to the goal of total indexability (100% compliant pages). ● The overall average shows that SEOs still have room for optimization. ● From our past experience we see that many websites still face this problem, usually because of: ○ Extensive use of noindex ○ Server errors ○ Incorrect canonical annotations 413M pages crawled
  27. 27. CRAWLED COMPLIANT PAGES VS. CRAWLED NOT COMPLIANT PAGES From our past Experience From the analysis of the Dataset Confirmation ● As most websites have a huge proportion of not compliant pages, Google is on average wasting 16% of its time crawling these useless pages, when it could focus instead on more interesting pages for searchers. ● Google is wasting time crawling not compliant pages.
  28. 28. CRAWL RATIO VS. % OF NOT COMPLIANT PAGES CRAWLED BY GOOGLE From our past Experience From the analysis of the Dataset Confirmation ● When the proportion of not compliant pages crawled by Google increases, the crawl ratio decreases. ● We expect that having more not compliant pages crawled by Google will have a negative impact on the compliant pages’ crawl ratio.
  29. 29. LESS THAN 100K PAGES MORE THAN 100K PAGES From the analysis of the Dataset Confirmation ● Low impact on small websites but huge on medium sites.
  30. 30. Expected Results Slow down / stop crawl Impact crawl efficiency #4 - Bad HTTP Codes 404 302 500 304 200
  31. 31. HTTP CODES DISTRIBUTION From our past Experience From the analysis of the Dataset New Learnings! ● The overall situation is quite good (code 200). ● The code 304 is truly underused; 304 is not commonly used by SEOs. ● From our experience, we see many problems related to bad HTTP codes: ○ Temporary redirects, redirect chains, redirect loops ○ Client errors, server errors...
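A practical follow-up on the under-used 304 code is to check whether your server actually honours conditional requests. A minimal check, assuming the `requests` library and an illustrative URL:

```python
import requests

# Does the server answer a conditional GET with "304 Not Modified"?
# The URL is illustrative; adapt it to a page of your own site.
url = "https://www.example.com/some-page"

first = requests.get(url, timeout=10)
conditional_headers = {}
if first.headers.get("ETag"):
    conditional_headers["If-None-Match"] = first.headers["ETag"]
if first.headers.get("Last-Modified"):
    conditional_headers["If-Modified-Since"] = first.headers["Last-Modified"]

second = requests.get(url, headers=conditional_headers, timeout=10)
print(second.status_code)  # 304 means crawlers can revalidate the page cheaply
```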
  32. 32. CRAWL RATIO VS. CRAWL SHARE IN % ON BAD HTTP CODES From our past Experience From the analysis of the Dataset New Learnings! ● We don’t see a huge impact on crawl ratio. ● Potential reason: most bad HTTP codes in the dataset are 3xx. These don’t consume much crawl budget. ● We could expect bad HTTP status codes to have a big impact on Google’s crawl ratio.
  33. 33. #5 - Orphan Pages ● that are outside of the website structure, ● that we did not discover, ● that Google crawled, ● that received crawl budget. Expected Results Cannibalization of crawl budget Lowering the crawl ratio of the site structure PAGES Crawled by BOTIFY Crawled by GOOGLE Crawled by Google AND Botify
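In practice, orphan pages can be approximated as a set difference between the URLs found in the site structure and the URLs Googlebot actually requested. A toy sketch with hypothetical URL sets (a real analysis would weight by crawl volume rather than by distinct URLs):

```python
# Toy sketch: orphan pages = URLs Googlebot requested (seen in the logs)
# that the structure crawl starting from the home page never reached.
structure_urls = {"/", "/category/a", "/product/1", "/product/2"}   # crawled by Botify
log_urls = {"/", "/product/1", "/old-promo", "/print/product/1"}    # crawled by Google

orphans = log_urls - structure_urls
orphan_share = len(orphans) / len(log_urls)   # by distinct URL, not by hit count

print(sorted(orphans))                        # ['/old-promo', '/print/product/1']
print(f"{orphan_share:.0%} of the URLs Google crawls are orphans")
```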
  34. 34. CRAWL VOLUME ON STRUCTURE PAGES VS. CRAWL VOLUME ON ORPHAN PAGES From our past Experience From the analysis of the Dataset Confirmation ● On avg. orphan pages steal ¼ of the crawl. ● We see a lot of orphan URLs. ● Common reasons: ○ Old implementations or technical regressions ○ No DNS cleaning
  35. 35. CRAWL RATIO VS. % OF ORPHAN PAGES CRAWLED BY GOOGLE From our past Experience From the analysis of the Dataset Confirmation ● Orphan pages tend to cannibalize crawl budget and impact the crawl ratio of the structural pages. ● As the percentage of orphan pages increases, the crawl ratio should be negatively impacted. Few orphans = Better crawl ratio More orphans = Lower crawl ratio
  36. 36. LESS THAN 100K PAGES MORE THAN 100K PAGES From the analysis of the Dataset New learnings! From our past Experience ● This is very true on big and gigantic websites only. ● Crawl budget cannibalization whatever the size of the website.
  37. 37. Structure-related elements
  38. 38. #6 - Internal PageRank Expected Results Diluting the Internal PageRank on Not Compliant Pages Should Positively Impact Google’s Crawl Ratio on Compliant Pages The popularity spread into the website internal structure A strong crawl signal supposed to pilot the Googlebot(s)
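Internal PageRank is the classic PageRank computation restricted to the site’s own link graph. A self-contained, toy power-iteration sketch (graph, damping factor, and iteration count are illustrative assumptions; this is not Botify’s implementation):

```python
# Toy internal PageRank by power iteration over a small internal link graph.
links = {
    "/":          ["/cat/a", "/cat/b"],
    "/cat/a":     ["/", "/product/1"],
    "/cat/b":     ["/", "/product/1"],
    "/product/1": ["/"],
}

def internal_pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)   # each outlink passes an equal share
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

for page, score in sorted(internal_pagerank(links).items(), key=lambda item: -item[1]):
    print(f"{page:12s} {score:.3f}")
```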
  39. 39. CRAWL RATIO VS. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES From our past Experience From the analysis of the Dataset Confirmation ● If compliant pages get more PageRank, their crawl ratio should improve. Pro Tips: ● Don’t waste PR with nofollow and noindex tags. ● Crawl ratio ⇔ opportunity to improve your links: for a better crawl ratio, rework your links!
  40. 40. #7 - Depth The number of physical clicks from the home page (# clicks from the home page, not # folders in the URL). Expected Results Slow Down Crawl Potential Crawl Budget Waste
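Click depth is the length of the shortest click path from the home page, which a breadth-first search over the internal link graph gives directly. A toy sketch:

```python
from collections import deque

# Toy sketch: depth = shortest number of clicks from the home page,
# computed with a breadth-first search over the internal link graph.
links = {
    "/":             ["/cat/a", "/cat/b"],
    "/cat/a":        ["/product/1"],
    "/cat/b":        ["/product/2"],
    "/product/1":    [],
    "/product/2":    ["/product/deep"],
    "/product/deep": [],
}

def click_depths(links, home="/"):
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:            # first discovery = shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

print(click_depths(links))  # {'/': 0, '/cat/a': 1, '/cat/b': 1, '/product/1': 2, '/product/2': 2, '/product/deep': 3}
```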
  41. 41. CRAWL RATIO VS. AVG. DEPTH IN ANY WEBSITE STRUCTURE From our past Experience From the analysis of the Dataset Confirmation ● We know depth is an SEO / UX problem: ○ Catalog size ○ Faceted navigation ○ Structure pruning to cut low value content Avg. Depth ● Websites with a higher average depth should be less crawled by Google.
  42. 42. #9 - Load Time Expected Results Idle Crawl Huge Impact on Crawl Ratio From a web crawler’s point of view, we consider: the time to first byte (web server responsiveness) + the time to download the page’s HTML source (the DOM).
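That crawler-side definition of load time can be roughly measured with a streamed request: time until the response headers arrive (≈ TTFB), then time to finish downloading the HTML body. A rough sketch assuming the `requests` library and an illustrative URL:

```python
import time
import requests

# Rough approximation of load time as a crawler sees it:
# time to first byte + time to download the HTML source. URL is illustrative.
url = "https://www.example.com/"

start = time.perf_counter()
response = requests.get(url, stream=True, timeout=10)   # returns once headers arrive
ttfb = time.perf_counter() - start

html = response.content                                  # body is downloaded here
total = time.perf_counter() - start

print(f"TTFB: {ttfb * 1000:.0f} ms, TTFB + HTML download: {total * 1000:.0f} ms")
```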
  43. 43. CRAWL RATIO VS. LOAD TIMES IN MILLISECONDS From our past Experience From the analysis of the Dataset ● When we look at all website sizes, load times don’t seem to have a big impact on Google’s crawl. ● With a higher average load time, crawl ratio should decrease. Disturbing fact here Your target
  44. 44. LESS THAN 10K PAGES From the analysis of the Dataset New learnings! ● Small websites ⇔ Low impact of load times ● Big websites ⇔ Huge impact of load times ● With higher average load time, crawl ratio should decrease. Limited impact MORE THAN 10K PAGES From our past Experience Dramatic impact
  45. 45. #10 - Number of Internal Outlinks Either follow or nofollow, to a compliant page or to a not compliant page. Expected Results Quantity is not Quality Impact on crawl ratio when there are too many
  46. 46. CRAWL RATIO VS. NO. OF OUTLINKS PER PAGE CRAWL RATIO VS. NO. OF OUTLINKS TO NOT COMPLIANT PAGES From the analysis of the Dataset No Confirmation From our past Experience ● Google’s crawl doesn’t seem to be impacted by the number of outlinks. ● Fewer outlinks ⇔ Better crawl ratio ● Bad outlinks ⇔ Slightly lower crawl ratio
  47. 47. #11 - Percentage of Content Expected Results A low percentage of “real” content often means heavier pages Heavier pages are more difficult for Google to crawl Percentage of Content REAL Content TEMPLATE Content
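A crude way to estimate the real-content share is to treat words that appear on (nearly) every sampled page as template and everything else as real content. A toy sketch with hypothetical page texts and an arbitrary 90% threshold:

```python
from collections import Counter

# Toy sketch: words present on at least 90% of sampled pages count as template,
# the rest as "real" content. Page texts and threshold are illustrative.
pages = {
    "/p1": "home products contact unique guide about red widgets",
    "/p2": "home products contact blue widget specifications and sizes",
    "/p3": "home products contact shipping returns and warranty details",
}

doc_freq = Counter()
for text in pages.values():
    doc_freq.update(set(text.split()))

template_words = {word for word, df in doc_freq.items() if df >= 0.9 * len(pages)}

for url, text in pages.items():
    words = text.split()
    real = [word for word in words if word not in template_words]
    print(f"{url}: {len(real) / len(words):.0%} real content ({len(real)} words)")
```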
  48. 48. LESS THAN 10K PAGES MORE THAN 1M PAGES From the analysis of the Dataset New learnings! From the analysis of the Dataset Confirmation ● Small websites => Limited impact of the % of content vs. template ● Big websites => Huge impact of the % of content vs. template Limited impact Awesome impact
  49. 49. #12 - Content Size (in words, template excluded) The number of words on a page, excluding the template. Expected Results The more content on average, the more the pages get crawled Yet a limited impact on Google’s crawl
  50. 50. CRAWL RATIO VS. CONTENT SIZE (IN WORDS) From the analysis of the Dataset Confirmation From our past Experience ● Content size impacts Google’s Crawl. ● Websites with more content should be more crawled by Google but we do not expect a very high impact on Google’s crawl. ● Content size is more impactful on Google’s crawl than we expected. New learnings!
  51. 51. From the analysis of the Dataset New learnings! ● Content Size positively impacts Google’s crawl for every size of website. Medium impact LESS THAN 100K PAGES MORE THAN 1M PAGES Good impact Awesome impact BETWEEN 100K AND 1M PAGES
  52. 52. Website size dramatically impacts Google’s crawl ratio. Even small websites are not crawled at 100% by Google. Crawl Budget matters. #1 #2
  53. 53. Content size matters: Build high quality unique content Orphans volume matters: Like water, don’t waste the crawl budget! Structure depth matters: Don’t be afraid, prune the useless branches! #3 #4 #5
  54. 54. OPPORTUNITIES : 1. Analyze a larger set of websites (iterate on the dataset) 2. Extend the duration of the study (6 months, 12 months, 24 months) 3. Increase the list of KPIs to test (nofollow, noindex, etc.) 4. Cross even more SEO KPIs 5. Extend the data to Keywords (impressions, positions, clicks, etc.) 6. Consider seasonality in the analysis (trending topics, breakout topics, etc.)
  55. 55. 15%: Googlebot Smartphone’s current percentage of Google’s crawl.
  56. 56. GOOGLEBOT(S) DISTRIBUTION GOOGLEBOT(S) DISTRIBUTION BY INDUSTRY DESKTOP MOBILE From our past Experience: ● Googlebot desktop is still very present. From the analysis of the Dataset: ● Rolling out the Mobile-First Index takes a lot of time.
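The desktop/mobile split can be read straight from the user-agent strings in the logs, since Googlebot Smartphone identifies itself with a mobile token. A hypothetical classification sketch (sample user agents are illustrative):

```python
from collections import Counter

# Hypothetical sketch: split Googlebot hits into desktop vs. smartphone
# based on the user-agent string found in the access logs.
user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

def classify(ua: str) -> str:
    if "Googlebot" not in ua:
        return "other"
    return "googlebot-smartphone" if "Mobile" in ua or "Android" in ua else "googlebot-desktop"

print(Counter(classify(ua) for ua in user_agents))
# Counter({'googlebot-desktop': 1, 'googlebot-smartphone': 1})
```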
  57. 57. Book a Demo to learn what Botify can do for you:
