Learn brand new insights about how Google crawls the web! Dive into the takeaways of this Botify study based on the analysis of 413 million pages crawled by Botify and 6 billion Googlebot requests.
How does Google crawl the web? - Botify at SMX Paris 2018
2. Dimitri Brunel
Search Data Strategist @botify.com
- SEO manager in Marketing Agencies & Pure Players (previously)
- Currently part of the Botify Search Data Strategist team
How does Google crawl the web?
Shedding light on common misconceptions and theories about Googlebot and SEO
Alpha Keita
Search Data Strategist @botify.com
- Onboarding & API training Manager @Botify
- Currently part of the Botify Search Data Strategist team
8. Add a scientific approach to the empirical one for detailed and data-backed insights.
Scale up the dataset from a single website to a full set of websites from different industries.
Confirm or invalidate assumptions about Google’s behavior and discover new behaviors.
Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience.
Improved SEO Methodology
Real Insights
More Precise Analysis
Share a Belief
9. Crawl Ratio: percentage of compliant pages (indexable pages) crawled by Google in 30 days.
Crawl Frequency: average number of times a website’s URL was crawled by Google in 30 days.
Compliant URL: a URL that meets the following requirements:
● Canonical tag to self or not set
● HTTP 200 Status Code
● Text/HTML Content
● Index Status (no noindex meta tag)
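The four requirements above can be read as a single yes/no check per URL. The following is a minimal sketch of that check. The `Page` record, its field names, and the `is_compliant` function are hypothetical illustrations, not Botify's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical record for one crawled URL (field names are assumptions)."""
    url: str
    status_code: int
    content_type: str
    canonical: Optional[str] = None  # None means no canonical tag is set
    noindex: bool = False


def is_compliant(page: Page) -> bool:
    """Return True when the page meets all four requirements listed above."""
    # Canonical tag points to the page itself, or is not set at all.
    canonical_ok = page.canonical is None or page.canonical == page.url
    # Ignore parameters such as "; charset=utf-8" when comparing the MIME type.
    is_html = page.content_type.split(";")[0].strip().lower() == "text/html"
    return canonical_ok and page.status_code == 200 and is_html and not page.noindex
```

A page fails the check as soon as one requirement is not met, for example a 301 response or a canonical tag pointing to another URL.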
11. CRAWL RATIO: 49%
Percentage of compliant pages in the website’s structure crawled by Google in 30 days.
ACTIVE PAGES RATIO: 23%
Percentage of pages that have generated at least one organic visit in 30 days.
CRAWL FREQUENCY: 2.3
Average number of times a website’s URL was crawled by Google in 30 days.
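As a rough illustration, the three KPIs can be computed from a list of compliant URLs, per-URL Googlebot hit counts from 30 days of logs, and per-URL organic visit counts. This is a sketch under assumptions: the function name and input shapes are hypothetical, the active pages ratio is taken over compliant pages, and the crawl frequency is averaged over URLs crawled at least once. The slides do not specify either denominator.

```python
from typing import Dict, Set


def crawl_kpis(compliant_urls: Set[str],
               googlebot_hits: Dict[str, int],
               organic_visits: Dict[str, int]) -> Dict[str, float]:
    """Compute crawl ratio, active pages ratio and crawl frequency for one site.

    googlebot_hits and organic_visits map a URL to its count over the
    30-day window; URLs missing from a mapping count as zero.
    """
    total = len(compliant_urls)
    if total == 0:
        return {"crawl_ratio": 0.0, "active_pages_ratio": 0.0, "crawl_frequency": 0.0}
    crawled = [u for u in compliant_urls if googlebot_hits.get(u, 0) > 0]
    active = [u for u in compliant_urls if organic_visits.get(u, 0) > 0]
    hits = sum(googlebot_hits[u] for u in crawled)
    return {
        "crawl_ratio": len(crawled) / total,
        "active_pages_ratio": len(active) / total,
        # Assumption: averaged over URLs that Googlebot requested at least once.
        "crawl_frequency": hits / len(crawled) if crawled else 0.0,
    }
```

Hits on URLs outside `compliant_urls` are ignored here, since the crawl ratio is defined on compliant pages in the website's structure.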
12. A website’s size is one
of the most important
factors impacting
Google’s crawl.
13. The larger the website, the greater the impact:
● Orphan Pages
● Load Time
● % of Words vs. Template
Some KPIs, like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but a huge impact on big websites.
Huge impact regardless of size:
● PageRank
● Depth
● Content Size
Other KPIs, like PageRank dilution, depth, or, surprisingly, content size, have a big impact on Google’s crawl regardless of website size.
14. The data showed that
bad HTTP Codes only
had a small impact on
Google’s crawl.
16. Data from Botify Analytics.
Data from log files from the same websites.
Data calculated with 30 days of logs.
Websites that fall in one of the following industries: Retail, Publisher, Classified.
17. 270 websites.
413 million pages crawled and analyzed by Botify.
6.2 billion pages crawled by Google and analyzed by Botify.
*15% of the data crawled and analyzed each month by Botify.
24. CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY
● Googlebot crawls the web impartially.
● Googlebot crawls impartially regardless of
industry.
● Publishers tend to have more active pages (in %).
From our past Experience
From the analysis of the Dataset
Confirmation
25. CRAWL FREQUENCY BY INDUSTRY From the analysis of the Dataset
New learnings!
27. CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE
From our past Experience
From the analysis of the Dataset
Confirmation
● More pages means more difficulties for
Googlebot.
● More pages means fewer active pages in the
SERPs (in %).
● Small websites are better crawled by Google
but still not crawled entirely.
● Big websites have a harder time effectively
using Crawl Budget.
28. CRAWL FREQUENCY BY WEBSITE SIZE
From the analysis of the Dataset
Confirmation
Big websites tend to have more long tail pages
that will be less frequently crawled by Google.
Good news: this can be influenced with crawl
budget optimization.
30. #3 - Not Compliant Pages
At last a composite indicator. A page is not compliant when it has (and/or):
● Canonical tag not set to self
● Content that is neither text nor HTML
● Noindex status
● HTTP status code other than 200
Risk
● Pages that are poorly indexable from a technical point of view.
● Sends a bad crawl signal to web spiders.
Expected Results
● Weakest Crawl Ratio
● Weak Indexation
● Lower Crawl Frequency
31. COMPLIANT PAGES CRAWLED BY BOTIFY VS. NOT COMPLIANT PAGES CRAWLED BY
BOTIFY
From our past Experience
From the analysis of the Dataset
Confirmation
● The proportion of not compliant pages (37%)
is still too high compared with the goal of total
indexability (100% of compliant pages).
● The overall average shows that SEOs still have
room for optimization.
● From our past experience we see that many websites
still face this problem, usually because of:
○ Extensive use of noindex
○ Server errors
○ Incorrect canonical annotations
413M pages crawled
32. CRAWLED COMPLIANT PAGES VS. CRAWLED NOT COMPLIANT PAGES
From our past Experience
From the analysis of the Dataset
Confirmation
● As most websites have a huge proportion of
not compliant pages, Google is on average
wasting 16% of its time crawling these useless
pages, when it could focus instead on more
interesting pages for searchers.
● Google is wasting time crawling not
compliant pages.
33. CRAWL RATIO VS. % OF NOT COMPLIANT PAGES CRAWLED BY GOOGLE
From our past Experience
From the analysis of the Dataset
Confirmation
● When the proportion of not compliant pages
crawled by Google increases, the crawl ratio
decreases.
● We expect that having more not compliant
pages crawled by Google will have a negative
impact on the compliant page’s crawl ratios.
34. LESS THAN 100K PAGES MORE THAN 100K PAGES
From the analysis of the Dataset
Confirmation
● Low impact on small websites but huge on medium sites.
36. HTTP CODES DISTRIBUTION From our past Experience
From the analysis of the Dataset
New Learnings!
● The overall situation is quite good (code 200).
● The code 304 is truly underused.
304 is not commonly used by SEOs
● From our experience, we see many problems
related to bad HTTP codes:
○ Temporary redirects, redirect chains, redirect loops
○ Client errors, server errors...
37. CRAWL RATIO VS. CRAWL SHARE IN % ON BAD HTTP CODES From our past Experience
From the analysis of the Dataset
New Learnings!
● We don’t see a huge impact on crawl ratio.
● Potential reason: most bad HTTP codes in the
dataset are 3xx. These don’t consume much
crawl budget.
● We could expect bad HTTP status codes to
have a big impact on Google’s crawl ratio.
38. #5 - Orphan Pages
Pages:
● that are outside of the website structure,
● that we did not discover,
● that Google crawled,
● that received crawl budget.
Expected Results
● Cannibalization of crawl budget
● Lowering the crawl ratio of the site structure
Diagram: pages crawled by Botify, pages crawled by Google, and pages crawled by Google AND Botify.
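The orphan-page definition above amounts to a set difference between the URLs Google requested and the URLs found in the site structure. The sketch below shows that difference and the share of Googlebot requests it absorbs. The function names and input shapes are hypothetical, chosen for illustration only.

```python
from typing import Dict, Set


def orphan_urls(structure_urls: Set[str], google_urls: Set[str]) -> Set[str]:
    """URLs crawled by Google that the site crawl did not discover."""
    return google_urls - structure_urls


def orphan_crawl_share(structure_urls: Set[str], googlebot_hits: Dict[str, int]) -> float:
    """Share of Googlebot requests spent on URLs outside the site structure."""
    total = sum(googlebot_hits.values())
    if total == 0:
        return 0.0
    orphan_hits = sum(n for url, n in googlebot_hits.items()
                      if url not in structure_urls)
    return orphan_hits / total
```

For example, with 6 hits on a structure page and 2 hits on an old URL that is no longer linked, the orphan share of the crawl is 2 / 8 = 0.25.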
39. CRAWL VOLUME ON STRUCTURE PAGES VS. CRAWL VOLUME ON ORPHAN PAGES From our past Experience
From the analysis of the Dataset
Confirmation
● On avg. orphan pages steal ¼ of the crawl.
● We see a lot of orphan URLs.
● Common reasons:
○ Old implementations or technical regressions
○ No DNS cleaning
40. CRAWL RATIO VS. % OF ORPHAN PAGES CRAWLED BY GOOGLE From our past Experience
From the analysis of the Dataset
Confirmation
● Orphan pages tend to cannibalize crawl
budget and impact the crawl ratio of the
structural pages.
● As the percentage of orphan pages
increases, the crawl ratio should be
negatively impacted.
Few orphans
= Better crawl ratio
More orphans
= Lower crawl ratio
41. LESS THAN 100K PAGES MORE THAN 100K PAGES
From the analysis of the Dataset
New learnings!
From our past Experience
● This is very true on big and gigantic websites
only.
● Crawl budget cannibalization whatever the
size of the website.
43. #6 - Internal PageRank
● The popularity spread into the website’s internal structure.
● A strong crawl signal that is supposed to steer Googlebot(s).
Expected Results
Diluting the Internal PageRank on Not Compliant Pages Should Positively Impact Google’s Crawl Ratio on Compliant Pages
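To make the idea of popularity spreading through internal links concrete, here is a sketch of the classic PageRank power iteration applied to an internal link graph, plus the share of that PageRank held by compliant pages. It illustrates the general technique only. Botify's own internal PageRank computation is not described in the slides, and the function names, damping factor, and iteration count are assumptions.

```python
from typing import Dict, List, Set


def internal_pagerank(links: Dict[str, List[str]], damping: float = 0.85,
                      iterations: int = 50) -> Dict[str, float]:
    """Classic PageRank by power iteration over internal links (page -> targets)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def compliant_pagerank_share(rank: Dict[str, float], compliant: Set[str]) -> float:
    """Fraction of total internal PageRank held by compliant pages."""
    total = sum(rank.values())
    if total == 0:
        return 0.0
    return sum(v for p, v in rank.items() if p in compliant) / total
```

Links pointing to not compliant pages lower `compliant_pagerank_share`, which is the quantity compared against crawl ratio in this part of the study.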
44. CRAWL RATIO VS. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES From our past Experience
From the analysis of the Dataset
Confirmation
● If compliant pages get more PageRank, their
crawl ratio should improve.
Pro Tips:
● Don’t waste PR with nofollow and noindex tags.
● Crawl ratio ⇔ opportunity to improve your links.
better crawl ratio
=
rework your links!
45. #7 - Depth
The number of physical clicks from the home page.
Expected Results
● Slow Down Crawl
● Potential Crawl Budget Waste
# Folders Depth # Clicks from the Home Page
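Depth measured in clicks from the home page is the shortest-path distance in the internal link graph, which a breadth-first search computes directly. The sketch below assumes the link graph is available as a mapping from each page to the pages it links to. The function name and input shape are hypothetical.

```python
from collections import deque
from typing import Dict, List


def click_depths(links: Dict[str, List[str]], home: str) -> Dict[str, int]:
    """Minimum number of clicks needed to reach each page from the home page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Pages that never appear in the result are unreachable from the home page through internal links. Note that click depth can differ from the number of folders in the URL: a page at `/cat/p2` sits two folders deep but may be three clicks away.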
46. CRAWL RATIO VS. AVG. DEPTH IN ANY WEBSITE STRUCTURE
From our past Experience
From the analysis of the Dataset
Confirmation
● We know depth is an SEO / UX problem:
○ Catalog size
○ Faceted navigation
○ Structure pruning to cut low value content
Avg. Depth
● Websites with a higher average depth
should be less crawled by Google.
47. #9 - Load Time
Expected Results
● Idle Crawl
● Huge Impact on Crawl Ratio
From a web crawler’s point of view, we consider:
- The time to first byte (webserver responsiveness) +
- The time to download the page HTML source (the DOM).
48. CRAWL RATIO VS. LOAD TIMES IN MILLISECONDS From our past Experience
From the analysis of the Dataset
● When we look at all website sizes, load
times don’t seem to have a big impact on
Google’s crawl.
● With higher average load time, crawl ratio
should decrease.
Disturbing
fact here
Your
target
49. LESS THAN 10K PAGES
From the analysis of the Dataset
New learnings!
● Small websites ⇔ Low impact of load times
● Big websites ⇔ Huge impact of load times
● With higher average load time, crawl ratio
should decrease.
Limited
impact
MORE THAN 10K PAGES
From our past Experience
Dramatic
impact
50. #10 - Number of Internal Outlinks
Expected Results
Quantity is not Quality
Impact on crawl ratio
when too many
Either
Follow Nofollow
To a
compliant page
To a not
compliant page
51. CRAWL RATIO VS. NO. OF OUTLINKS PER PAGE CRAWL RATIO VS. NO. OF OUTLINKS TO NOT COMPLIANT PAGES
From the analysis of the Dataset
No Confirmation
From our past Experience
● Google’s crawl doesn’t seem to be impacted
by the number of outlinks.
● Fewer outlinks ⇔ Better crawl ratio
● Bad outlinks ⇔ Slightly decreased crawl
ratio
52. #11 - Percentage of Content
Expected Results
Low percentage of “real” content
often means heavier pages
Heavier pages are more
difficult to crawl for
Google
Percentage
of
Content
REAL Content
TEMPLATE Content
53. LESS THAN 10K PAGES MORE THAN 1M PAGES
From the analysis of the Dataset
New learnings!
From the analysis of the Dataset
Confirmation
● Small websites => Limited impact of the % of
content vs. template
● Big websites => Huge impact of the % of
content vs. template
Limited
impact
Awesome
impact
54. #12 - Content Size (in words not ignored)
The number of words
on a page, excluding
the template.
Expected Results
The more content on
average, the more crawled
Yet a limited impact on
Google crawl
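Both content KPIs in this part of the deck (content size in words excluding the template, and the percentage of real content vs. template) depend on separating template text from page-specific text. The sketch below uses a simple heuristic: a text block that appears on more than half of the pages is treated as template. That heuristic, the threshold, and all names are assumptions for illustration. The slides do not say how the template is detected.

```python
from collections import Counter
from typing import Dict, List


def content_metrics(pages: Dict[str, List[str]],
                    template_threshold: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Per-URL content size (words outside the template) and content percentage.

    pages maps a URL to the list of text blocks found on that page.
    """
    # Count on how many pages each distinct text block appears.
    block_pages: Counter = Counter()
    for blocks in pages.values():
        block_pages.update(set(blocks))
    n = len(pages)
    result = {}
    for url, blocks in pages.items():
        total = sum(len(b.split()) for b in blocks)
        real = sum(len(b.split()) for b in blocks
                   if block_pages[b] / n <= template_threshold)
        result[url] = {
            "content_size": real,
            "content_pct": real / total if total else 0.0,
        }
    return result
```

With a shared three-word navigation block and a four-word product description, a page gets a content size of 4 words and a content percentage of 4 / 7.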
55. CRAWL RATIO VS. CONTENT SIZE (IN WORDS)
From the analysis of the Dataset
Confirmation
From our past Experience
● Content size impacts Google’s Crawl.
● Websites with more content should be more
crawled by Google but we do not expect a
very high impact on Google’s crawl.
● Content size is more impactful on Google’s
crawl than we expected.
New learnings!
56. From the analysis of the Dataset
New learnings!
● Content Size positively impacts Google’s crawl for every size of website.
Medium
impact
LESS THAN 100K PAGES MORE THAN 1M PAGES
Good
impact
Awesome
impact
BETWEEN 100K AND 1M PAGES
59. Content size
matters:
Build high quality
unique content
Orphans volume
matters:
Like water, don’t
waste the crawl
budget!
Structure depth
matters:
Don’t be afraid, prune
the useless branches!
#3 #4 #5
61. OPPORTUNITIES:
1. Analyze a larger set of websites (iterate on the dataset)
2. Extend the duration of the study (6 months, 12 months, 24 months)
3. Increase the list of KPIs to test (nofollow, noindex, etc.)
4. Cross even more SEO KPIs
5. Extend the data to Keywords (impressions, positions, clicks, etc.)
6. Consider seasonality in the analysis (trending topics, breakout topics, etc.)
64. GOOGLEBOT(S) DISTRIBUTION BY INDUSTRY
GOOGLEBOT(S) DISTRIBUTION
DESKTOP
MOBILE
From the analysis of the Dataset
From our past Experience
● Rolling out the Mobile-First Index takes a lot of time.
● Googlebot desktop is still very present.