How does Google
crawl the web?
A study based on the analysis of 413 million
pages crawled by Botify and 6 billion
Googlebot requests.
Today’s Presenters
Annabelle Bouard, Search Data Strategist @botify.com
Dimitri Brunel, Search Data Strategist @botify.com
@botify - #BotifyWebinar
Today’s Agenda
Goal: Better understand Google's behavior with a scientific approach.
● Methodology & Definitions
● The 1st study of its kind based on real customer data globally
● We put REAL figures on what SEOs have known but could not prove
● Insights from our study
● How can you go further?
Methodology &
Definitions
Improved SEO Methodology: Scale up the dataset from a single website to a full set of websites from different industries.
More Precise Analysis: Add a scientific approach to the empirical one for detailed, data-backed insights.
Real Insights: Confirm or invalidate known Google behaviors and discover new ones.
Share a Belief: Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience.
Definitions for today’s session
Compliant URL
URLs that meet the following criteria:
● HTTP 200 Status Code
● Self Referential Canonical
● No noindex Tag
● HTML Content Type
Crawl Ratio
Percentage of compliant pages (indexable pages) crawled by Google in 30 days.
Crawl Frequency
Average number of times a website’s URL was crawled by Google in 30 days.
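As a rough illustration of these three definitions, the sketch below computes them from a crawl export and a list of Googlebot hits taken from 30 days of logs. The `Page` record and its field names are hypothetical, and treating a page without a canonical tag as self-referential is an assumption made for the example. Crawl Frequency is computed here as Googlebot hits divided by distinct URLs hit, which is only one possible reading of the definition; none of this is Botify's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Page:
    """One URL found by a site crawl (hypothetical record layout)."""
    url: str
    status: int
    canonical: Optional[str]  # None when the page has no canonical tag
    noindex: bool
    content_type: str


def is_compliant(page: Page) -> bool:
    """Apply the four compliance criteria listed on the slide."""
    # Assumption: a missing canonical tag counts as self-referential.
    self_canonical = page.canonical is None or page.canonical == page.url
    return (
        page.status == 200
        and self_canonical
        and not page.noindex
        and page.content_type.lower().startswith("text/html")
    )


def crawl_ratio(pages: Iterable[Page], googlebot_urls: Iterable[str]) -> float:
    """Percentage of compliant pages that Googlebot requested at least once."""
    compliant = {p.url for p in pages if is_compliant(p)}
    if not compliant:
        return 0.0
    crawled = compliant & set(googlebot_urls)
    return 100.0 * len(crawled) / len(compliant)


def crawl_frequency(googlebot_urls: Iterable[str]) -> float:
    """Average number of Googlebot hits per distinct URL hit (assumed reading)."""
    hits = list(googlebot_urls)
    if not hits:
        return 0.0
    return len(hits) / len(set(hits))
```

Both functions expect `googlebot_urls` to contain one entry per Googlebot request, so a URL requested three times appears three times.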
The 1st study of its
kind based on real
customer data
globally
* Anonymized data
We Looked at a Massive Amount of Data
270 websites that fall in one of the following industries: Retail, Publisher, Classifieds.
413M pages analyzed. Data from Botify Analytics and from log files for these websites.
6.2B Googlebot requests analyzed. 30 days of web server log files for each website.
We also looked at websites of all shapes and sizes
[Charts: Industries Analyzed; Dataset by Website Size (in pages)]
And we looked at a lot of different metrics
Type of Website
● Industry
● Size
Types of Pages
● Non-Compliant Pages
● Orphan Pages
Structure & Content
● PageRank
● Page Depth
● Load Times
● No. of Outlinks
● Template weight
● Content Size
We put REAL
figures on what
SEOs have
known but could
not prove
A website’s size is
one of the most
important factors
impacting Google’s
crawl.
Website size matters
KPIs are impacted by website size
The larger the website, the greater the impact: Orphan Pages, Load Time, % of Words vs. Template
Some KPIs like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but have a huge impact on big websites.
Huge impact regardless of size: PageRank, Depth, Content Size
Other KPIs like the PageRank dilution, depth, or surprisingly content size, have a big impact on Google’s crawl, regardless of website size.
Insights from our study
What elements really impact Google’s crawl?
WEBSITE
1. Industry ✘
2. Size ✓
TYPE OF PAGES
3. Non-Compliant Pages ✓
4. Orphan Pages ✓
STRUCTURAL KPIS
5. PageRank ✓
6. Depth ✓
7. Load Times ✓
8. No. of Outlinks ✘
9. Template Weight ✓
10. Content Size ✓
Website Size
& Industry
#1 - Industry
Expected Results
Similar Crawl Rate
Different Crawl Frequency
depending on Industries
CLASSIFIEDS PUBLISHER RETAILER
CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY
● Googlebot crawls the web impartially.
● Googlebot crawls impartially regardless of
industry.
● Publishers tend to have more active pages (in
%).
Google’s crawl ratio is not influenced by industry
From our Experience
Confirmation
From the analysis of the Dataset
What's your crawl ratio and active pages ratio?
CRAWL FREQUENCY BY INDUSTRY
Publishers are crawled 45% more frequently than other industries
Publishers might be crawled more frequently
because of fresher and higher quality content.
From the analysis of the Dataset
New learnings!
What's your crawl frequency?
#2 - Website Size
Expected Results
Decreasing Crawl Ratio
Adaptive Crawl Frequency
< 10K PAGES
> 10K PAGES
> 100K PAGES
> 1 MILLION PAGES
Confirmation
CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE
From our Experience
From the analysis of the Dataset
Websites with >1M pages: Crawl Ratio is significantly
lower than average
● More pages means more difficulties for
Googlebot.
● More pages means fewer active pages in the
SERPs (in %).
● Small websites are better crawled by Google
but still not entirely.
● Large websites have a harder time using
Crawl Budget efficiently.
Confirmation
CRAWL FREQUENCY BY WEBSITE SIZE
From the analysis of the Dataset
Pages on large websites with >1M pages are less frequently
crawled than the average
Large websites tend to have more long tail
pages that will be crawled by Google less
frequently.
Good news: this can be influenced with crawl
budget optimization.
Type of Pages
#3 - Non-Compliant Pages
At last a composite indicator. A page is non-compliant when it has one or more of the following:
● HTTP status code other than 200
● NoIndex status
● Canonical tag not pointing to self
● Content type other than Text/HTML
Risk
● Poor indexability from a technical point of view.
● Negative signals for web spiders.
Expected Results
Lowest Crawl Ratio
Low Indexation
Lower Crawl Frequency
Confirmation
COMPLIANT PAGES CRAWLED BY BOTIFY vs. NON-COMPLIANT PAGES CRAWLED BY BOTIFY
From our Experience
From the analysis of the Dataset
37% of pages crawled by Botify were non-compliant pages
● The proportion of Non-Compliant pages (37%) is still too high compared
with the goal of total indexability (100% compliant pages).
● The overall average shows that SEO managers
still have room for improvement.
● From our experience we see that many websites still
face this problem, usually because of:
○ Extensive use of Noindex,
○ Server Errors,
○ Canonical annotations.
413M pages crawled
Confirmation
CRAWLED COMPLIANT PAGES vs. CRAWLED NON-COMPLIANT PAGES
From our Experience
From the analysis of the Dataset
Google is wasting (at least) 16% of its time and resources
crawling non-compliant pages
● As most websites have a huge proportion of
Non-Compliant Pages, Google is on average
wasting 16% of its time crawling these useless
pages, when it could focus on valuable pages
for SEO traffic.
● Google is wasting time crawling
Non-Compliant Pages.
CRAWL RATIO vs. % OF NON-COMPLIANT PAGES CRAWLED BY GOOGLE
From our Experience
From the analysis of the Dataset
Confirmation
The higher the share of Non-Compliant pages crawled, the
lower the Crawl Ratio
● When the proportion of Non-Compliant Pages
crawled by Google increases, the Crawl Ratio
decreases.
● We expect that having more Non-Compliant
Pages crawled by Google will have a negative
impact on the Crawl Ratio of Compliant Pages.
LESS THAN 100K PAGES MORE THAN 100K PAGES
For large websites over 100K pages, Crawl Ratio is strongly
impacted by the number of Non-Compliant pages
From the analysis of the Dataset
Confirmation
● Low impact on small websites but huge on medium ones.
What's your compliant pages ratio?
Globally and among pages crawled by Google?
#4 - Orphan Pages
Pages:
● that are outside of the website structure,
● that we did not discover,
● that Google crawled,
● that receive crawl budget.
Expected Results
Cannibalization of crawl budget
Lowering the crawl ratio of the site structure
[Diagram: pages crawled by Botify, pages crawled by Google, and pages crawled by Google AND Botify]
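The three regions of this diagram can be sketched with plain set operations. The function and key names below are illustrative assumptions. The inputs are taken to be the set of URLs found by the site crawl (Botify) and the set of URLs requested by Googlebot in the logs.

```python
from typing import Dict, Iterable, Set


def crawl_overlap(botify_urls: Iterable[str], google_urls: Iterable[str]) -> Dict[str, Set[str]]:
    """Split URLs into the three regions of the Botify/Google diagram."""
    structure = set(botify_urls)  # URLs reachable through the site structure
    crawled = set(google_urls)    # URLs Googlebot requested
    return {
        "in_structure_not_crawled": structure - crawled,
        "in_structure_and_crawled": structure & crawled,
        "orphans": crawled - structure,  # crawled by Google, absent from the structure
    }
```

For example, `crawl_overlap(["/a", "/b"], ["/b", "/old"])["orphans"]` yields `{"/old"}`.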
Confirmation
CRAWL VOLUME ON PAGES IN STRUCTURE vs. CRAWL VOLUME ON ORPHAN PAGES
From our Experience
From the analysis of the Dataset
● On average, Orphan Pages steal ¼ of the crawl.
Orphan pages represent 26% of Google’s crawl
● From our experience, we always see many Orphan URLs.
● Common reasons:
○ Old implementations, technical regressions,
○ No DNS cleaning, etc.
Confirmation
CRAWL RATIO vs. % OF ORPHAN PAGES CRAWLED BY GOOGLE
From our Experience
From the analysis of the Dataset
Too many orphan pages negatively impact the way Google
crawls your site
● These pages tend to cannibalize precious crawl
budget and lower the Crawl Ratio of pages in
the structure, which no longer benefit from 100% of the
crawl budget.
● As the percentage of Orphan Pages
increases, the Crawl Ratios should be
negatively impacted.
Let’s dig deeper into the data, once again!
Few orphans
= Higher crawl ratio
More orphans
= Lower crawl ratio
New learnings!
LESS THAN 100K PAGES MORE THAN 100K PAGES
From the analysis of the Dataset
From our Experience
● This is true on Large and Very Large websites
only.
● Crawl budget cannibalization occurs whatever the
size of the website.
Especially for large websites where Crawl Ratio is badly impacted
What proportion of Google's crawl goes to orphan pages
on your site?
54.57M / (2.86M + 54.57M) = 95%
% among number of URLs
crawled by Botify, here 10M
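The ratio on this slide can be reproduced in a few lines. The sketch assumes that 54.57M is the number of orphan URLs crawled by Google and that 2.86M is the number of in-structure URLs crawled by Google. That reading is inferred from the slide, and the function name is hypothetical.

```python
def orphan_share(orphan_count: float, structure_count: float) -> float:
    """Percentage of Google-crawled URLs that are orphan pages."""
    total = orphan_count + structure_count
    if total == 0:
        return 0.0
    return 100.0 * orphan_count / total


# Figures from the slide, under the reading described above.
share = orphan_share(54.57e6, 2.86e6)
print(f"{share:.0f}%")  # prints "95%"
```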
Website Structure
#5 - Internal PageRank
Expected Results
Diluting the Internal PageRank on Non-Compliant Pages should
negatively impact Google’s Crawl Ratio on Compliant Pages
How popularity is
distributed within
the website's
internal structure
A strong crawl
signal supposed
to guide
Googlebot(s)
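To make "how popularity is distributed within the internal structure" concrete, here is a minimal sketch that runs the classic PageRank power iteration over an internal link graph and then measures the share that lands on compliant pages. It is a textbook formulation, not Botify's Internal PageRank computation. The damping factor, iteration count and function names are assumptions chosen for the example.

```python
from typing import Dict, Iterable, Mapping, Set


def internal_pagerank(
    links: Mapping[str, Iterable[str]],
    damping: float = 0.85,   # assumed value, the usual textbook default
    iterations: int = 50,    # assumed fixed number of power-iteration steps
) -> Dict[str, float]:
    """Classic PageRank over internal links; scores sum to 1."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = list(links.get(page, ()))
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its score evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def pagerank_share_on_compliant(rank: Mapping[str, float], compliant: Set[str]) -> float:
    """Percentage of total internal PageRank held by compliant pages."""
    total = sum(rank.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(score for page, score in rank.items() if page in compliant) / total
```

A low value from `pagerank_share_on_compliant` would indicate that internal links send a large part of the site's popularity to pages that cannot be indexed.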
Confirmation
CRAWL RATIO vs. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES
From our Experience
From the analysis of the Dataset
The more internal PageRank is concentrated on compliant
pages, the better the Crawl Ratio
● If Compliant Pages get more Internal
PageRank, their Crawl Ratio should improve.
Pro Tips:
● Don’t waste PR with Nofollow and Noindex tags.
● Crawl Ratio ⇔ opportunity to improve your links.
Higher crawl ratio
→ Optimize your
linking!
What's your internal PR on compliant and non-compliant pages?
(bottom of page)
#6 - Depth
Minimum number of clicks to reach the page from the Home page
Expected Results
Slow down the Crawl
Lower Crawl Ratio
# Folders Depth # Clicks from the Home Page
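Depth as defined here (minimum number of clicks from the home page, not the number of folders in the URL) is a shortest-path distance, which a breadth-first search computes directly. The sketch below assumes the internal links are available as a mapping from each URL to the URLs it links to. The function name is illustrative.

```python
from collections import deque
from typing import Dict, Iterable, Mapping


def page_depths(links: Mapping[str, Iterable[str]], home: str) -> Dict[str, int]:
    """Minimum number of clicks needed to reach each page from the home page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Pages that never appear in the result are unreachable from the home page through internal links.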
Confirmation
CRAWL RATIO vs. AVG. DEPTH IN ANY WEBSITE STRUCTURE
From our Experience
From the analysis of the Dataset
The depth of a page greatly impacts its chances of being
crawled by Google
● We've known for ages that depth is an SEO /
UX problem:
○ Catalog size,
○ Efficient faceted navigation,
○ Structure pruning to remove useless pages.
● Websites with a higher average Depth
should be less crawled by Google.
What's your average page depth for compliant pages?
#7 - Load Times
Expected Results
Idle Crawl
Huge Impact on Crawl
Ratio
From a web crawler's “point of view”, Load Time is the sum of:
- The Time to first byte (web server responsiveness)
- The Time to download the page HTML source (last byte).
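A small sketch of this definition: load time is the time to first byte plus the HTML download time, averaged over a set of pages. The function names and the millisecond inputs are assumptions. How the two timings are measured is left outside the example.

```python
from statistics import mean
from typing import Iterable, Tuple


def load_time_ms(ttfb_ms: float, download_ms: float) -> float:
    """Load time from the crawler's point of view: TTFB + HTML download time."""
    return ttfb_ms + download_ms


def average_load_time_ms(timings: Iterable[Tuple[float, float]]) -> float:
    """Average load time over (ttfb_ms, download_ms) pairs; 0.0 when empty."""
    totals = [load_time_ms(ttfb, download) for ttfb, download in timings]
    return mean(totals) if totals else 0.0
```

Restricting `timings` to compliant pages gives the "average load time for compliant pages" figure.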
CRAWL RATIO vs. LOAD TIMES IN MILLISECONDS
From our Experience
From the analysis of the Dataset
Can we really believe that Load Times don’t impact Google’s crawl?
● When looking at websites of all sizes, Load
Times don’t seem to have any significant
impact on Google’s crawl.
● With higher average Load Times, the Crawl
Ratios should decrease.
Can we dig deeper into the data to clarify?
Disturbing
fact here
Your
target
New learnings!
LESS THAN 10K PAGES
From the analysis of the Dataset
For large websites, Load Times are definitely impacting
Google’s crawl
● Small websites ⇔ Low impact of Load Times
● Big websites ⇔ Huge impact of Load Times
● With higher average Load Times, the Crawl
Ratios should decrease.
Limited
impact
MORE THAN 10K PAGES
From our Experience
Dramatic
impact
What's your average load time for compliant pages?
#8 - Number of Internal Outlinks
Expected Results
Quantity is not Quality
Impact on Crawl Ratio when too many
Either Follow or NoFollow
To a Compliant page or to a Non-Compliant page
No Confirmation
CRAWL RATIO vs. NO. OF OUTLINKS PER PAGE
CRAWL RATIO vs. NO. OF OUTLINKS TO NON-COMPLIANT PAGES
From the analysis of the Dataset
From our Experience
Too many Outlinks don't really impact Google’s crawl
● Google’s crawl doesn’t seem to be very
impacted by the number of Outlinks.
● Fewer outlinks ⇔ Better Crawl Ratio
● Bad outlinks ⇔ Slightly decrease the Crawl
Ratio
Page Content
#9 - Percentage of Content
Expected Results
Low percentage of “real” content
often means heavier pages
More difficult for Google to crawl
Percentage of Content: REAL Content vs. TEMPLATE Content
New learnings!
Confirmation
LESS THAN 10K PAGES MORE THAN 1M PAGES
From the analysis of the Dataset
From the analysis of the Dataset
Heavy templates have a huge impact on large websites’ Crawl
Ratio
● Small websites => Limited impact of the % of
Content VS. Template
● Big websites => Huge impact of the % of
Content VS. Template
Potential reason:
Pages with a low % of Content VS. Template tend to be heavier and therefore have slow Load Times.
We have already seen that large websites are highly impacted by Load Times.
Limited
impact
Significant
impact
What's your average template weight?
#10 - Content Size
The number of words
in page, excluding
template.
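One way to picture Content Size and the % of Content vs. Template from #9 is the heuristic below. It is purely illustrative: it treats any text line that appears on at least 90% of pages as template, which is an assumption made for this sketch and not how Botify separates template from content. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def template_lines(pages: Dict[str, List[str]], min_share: float = 0.9) -> Set[str]:
    """Lines present on at least `min_share` of pages (assumed template heuristic)."""
    counts = Counter(line for lines in pages.values() for line in set(lines))
    threshold = min_share * len(pages)
    return {line for line, seen_on in counts.items() if seen_on >= threshold}


def content_metrics(lines: List[str], template: Set[str]) -> Dict[str, float]:
    """Content Size (words outside the template) and % of Content vs. Template."""
    total_words = sum(len(line.split()) for line in lines)
    content_words = sum(len(line.split()) for line in lines if line not in template)
    percent = 100.0 * content_words / total_words if total_words else 0.0
    return {"content_size": content_words, "percent_content": percent}
```

With two pages sharing the navigation line `"Home Shop Contact"`, a page whose only other line is `"Red shoes on sale today"` gets a Content Size of 5 words and 62.5% content.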
Expected Results
The more content on average,
the more a website is crawled
Yet a limited impact on
Google’s crawl
New learnings!
Confirmation
CRAWL RATIO vs. CONTENT SIZE (in words)
From the analysis of the Dataset
From our Experience
Content Size has a very big impact on Google’s crawl
● Content Size impacts Google’s Crawl.
● Websites with more content should be more
crawled by Google but we do not expect a
very high impact on Google’s crawl.
● Content Size is more impactful on Google’s
crawl than we were expecting.
New learnings!
From the analysis of the Dataset
● Content Size positively impacts Google’s crawl for websites of any size.
Some
impact
…whatever the size of the website
LESS THAN 100K PAGES MORE THAN 1M PAGES
Positive
impact
Great
impact
BETWEEN 100K AND 1M PAGES
What's your average content size?
How can you go
further?
Use custom charts
How can you go further?
After looking at overall, site-wide figures:
Look at pages by type of SEO role (expected traffic)
○ Product pages (ecommerce), article pages (publishing), ads (classifieds)
○ Category pages (lists of products / articles / ads)
→ Breakdown by segment in the report
Use the “advanced selector” feature to explore your data
○ Break down by any relevant dimension (content size, depth, linking…)
○ Combine 2 dimensions (segment + content size…)
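Outside of the reporting interface, the same kind of breakdown can be sketched on exported data. The example below groups pages by one dimension, or by a combination of two, and computes a Crawl Ratio per group. The dictionary keys (`segment`, `content_size`, `crawled_by_google`) and the 300-word bucket boundary are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Mapping


def crawl_ratio_by(
    pages: Iterable[Mapping[str, object]],
    key: Callable[[Mapping[str, object]], Hashable],
) -> Dict[Hashable, float]:
    """Crawl Ratio (%) for each group returned by `key`."""
    totals = defaultdict(int)
    crawled = defaultdict(int)
    for page in pages:
        group = key(page)
        totals[group] += 1
        if page["crawled_by_google"]:
            crawled[group] += 1
    return {group: 100.0 * crawled[group] / totals[group] for group in totals}


def content_bucket(page: Mapping[str, object]) -> str:
    """Two arbitrary content-size buckets (the 300-word boundary is an assumption)."""
    return "under 300 words" if page["content_size"] < 300 else "300+ words"
```

Calling `crawl_ratio_by(pages, lambda p: p["segment"])` breaks the ratio down by segment, while `crawl_ratio_by(pages, lambda p: (p["segment"], content_bucket(p)))` combines segment with content size.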
What's the impact of load time on Google's crawl?
What's the impact of content size on Google's crawl?
How to combine with a secondary dimension?
Avg. load time for compliant pages by segment?
Key
Takeaways
#1 Website Size dramatically impacts Google’s Crawl Ratio.
#2 Even small websites are not crawled at 100% by Google. Crawl Budget matters.
#3 Content size matters: Build high-quality unique content
#4 Orphans Volume matters: Like water, don’t waste crawl budget!
#5 Structure Depth matters: Don’t be afraid, prune useless branches!
Thank you
for your attention
Get in touch!
hello@botify.com
Questions?