Checking Google index status at scale

Checking Google Index status at scale with Node.js
Jose Luis Hernando
@jlhernando #BrightonSEO
Senior Technical SEO Consultant

Today’s agenda
1. Why it’s important to know your website’s indexing status
2. The challenge to extract this data
3. Getting the data with Node.js – Live Demo!
4. Using this data for your SEO strategy

Why is it important?
Reason #1
Not in the Index => Not in the SERPs
Icons from Google, Flaticon & Sitecheckerpro

Reason #2
Google evaluates site quality based on indexed pages
Sources:
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Low Quality Pages
• Uncontrolled Faceted Navigation URLs
• Unsupervised User Generated Content
• Indexable Non-Canonical URLs
+
High Quality Pages
• Category Pages
• Editorial Pages
• Canonical Product Pages

Reason #3
Inefficient use of Google’s resources
https://website.com/category-one/
HTML CSS JS
/category-one/?color=red
/category-one/?color=blue
/category-one/?color=red&blue
…
∞
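
To make the "∞" concrete, here is a minimal sketch that counts how many parameterised variants a single category URL can produce. The facet names and values are hypothetical, and it assumes each facet is optional, takes at most one value at a time, and always appears in the same parameter order.

```javascript
// Count the URL variants one category page can generate when each facet
// is optional and takes at most one value (parameter order assumed fixed).
function countFacetedUrls(facets) {
  // Each facet contributes (number of values + 1) choices:
  // one per value, plus "facet not set".
  const combinations = Object.values(facets).reduce(
    (total, values) => total * (values.length + 1),
    1
  );
  return combinations - 1; // exclude the clean, parameter-free category URL
}

// Hypothetical facets for /category-one/
const facets = {
  color: ['red', 'blue', 'green'],
  size: ['s', 'm', 'l', 'xl'],
  brand: ['a', 'b', 'c', 'd', 'e'],
};

console.log(countFacetedUrls(facets)); // 119
```

Three small facets already yield 119 extra URLs for one category. Allowing multi-select values or free parameter ordering makes the count grow much faster.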

How much of your site is Googlebot crawling?
Avg. Crawl Ratio (%) and Avg. Active Ratio (%) by site size:
• 1-10k: Crawl Ratio 71.7%, Active Ratio 45.3%
• 10k-100k: Crawl Ratio 54.3%, Active Ratio 30.2%
• 100k-1M: Crawl Ratio 41.7%, Active Ratio 15.1%
• 1M+: Crawl Ratio 34.4%, Active Ratio 10.1%
Crawl Ratio: percentage of pages crawled by Google in 30 days.
Active Ratio: percentage of pages that have generated at least one organic visit in 30 days.
Source: How Does Google Crawl the Web? – Annabelle Bouard & Dimitri Brunel (Botify)
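
Both ratios can be estimated for your own site. The sketch below assumes you already have three sets of URLs for the same 30-day window: every known page on the site, the pages Googlebot requested (for example from log files), and the pages with at least one organic visit (for example from analytics). The variable names and sample paths are illustrative only.

```javascript
// Percentage (one decimal) of the site's URLs that appear in `observedUrls`.
function ratio(allUrls, observedUrls) {
  if (allUrls.size === 0) return 0;
  let hits = 0;
  for (const url of observedUrls) {
    // Ignore observed URLs that are not part of the known site structure.
    if (allUrls.has(url)) hits += 1;
  }
  return Math.round((hits / allUrls.size) * 1000) / 10;
}

// Hypothetical data for a five-page site over 30 days.
const siteUrls = new Set(['/a', '/b', '/c', '/d', '/e']);
const crawled = new Set(['/a', '/b', '/c', '/x']); // '/x' is not a known page
const active = new Set(['/a', '/b']);

console.log(ratio(siteUrls, crawled)); // 60 (crawl ratio)
console.log(ratio(siteUrls, active)); // 40 (active ratio)
```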

The challenge: extracting this data
• Googlebot’s crawling behaviour doesn’t determine indexing status
• You rely on partial and sometimes inaccurate data points:
  • site: & inurl: operators
  • GSC Indexing reports:
    • URL Inspection Tool (< 200 URLs / day)
    • Coverage Reports (< 1,000 rows / report)
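
A quick back-of-the-envelope sketch shows why these limits matter at scale. It treats 200 URLs per day as the upper bound of the URL Inspection Tool and uses the 111,772-URL sitemap from use case #1 as the workload; the function name is illustrative.

```javascript
// Days needed to check every URL when only `urlsPerDay` can be inspected.
function daysToInspect(totalUrls, urlsPerDay) {
  if (urlsPerDay <= 0) throw new RangeError('urlsPerDay must be positive');
  return Math.ceil(totalUrls / urlsPerDay);
}

console.log(daysToInspect(111772, 200)); // 559
```

At that pace a single pass over the sitemap would take more than a year and a half, by which time the first results would be long out of date.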

Proxy metrics != Accurate data

If you can’t find it, build it

{Live demo}
bit.ly/google-index-checker-script
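
The script itself is available at the link above. As a rough illustration of the general pattern (not the author's code), the sketch below runs an asynchronous check over a list of URLs in small batches so that only a few checks are in flight at once. `checkUrl` is a hypothetical function supplied by the caller, and the stub used here performs no network requests.

```javascript
// Run an async check over many URLs, a limited number at a time.
// `checkUrl` is a hypothetical caller-supplied function that resolves
// to true (indexed) or false (not indexed).
async function checkInBatches(urls, checkUrl, batchSize = 5) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // allSettled keeps going when an individual check fails.
    const settled = await Promise.allSettled(batch.map((url) => checkUrl(url)));
    settled.forEach((outcome, j) => {
      results.push({
        url: batch[j],
        indexed: outcome.status === 'fulfilled' ? outcome.value : null,
        error: outcome.status === 'rejected' ? String(outcome.reason) : null,
      });
    });
  }
  return results;
}

// Stubbed checker for illustration only: it pretends that every
// parameterised URL is not indexed.
const fakeCheck = async (url) => !url.includes('?');

checkInBatches(
  ['https://website.com/category-one/', 'https://website.com/category-one/?color=red'],
  fakeCheck,
  2
).then((results) => console.table(results));
```

Recording failures as `indexed: null` rather than `false` keeps "could not check" separate from "not indexed", which matters when the output feeds the analyses that follow.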

Quick FYI
Using the following method goes against Google’s Terms of Service, as it automatically requests search queries from Google Search.

Our script outperforms every other method available

How can you use Google index data?
• Identify inefficient use of crawl budget
• Error Prioritisation
• Identify holes in your architecture
• Check for pages from your site that should be indexed but are not.
• Find pages that should not be indexed but are indexed.
• Detect pages that used to exist and now return an error (4xx) but are still indexed.
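
The three checks above can be sketched as a single classification step. The record shape (`url`, `statusCode`, `shouldBeIndexed`, `indexed`) and the label names are assumptions made for this example; `shouldBeIndexed` stands for whatever rule you use to decide that a page belongs in the index.

```javascript
// Label one URL record according to the three checks.
// Assumed record shape: { url, statusCode, shouldBeIndexed, indexed }
function classify(record) {
  const isClientError = record.statusCode >= 400 && record.statusCode < 500;
  if (record.indexed && isClientError) return 'error-still-indexed';
  if (record.indexed && !record.shouldBeIndexed) return 'indexed-but-should-not-be';
  if (!record.indexed && record.shouldBeIndexed) return 'missing-from-index';
  return 'ok';
}

// Hypothetical records
const records = [
  { url: '/category-one/', statusCode: 200, shouldBeIndexed: true, indexed: false },
  { url: '/category-one/?color=red', statusCode: 200, shouldBeIndexed: false, indexed: true },
  { url: '/old-product/', statusCode: 404, shouldBeIndexed: false, indexed: true },
];

for (const record of records) {
  console.log(record.url, '->', classify(record));
}
```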

Use case #1
Sitemap Health Check
How many URLs from your XML sitemap are indexed?
Sitemaps = 111,772 URLs
Google Index Status of URLs from Sitemap, by status code:
• 200 Status Code – 81,688 URLs: 80% Indexed (74,223 Indexed, 7,465 Not Indexed)
• 404 Status Code – 29,969 URLs: 21% Indexed (6,268 Indexed, 23,701 Not Indexed)
• 301 Status Code – 365 URLs: 4% Indexed (16 Indexed, 349 Not Indexed)
Inspired by Data Secrets of the Index Coverage Report – AJ Kohn
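
A breakdown like this can be produced by grouping the per-URL results by status code. The sketch below assumes each row carries a `statusCode` from a crawl of the sitemap and an `indexed` flag from the index check; both field names and the sample rows are illustrative.

```javascript
// Group per-URL results by HTTP status code and compute the indexed share.
// Assumed row shape: { url, statusCode, indexed }
function summariseByStatus(rows) {
  const summary = new Map();
  for (const { statusCode, indexed } of rows) {
    const entry = summary.get(statusCode) || { total: 0, indexed: 0 };
    entry.total += 1;
    if (indexed) entry.indexed += 1;
    summary.set(statusCode, entry);
  }
  return [...summary].map(([statusCode, { total, indexed }]) => ({
    statusCode,
    total,
    indexed,
    notIndexed: total - indexed,
    indexedShare: Math.round((indexed / total) * 100), // whole percent
  }));
}

// Hypothetical rows
const rows = [
  { url: '/a', statusCode: 200, indexed: true },
  { url: '/b', statusCode: 200, indexed: true },
  { url: '/c', statusCode: 200, indexed: true },
  { url: '/d', statusCode: 200, indexed: false },
  { url: '/e', statusCode: 404, indexed: true },
];

console.table(summariseByStatus(rows));
```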

Next Steps
1) Identify if these URLs are important to your site’s bottom line
2) Check if a pool of these URLs has issues in GSC’s Index Coverage Report
3) Choose a tactic to improve the visibility of these URLs
4) Isolate the relevant URLs and modify the existing sitemap or create a new-sitemap.xml to monitor progress
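
Step 4 can be sketched as a small sitemap generator. This is a minimal example that follows the sitemaps.org format with only the required `<loc>` element; the function names and the sample URLs are illustrative.

```javascript
// Escape the characters that are not allowed as-is inside XML text.
function escapeXml(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a minimal XML sitemap string from a list of absolute URLs.
function buildSitemap(urls) {
  const entries = urls
    .map((url) => `  <url><loc>${escapeXml(url)}</loc></url>`)
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    entries,
    '</urlset>',
  ].join('\n');
}

// Hypothetical list of isolated URLs
console.log(
  buildSitemap([
    'https://website.com/category-one/',
    'https://website.com/search?q=shoes&page=2',
  ])
);
```

The output can be saved as new-sitemap.xml and submitted in GSC. A single sitemap file is limited to 50,000 URLs, so larger lists need to be split across several files.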

Use case #2
Log File Analysis Plus+
How many URLs with Googlebot hits are indexed?
• ~160k Googlebot hits to non-canonical URLs (/Uppercase/ vs /lowercase/)
• Identified if non-canonical URLs were indexed
• Identified if the referenced canonical URLs were indexed
Indexed Non-Canonical URLs Requested by Googlebot (Undisclosed Client):
• Indexed: 35.8%
• Not Indexed: 64.2%
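
The first step of this analysis, isolating Googlebot hits to uppercase paths, might look like the sketch below. It assumes access logs in the common "combined" format with the user agent as the last quoted field, and it matches on the user-agent string alone, which can be spoofed. The regular expression and function name are illustrative.

```javascript
// Matches the request path, status code and user agent of a combined-format log line.
const LOG_LINE = /"(?:GET|HEAD) (\S+) HTTP\/[\d.]+" (\d{3}) .*"([^"]*)"$/;

// Count Googlebot requests per path for paths that contain uppercase characters.
function uppercaseGooglebotHits(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    const match = LOG_LINE.exec(line);
    if (!match) continue; // skip lines in an unexpected format
    const [, requestPath, , userAgent] = match;
    const pathname = requestPath.split('?')[0]; // ignore the query string
    if (!userAgent.includes('Googlebot')) continue;
    if (pathname === pathname.toLowerCase()) continue; // already lowercase
    counts.set(pathname, (counts.get(pathname) || 0) + 1);
  }
  return counts;
}

// Hypothetical log lines
const sampleLines = [
  '66.249.66.1 - - [01/Jan/2020:00:00:00 +0000] "GET /Category/Shoes/ HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '66.249.66.1 - - [01/Jan/2020:00:00:05 +0000] "GET /category/shoes/ HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '203.0.113.7 - - [01/Jan/2020:00:00:09 +0000] "GET /Category/Bags/ HTTP/1.1" 200 987 "-" "Mozilla/5.0 (Windows NT 10.0)"',
];

console.log(uppercaseGooglebotHits(sampleLines));
```

The resulting paths, together with their lowercase counterparts, are the two URL lists to run through the index check.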

Log File Analysis+
Next Steps
1) Identify if the canonical tag is correctly placed
2) Identify if the root cause is internal linking, external linking or other
3) Consider redirecting non-canonical URLs to canonical URLs
4) Create a new-sitemap.xml with problematic URLs to encourage Googlebot to revisit those URLs and for monitoring purposes
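
For step 3, the redirect decision for the uppercase/lowercase case can be sketched as a pure function. It assumes the lowercase path is always the canonical one, which should be confirmed for your site before anything like this goes into server configuration; the function name is illustrative.

```javascript
// Return the lowercase canonical URL to 301-redirect to,
// or null when the requested path is already lowercase.
function canonicalRedirect(requestUrl) {
  const url = new URL(requestUrl);
  const lowered = url.pathname.toLowerCase();
  if (lowered === url.pathname) return null;
  url.pathname = lowered; // query string is left untouched
  return url.toString();
}

console.log(canonicalRedirect('https://website.com/Category-One/?color=red'));
// https://website.com/category-one/?color=red
console.log(canonicalRedirect('https://website.com/category-one/'));
// null
```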

Inform your SEO strategy
Other use cases
• Check Real-time indexing (News sites, Offer sites, Job Boards)
• Check uncontrolled faceted navigation (Crawl budget optimisation)
• Check inactive product/category URLs (Site architecture improvements)
• Check old 4xx that are live now & haven't been deindexed yet (Recover organic opportunities)

Further reading
https://bit.ly/google-index-checks
https://bit.ly/gsc-index-coverage

The Google Index Checker script has opened a door
to get useful, actionable data at scale for your sites
Use it, and act on it.

Thank you.
builtvisible.com
Jose Luis Hernando
Senior Technical SEO Consultant
@jlhernando

Sources & additional reading
How Does Google Crawl the Web? – Annabelle Bouard & Dimitri Brunel (Botify)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
Data Secrets of the Index Coverage Report – AJ Kohn (Blind Five Year Old)
How Google Search Works – Google Documentation
How Search organises information – Google Documentation
Our new search index: Caffeine – Carrie Grimes
When indexing goes wrong: how Google Search recovered from indexing issues & lessons learned since – Vincent Courson, Google Search Outreach
How Search Engines Work: Crawling, Indexing & Ranking – Moz
(Please) Stop Using Unsafe Characters in URLs – Jeff Starr
