@patrickstox @ahrefs #pubcon
Who is Patrick Stox?
Product Advisor, Technical SEO, & Brand Ambassador at Ahrefs
• I write for the Ahrefs blog and have written for many industry publications in the past
• I speak at conferences like SMX, Pubcon, UnGagged, DMO Advanced, TechSEO Boost, and BrightonSEO
• Organizer of the Raleigh SEO Meetup (most successful in the US) and the Beer & SEO Meetup
• We also run a conference, the Raleigh SEO Conference
• Founder of the Technical SEO Slack group
• Moderator of /r/TechSEO on Reddit
• Helped define the role of Search Marketing Strategist for the US Department of Labor
• Lead author of the SEO chapter of the 2021 Web Almanac, reviewer for the 2022 chapter
• Technical Review Editor for The Art of SEO, 4th Edition
Disclaimer
This is my understanding of systems and is based on a lot of public statements
from Google and my own knowledge.
Warning: It’s not going to be 100% complete or accurate.
How Many Domains Exist?
Q3 2022 according to Verisign: 349.9 million registered
January 2023 according to Netcraft: 270.9 million unique domains responded
According to Ahrefs: 213.1 million (after removing spam domains)
Googlebot
Googlebot is a lot of systems (1000+) and there are multiple Googlebots.
• Googlebot Image
• Googlebot News
• Googlebot Video
• Googlebot Desktop
• Googlebot Mobile
• +Ads and more
https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
URL Sources
• Links on pages, or anything that even looks like a link
• Sitemaps
• Request indexing in GSC
• Indexing API (limited use cases)
• RSS Feeds
• WebSub (formerly PubSubHubbub)
What SEOs Call Crawl Budget, Google Calls
Crawl demand
How much Google wants to crawl your site.
Crawl rate limit
How much crawling your website can support.
What Counts Against Your Crawl Budget?
All URLs and requests including:
• Pages/files
• Alternate URLs like AMP or m-dot pages, hreflang
• CSS
• JavaScript, including XHR requests
• Embedded content
Note: All Googlebots share the same crawl budget, including the ones for Ads, images, etc.
Crawl Demand Factors
• PageRank
• How often pages change (freshness/staleness)
• When it was last crawled
• Any major changes
Crawl Rate Factors
• Stability / crawl health
• Slow responses
• Errors. 5xx (server errors) or 429 (too many requests) HTTP status codes.
Google doesn’t want to crash sites, so the crawlers will generally back off if they start seeing issues.
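The back-off behavior above can be sketched as a simple adaptive delay. This is only an illustration, not Google's actual logic: the function name, the doubling factor, the 2-second "slow" threshold, and the delay bounds are all assumptions.

```python
def next_crawl_delay(current_delay, status_code, response_seconds,
                     min_delay=1.0, max_delay=600.0, slow_threshold=2.0):
    """Return the delay in seconds before the next request to a host.

    All thresholds and factors are hypothetical values for illustration.
    """
    server_error = 500 <= status_code <= 599
    too_many_requests = status_code == 429
    too_slow = response_seconds > slow_threshold

    if server_error or too_many_requests or too_slow:
        # Signs of trouble: back off by doubling the delay.
        new_delay = current_delay * 2
    else:
        # Healthy response: speed up gradually.
        new_delay = current_delay * 0.9

    # Keep the delay within sane bounds.
    return max(min_delay, min(max_delay, new_delay))
```

Backing off quickly and recovering slowly is a common choice for polite crawlers, because a struggling server gets relief right away.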
Caching Files
Files are stored for use in rendering.
Google will ignore your cache timings and fetch a new copy when they want to.
[Diagram: HTML, JS, and CSS files stored in caches]
Web Rendering Service (WRS)
Needed to process JavaScript
Evergreen (up-to-date) Googlebot
Headless (no Graphical User Interface)
Web Rendering Service (WRS)
• Stateless (storage and cookies cleared between loads)
• Denies Permissions
• Flattens light DOM and shadow DOM
• Date / Time functions adjusted
• Service workers rejected
• Animations may differ
• Random may not be random
Myth: 5 Second Limit
I think this started with a test from Max Prin on the time when the testing tools
took a screenshot. They need to have reasonable time limits for testing tools.
https://maxprin.com/tests/js-timer/
No 5 Second Limit
They’ll try to wait for pages to finish loading, using something like networkidle0 (no more network activity).
Rendering is eventually cut off in case something gets stuck or someone is trying to mine bitcoin.
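A rough sketch of a "wait for network idle, but cut off eventually" rule. In Puppeteer, networkidle0 means no network connections for at least 500 ms. The 30-second hard timeout and the function name here are assumptions for illustration, not values Google has published.

```python
def network_idle_time(requests, idle_window=0.5, hard_timeout=30.0):
    """Estimate when a page load would be treated as finished.

    requests: list of (start, end) times in seconds since navigation start.
    idle_window and hard_timeout are hypothetical values.
    """
    finish = 0.0  # end time of the latest request seen so far
    for start, end in sorted(requests):
        if start > finish + idle_window:
            # The network was quiet for a full idle window before this
            # request started, so the page already counted as idle.
            break
        finish = max(finish, end)
    # Idle is declared one window after the last activity, unless the
    # hard timeout cuts things off first (stuck or never-ending requests).
    return min(finish + idle_window, hard_timeout)
```

A request that starts after the idle point is simply never seen, which is one way late-loading content can be missed.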
It Doesn’t Even Make Sense
They’re basically loading a page with everything cached already.
[Diagram: WRS loading HTML, JS, and CSS files from caches]
This System Causes Other Issues
Impossible states: previous (cached) file versions can be used when rendering, mixing old and new files.
File versioning / fingerprinting should help.
XHR requests are done in real time.
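A minimal sketch of file fingerprinting: put a hash of the file's content into its name, so a changed file gets a new URL and a stale cached copy can't be mixed in. The function name and the 8-character digest length are assumptions. Most build tools do this for you.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(path, content, digest_length=8):
    """Return the path with a short content hash inserted before the extension.

    Example shape: /static/app.js -> /static/app.<hash>.js
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))
```

The same content always produces the same name, so unchanged files keep their cached URLs.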
Myth: Weeks To Render
All pages go through the renderer.
The average wait time is 5 seconds according to Google’s Martin Splitt.
The 90th percentile is only minutes, not weeks.
The myth probably comes from pages not being prioritized for crawling.
They Use Some Hacks
“In Google search we don’t really care about the pixels because we don’t
really want to show it to someone. We want to process the information
and the semantic information so we need something in the intermediate
state. We don’t have to actually paint the pixels.” – Martin Splitt
They Don’t Click
Load content into the Document Object Model (DOM) by default. They won’t
see the content if it requires a click that makes an XHR request to pull it in.
DOM Tree and CSS Object Model (CSSOM) form the Render Tree. That’s what
gets indexed.
Render Tree
DOM Tree (pictured)
CSSOM (not pictured) would add info like font size, weight, color, etc. to each element.
~20 Canonicalization Signals
• Duplicates
• Redirects (high weight)
• Canonical link elements (if a page has multiple, they will be ignored)
• Sitemap URLs
• Links (Internal/External, PageRank)
• Alternates – mobile, AMP, print, Hreflang
• HTTPS pages over HTTP
• Shorter URLs over longer URLs
• Where content was first published / seen
• Site level signals like a history of scraped content
• Pages over PDFs
Machine learning system
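A toy sketch of picking a canonical from weighted signals. Google uses a machine learning system, so the hand-set weights, the signal names, and the tie-break toward shorter URLs below are purely hypothetical. The sketch only shows the idea that many signals are combined and some (like redirects) count for more.

```python
# Hypothetical weights, for illustration only. The real weights are learned.
SIGNAL_WEIGHTS = {
    "redirect_target": 5.0,  # redirects carry high weight
    "canonical_link": 3.0,
    "in_sitemap": 1.0,
    "https": 1.0,
    "internal_links": 0.1,   # per link
}


def pick_canonical(candidates):
    """candidates: dict mapping URL -> dict of signal name -> value."""
    def score(url):
        signals = candidates[url]
        total = sum(SIGNAL_WEIGHTS.get(name, 0.0) * value
                    for name, value in signals.items())
        # On a tie, prefer the shorter URL.
        return (total, -len(url))

    return max(candidates, key=score)
```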
Processing – Link Parser
Good:
<a> tag with an href attribute.
<a href="/page">simple is good</a>
<a href="/page" onclick="goTo('page')">still okay</a>
Processing – Link Parser
Bad (but may be parsed):
<a routerLink="products/category">no href</a>
<a onclick="goTo('page')">no href</a>
<a href="javascript:goTo('page')">kind of nested</a>
<a href="javascript:void(0)">missing link</a>
<span onclick="goTo('page')">not the right HTML element or href</span>
<span href="page">not the right HTML element</span>
<option value="page">not the right HTML element</option>
<a href="#">no link</a>
Button, ng-click, there are many more ways this can be done incorrectly.
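A minimal sketch of a strict link extractor that only accepts the "good" pattern: an <a> tag with a real href. The class name and base URL handling are assumptions. Google's parser is more forgiving and may still pick up some of the bad patterns, so treat this as a way to check your own markup rather than a model of Googlebot.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # span, option, button, etc. are not links
        href = dict(attrs).get("href")
        if not href:
            return  # routerLink or onclick without an href
        href = href.strip()
        if href.startswith("#") or href.lower().startswith("javascript:"):
            return  # no crawlable URL here
        self.links.append(urljoin(self.base_url, href))
```

Usage: create `LinkExtractor("https://example.com/")`, call `.feed(html)`, then read `.links`.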
Processing – Content Parser
• Content – tokenized, vectorized. Words become numbers.
• Content language
• Content location
• Extract meta tags
• Extract Schema
• HTML Lexer – normalize the HTML
• Topic analysis. Content on other topics may be weighted less in ranking.
• Semantic analysis. Linguistic, knowledge graph, address extraction
• …
A Lot More In Processing Like
Drop anything after # in URLs.
(some exceptions to this)
Most Restrictive Directives
index + noindex + index = noindex
They’ll drop low quality content
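Two of the processing rules above, sketched in a few lines: dropping the fragment from a URL and resolving conflicting directives to the most restrictive one. The function names are made up for this example, and the sketch ignores the exceptions to the fragment rule.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Drop everything after # in a URL."""
    return urldefrag(url).url


def effective_index_directive(directives):
    """Most restrictive wins: any noindex makes the result noindex."""
    normalized = {d.strip().lower() for d in directives}
    return "noindex" if "noindex" in normalized else "index"
```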
Other Files May Be Processed Differently
• Adobe Portable Document Format (.pdf)
• Adobe PostScript (.ps)
• Google Earth (.kml, .kmz)
• GPS eXchange Format (.gpx)
• Hancom Hanword (.hwp)
• HTML (.htm, .html, other file extensions)
• Lotus
• Microsoft Excel (.xls, .xlsx)
• Microsoft PowerPoint (.ppt, .pptx)
• Microsoft Word (.doc, .docx)
• OpenOffice presentation (.odp)
• OpenOffice spreadsheet (.ods)
• OpenOffice text (.odt)
• Rich Text Format (.rtf)
• Scalable Vector Graphics (.svg)
• TeX/LaTeX (.tex)
• Text (.txt, .text, other file extensions), including source code in common programming languages:
  • Basic source code (.bas)
  • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
  • C# source code (.cs)
  • Java source code (.java)
  • Perl source code (.pl)
  • Python source code (.py)
• Wireless Markup Language (.wml, .wap)
• XML (.xml)
Image Processing
• Text around the image
• Content of the image. They tag what is in the image. Not super reliable.
• Alt attribute
• Image name (minimal weight)
• Webpage title and description
Photo from a Gary Illyes Presentation
at Pubcon.
Video Processing
• OCR to get text
• Objects identified from visuals
• Speech converted to text
• Structured data
• Text and other signals from the page, URL, title, description
PDFs
• PDFs are converted and indexed as HTML
• OCR to get text
• Images get indexed
• Links get picked up
• Title
• File name
• …
Data Infrastructure
Many data centers around the world.
Each has a copy of the index.
Millions of servers and hard drives.
Index is an inverted index.
Maps things like words to documents.
Index shards are split into words and phrases.
Other shards for metadata.
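A tiny inverted index, to show what "maps words to documents" means. This is a teaching sketch with whitespace tokenization and made-up function names. It says nothing about how Google shards or stores its index.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the doc_ids that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Looking up a word is a single dictionary access, which is why lookups stay fast even when the document set is huge.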
Start Typing - Autocomplete
Powered by real search data
and patterns across the web +
• The language of the query
• The location a query is coming from
• Trending interest in a query
• Your past searches
Probably reduces misspellings
Query parsing and understanding
BERT (DeepRank) – combinations of words express different
meanings and intents. They won’t drop important words from
the queries.
Neural matching – relates words in searches to concepts.
“For example, neural matching helps Google understand that a
search for ‘why does my TV look strange’ is related to the
concept of ‘the soap opera effect.’ We can then return pages
about the soap opera effect, even if the exact words aren’t used.”
Google Training Misspelling Example
Over 600 ways people misspelled Britney Spears.
http://archive.google.com/jobs/britney.html
Spelling Old Vs New
Old way:
How often terms were searched
+probability of typos from neighboring keys
New way:
Deep neural net with 680M parameters
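The old approach can be sketched as "how often a term is searched" times "how likely the typo is given neighboring keys". Everything here is an assumption for illustration: the partial keyboard map, the 0.1 and 0.01 probabilities, and handling only same-length substitutions.

```python
# Partial QWERTY adjacency, intended key -> keys next to it.
# Hypothetical and incomplete: only a few keys, for the example.
NEIGHBORS = {
    "i": "uojk",
    "e": "wrsd",
    "y": "tugh",
}


def substitution_probability(typed, intended):
    """Made-up probabilities: neighboring-key slips are likelier than others."""
    if typed == intended:
        return 1.0
    return 0.1 if typed in NEIGHBORS.get(intended, "") else 0.01


def best_correction(typed_word, term_counts):
    """Pick the known term maximizing search count x typo likelihood."""
    best, best_score = typed_word, 0.0
    for term, count in term_counts.items():
        if len(term) != len(typed_word):
            continue  # this sketch only handles substitutions
        likelihood = 1.0
        for typed_char, intended_char in zip(typed_word, term):
            likelihood *= substitution_probability(typed_char, intended_char)
        score = count * likelihood
        if score > best_score:
            best, best_score = term, score
    return best
```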
Query Expansion
When the query is sent, it’s going to also pull pages with terms that include:
• Synonyms
• Antonyms
• Acronyms
• Plural/singular
• Stemming – root words
• Diacritical expansion – versions with and without accented characters
These will mostly get lower weights in scoring than the main term used.
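A small sketch of query expansion where the variants carry lower weights than the term that was actually typed. The expansion table and the weights are hypothetical. A real system would derive them from data rather than a hand-written dictionary.

```python
# Hypothetical expansion table: term -> {variant: weight below 1.0}.
EXPANSIONS = {
    "car": {"cars": 0.8, "automobile": 0.6},
    "cafe": {"café": 0.9, "cafes": 0.8},
}


def expand_query(query):
    """Return term -> weight, with original terms at full weight (1.0)."""
    weighted = {}
    for term in query.lower().split():
        weighted[term] = 1.0
        for variant, weight in EXPANSIONS.get(term, {}).items():
            # Never lower a weight that is already higher.
            weighted[variant] = max(weighted.get(variant, 0.0), weight)
    return weighted
```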
Concepts & Entities
People, places, things
“RankBrain helps Google better relate pages to concepts – This
means Google can better return relevant pages even if they
don’t contain the exact words used in a search, by
understanding the page is related to other words and
concepts.”
Stop Words
The, is, and, of, a, are, an, if, etc.
Removed for some queries.
Used for other queries, like when it matches a concept.
Make A Smaller List - Ranking
Google is going to cut all those results down to the top 1000 by ranking them.
Ranking / Scoring – Query Dependent
Features of a page and query
• Keyword hits
• All those other versions from the query expansion like synonyms
• Proximity
• Content relevance, topicality
• …
Ranking / Scoring – Query Independent
Features of a page
• PageRank, site queries, mentions,
& other E-E-A-T signals
• Language
• Mobile-friendliness
• Page experience
• …
In scoring, these numbers are multiplied by other numbers.
Reranking / Post-Retrieval Adjustments
Works with a smaller number of results: the top 1000
With the smaller number, they can run more intelligent but resource intensive
systems to re-order the results.
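The two-stage idea can be sketched like this: a cheap score (keyword hits multiplied by a query-independent quality number) cuts the list down, then a more expensive scorer re-orders only the shortlist. The scoring functions, the "quality" field, and the phrase bonus are stand-ins invented for this example, not Google's systems.

```python
def cheap_score(doc, query_terms):
    """Query-dependent hits multiplied by a query-independent number."""
    words = doc["text"].lower().split()
    hits = sum(words.count(term) for term in query_terms)
    return hits * doc["quality"]


def expensive_score(doc, query_terms):
    """Stand-in for a costly model: bonus when the exact phrase appears."""
    words = doc["text"].lower().split()
    n = len(query_terms)
    has_phrase = any(words[i:i + n] == query_terms
                     for i in range(len(words) - n + 1))
    return cheap_score(doc, query_terms) * (2.0 if has_phrase else 1.0)


def rank(docs, query, shortlist_size=1000):
    terms = query.lower().split()
    # Stage 1: cheap scoring over everything, keep only the top results.
    shortlist = sorted(docs, key=lambda d: cheap_score(d, terms),
                       reverse=True)[:shortlist_size]
    # Stage 2: the expensive scorer runs only on the shortlist.
    return sorted(shortlist, key=lambda d: expensive_score(d, terms),
                  reverse=True)
```

A page that misses the shortlist never reaches the second stage, no matter how well the expensive scorer would have rated it.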
RankBrain & BERT - Again
“Based on its complex language understanding, BERT can very quickly rank
documents for relevance.”
Depending on the search, Google’s algorithm can use either RankBrain, BERT,
or both.
Host Clustering
Limits the results you see from the same domain.
Add &filter=0 to your search URL to see unfiltered results.
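A minimal sketch of host clustering as a post-ranking filter. The limit of 2 results per host is an assumption, and grouping by hostname is a simplification of "same domain". Google doesn't publish the exact rule.

```python
from urllib.parse import urlsplit


def cluster_by_host(ranked_urls, max_per_host=2):
    """Keep at most max_per_host results per hostname, preserving rank order."""
    seen = {}
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if seen.get(host, 0) < max_per_host:
            kept.append(url)
            seen[host] = seen.get(host, 0) + 1
    return kept
```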