How Google Really Analyzes Content for Search

The New Content SEO
FLOQ - Amanda King
Sydney SEO Conference
14 April 2023

What we’ll talk about
1. A quick refresher
2. Have keywords ever actually been a thing Google used?
3. How Google reads content may not be what you think
4. So what do we do about all this?
5. Who tf am I?

A brief refresher on how Google crawls the Internet
It’s three separate stages: crawl,
index, serve; with sub-processes
for scoring and ranking.
Content analysis is included in the
indexing engine, content relevancy
is in the serving engine.
While this is an old patent (2011) the
fundamentals still apply for this
reminder.
Source: https://patents.google.com/patent/US8572075B1/, retrieved 22 Mar 2023
https://developers.google.com/search/docs/fundamentals/how-search-works

The search ranking engine works in systems
● Query Deserves Freshness is a system
● Helpful Content is a system
● MUM & BERT are systems
○ “Bidirectional Encoder Representations from
Transformers (BERT) is an AI system Google uses
that allows us to understand how combinations of
words express different meanings and intent.”
https://developers.google.com/search/docs/appearance/ranking-systems-guide

Have keywords ever actually been
a thing Google used?

While Google is a
machine, it has moved
fundamentally beyond
keywords, and has done so
since at least 2015.

Queries very quickly
become entities
“[...]identifying queries in query data;
determining, in each of the queries,
(i) an entity-descriptive portion that
refers to an entity and (ii) a suffix;
determining a count of a number of
times the one or more queries were
submitted“
- patent granted in 2015, submitted in
2012
Source: https://patents.google.com/patent/US9047278B1/en ; https://patents.google.com/patent/US20150161127A1/
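To make the quoted claim concrete, here is a minimal sketch of splitting queries into an entity-descriptive portion and a suffix, then counting how often each pair was submitted. The entity list, the query log and the function names are all hypothetical, and the patent does not say entities are found by a simple prefix match; that is a simplification for illustration only.

```python
from collections import Counter

# Hypothetical entity list; how a real system knows its entities is not shown here.
KNOWN_ENTITIES = ["sydney opera house", "taylor swift"]


def split_query(query, entities=KNOWN_ENTITIES):
    """Return (entity, suffix) if the query starts with a known entity, else None."""
    q = " ".join(query.lower().split())
    # Try longer entity names first so the most specific match wins.
    for entity in sorted(entities, key=len, reverse=True):
        if q == entity or q.startswith(entity + " "):
            return entity, q[len(entity):].strip()
    return None


def count_suffixes(queries):
    """Count how many times each (entity, suffix) pair was submitted."""
    counts = Counter()
    for query in queries:
        parsed = split_query(query)
        if parsed and parsed[1]:
            counts[parsed] += 1
    return counts


# Made-up query log for illustration.
query_log = [
    "Sydney Opera House tickets",
    "sydney opera house tickets",
    "Sydney Opera House history",
    "Taylor Swift tour dates",
    "cheap flights",
]
suffix_counts = count_suffixes(query_log)
```

The point for content work: the unit being counted is "entity plus what people want to know about it", not a bare keyword string.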

Google acknowledges query-only based
matching is pretty terrible.
“Direct “Boolean” matching of query terms has well known limitations,
and in particular does not identify documents that do not have the query
terms, but have related words [...]The problem here is that conventional
systems index documents based on individual terms, rather than on
concepts. Concepts are often expressed in phrases [...] Accordingly,
there is a need for an information retrieval system and methodology that
can comprehensively identify phrases in a large scale corpus, index
documents according to phrases, search and rank documents in
accordance with their phrases, and provide additional clustering and
descriptive information about the documents. [...]”
- Information retrieval system for archiving multiple document
versions, granted 2017 (link)

So it decided to make its search engine
concept- and phrase-based.
“The system is adapted to identify phrases that have
sufficiently frequent and/or distinguished usage in the
document collection to indicate that they are “valid” or “good”
phrases [...]The system is further adapted to identify phrases
that are related to each other, based on a phrase's ability to
predict the presence of other phrases in a document.”
- Information retrieval system for archiving multiple
document versions, granted 2017 (link)

“Rather than simply
searching for content that
matches individual words,
BERT comprehends how a
combination of words
expresses a complex idea.”
Source: https://blog.google/products/search/how-ai-powers-great-search-results/

MUM takes this a step further
● About 1,000 times more powerful than BERT
● Trained across 75 languages for greater context
● Understands information across different types of media
(video, text, etc.)
https://blog.google/products/search/introducing-mum/

How Google reads content may
not be what you think

Step 1
Indexing
Indexing is the stage where content
is analysed, so how does Google
do it?

BERT is a technique for
pre-training natural
language processing
models. So how does natural
language processing work,
once it has a corpus of data?
Source: https://blog.google/products/search/search-language-understanding-bert/

Is there anything in this process that even looks like “keywords”?

So the broad-strokes steps in the
indexation process are:
1. Parsing: tokenisation, parts of speech, stemming
(for Google, lemmatization)
2. Topic modelling: entity detection, relation detection
3. Understanding
4. On to the next engine, ranking

Parsing is intrinsically
categorisation
● Semantic distance
● Keyword-seed affinity
● Category-seed affinity
● Category-seed affinity to threshold
https://patents.google.com/patent/US11106712B2; https://www.seobythesea.com/2021/09/semantic-relevance-of-keywords/

How natural language processing usually works: tokenization and subwords
Source: https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html
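A minimal sketch of the greedy longest-match-first idea behind WordPiece tokenization, as a plain loop rather than the faster algorithm described in the linked post. The vocabulary here is a tiny hypothetical one; real vocabularies hold tens of thousands of learned subwords. The "##" prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subwords by repeatedly taking the longest vocabulary match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary for illustration.
TOY_VOCAB = {"un", "##afford", "##able", "run", "##ning", "##s"}

print(wordpiece_tokenize("unaffordable", TOY_VOCAB))
print(wordpiece_tokenize("running", TOY_VOCAB))
```

Note that nothing in this step looks like a "keyword": words are already being broken into reusable pieces before any meaning is assigned.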

● N-grams: important to find the
primary concepts of the
sentence by identifying and
excluding stop words
● “Running” “runs” “ran” = same
base — “run”
This gets broken down even
further
https://patents.google.com/patent/US8423350B1/
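The two bullets above can be sketched as follows: drop stop words, reduce word forms to a shared base, then build n-grams from what is left. The stop-word set and the lemma lookup are hand-written assumptions for illustration; a production system would use a full lemmatizer, and this is not a description of Google's actual implementation.

```python
# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "is", "to"}
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def normalise(text):
    """Lowercase, strip punctuation, remove stop words, map forms to a base word."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [LEMMAS.get(t, t) for t in tokens if t and t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = normalise("She ran in the park.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```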

Google does a lot of things when detecting
entities and relationships
● Identifying aspects to define entities based on popularity
and diversity, granted in 2011 (link)
● Finding the entity associated with a query before returning
a result, using input from human quality raters to confirm
objective fact associated with an entity, granted in 2015
(link)
● Understanding the context of the query, entity and related
answer you’re searching for, granted in 2019 (link)
● Aims to understand user generated content signals in
relation to a webpage, granted in 2022 (link)

Google does a lot of things when detecting
entities and relationships
● Understanding the best way to present an entity in a
results page, granted in 2016 (link)
● Managing and identifying disambiguation in entities,
granted in 2016 (link)
● Build entities through co-occurring “methodology based
on phrases” and store lower information gain
documents in a secondary index, granted in 2020 (link)
● Understanding context from previous query results and
behaviour, granted in 2016 (link)

Step 2
Scoring
In their own description of their
ranking & scoring engine, Google
offers 5 buckets:
● Meaning
● Relevance
● Quality
● Usability
● Context

Scoring is all those 200+ factors we talk
about…
Google has cited everything from internal links, external links, pogo sticking and “user
behaviour” to the proximity of the query terms to each other, context, attributes, and more.
Just a few of the patents related to scoring:
● Evaluating quality based on neighbor features (link)
● Entity confidence (link)
● Search operation adjustment and re-scoring (link)
● Evaluating website properties by partitioning user feedback (link)
● Providing result-based query suggestions (link)
● Multi-process scoring (link)
● Block spam blog posts with “low link-based score” (link)

It actually looks like
they have a
classification engine
for entities as well
This patent was filed in 2010,
granted in 2014. Likely a basis
for the Knowledge Graph.
(US8838587B1)
https://patents.google.com/patent/US8838587B1/en

“...link structure may be
unavailable, unreliable, or
limited in scope, thus,
limiting the value of using
PageRank in ascertaining
the relative quality of some
documents.” (circa 2005)

There is more than one document scoring function; the functions are weighted, and that has been the case since the beginning

How Google ranks content
● Based on historical behaviour from similar searches in
aggregate (application)
● Based on external links (link)
● Based on your own previous searches (link)
● Based on whether or not it should directly provide the
answer via Knowledge Graph (link)
● Phrase- and entity-based co-occurrence threshold
scores (link)
● Understanding intent based on contextual information
(link)

Helpful Content Update & Information
Gain Score (granted Jun 2022)
● The information gain score might be personal to you
and the results you’ve already seen
● Featured snippets may be different from one search to
another based on the information gain score of your
second search
● Pre-training a ML model on a first set of data shown to
users in aggregate, getting an information gain score,
and using that to generate new results in SERPs.
https://patents.google.com/patent/US20200349181A1/en

What is “information gain”?
“Information gain, as the ratio of actual co-occurrence rate to
expected co-occurrence rate, is one such prediction
measure. Two phrases are related where the prediction
measure exceeds a predetermined threshold. In that case,
the second phrase has significant information gain with
respect to the first phrase.“
- Phrase-based searching in an information retrieval
system, granted 2009 (link)
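A worked sketch of that ratio. It assumes the expected co-occurrence rate is the product of each phrase's individual document rate (i.e. what you would see if the two phrases were independent), and it uses simple substring matching over a made-up four-document corpus. The function name and the data are hypothetical.

```python
def information_gain(docs, phrase_a, phrase_b):
    """Ratio of actual to expected co-occurrence rate for two phrases."""
    n = len(docs)
    has_a = [phrase_a in d for d in docs]
    has_b = [phrase_b in d for d in docs]
    rate_a = sum(has_a) / n
    rate_b = sum(has_b) / n
    # Actual rate: share of documents that contain both phrases.
    actual = sum(a and b for a, b in zip(has_a, has_b)) / n
    # Expected rate under independence (an assumption of this sketch).
    expected = rate_a * rate_b
    return actual / expected if expected else 0.0


# Made-up corpus, already lowercased.
docs = [
    "green tea is rich in antioxidants",
    "green tea and antioxidants explained",
    "how to brew green tea",
    "football results from the weekend",
]

related = information_gain(docs, "green tea", "antioxidants")
unrelated = information_gain(docs, "green tea", "football")
print(related, unrelated)
```

A ratio above 1 means the phrases appear together more often than chance would suggest; the quote says two phrases count as related once the measure passes a predetermined threshold.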

So, basically, it’s
quantifying to what
degree you talk about all
the topics Google sees as
related to your main
subject.

If information gain is such a
strong concept in how
Google chooses which
content to show, why do so
few folks talk about it?

So what do we do about all this?

When is the last time
you’ve done a full
content inventory?

What I mean when I say content inventory
https://www.portent.com/onetrick

Redo keyword research and overlay
entities
● Pull content for at least the top 10 search results
ranking for your target keyword
● Dump them into Diffbot (https://demo.nl.diffbot.com/) or
the Natural Language AI demo
(https://cloud.google.com/natural-language)
● Note the entities and salience
● Run your target page
● Understand the differences
● Update your content accordingly
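The "understand the differences" step can be sketched as a simple comparison. This assumes you have already copied entity names and salience scores out of one of the tools above into plain dictionaries; no API is called here, and the function name, the `min_pages` cut-off and all numbers are hypothetical.

```python
def entity_gaps(competitor_entities, target_entities, min_pages=2):
    """List entities found on at least min_pages competitor pages but not on the target."""
    totals, pages = {}, {}
    for page in competitor_entities:
        for name, salience in page.items():
            key = name.lower()
            totals[key] = totals.get(key, 0.0) + salience
            pages[key] = pages.get(key, 0) + 1
    target = {name.lower() for name in target_entities}
    gaps = [
        (name, totals[name] / pages[name])  # average salience where the entity appears
        for name in totals
        if pages[name] >= min_pages and name not in target
    ]
    # Highest average salience first.
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Made-up entity/salience pairs for three ranking pages and your own page.
competitors = [
    {"green tea": 0.6, "antioxidants": 0.2, "caffeine": 0.1},
    {"green tea": 0.5, "antioxidants": 0.3},
    {"green tea": 0.7, "caffeine": 0.2, "matcha": 0.1},
]
target = {"green tea": 0.8, "matcha": 0.1}

for name, avg_salience in entity_gaps(competitors, target):
    print(name, round(avg_salience, 2))
```

Treat the output as a prompt for editorial judgment, not a checklist: an entity is only worth adding if it genuinely belongs in the page.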

Start with keyword research, find
co-occurring terms
● Pull content for at least the top 10 search results
ranking for your target keyword
● Look at TF-IDF calculators to reverse engineer the topic
correlation (Ryte has a paid one)
● Note the terms included
● Run your target page
● Understand the differences
● Update your content accordingly
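If you don't have a TF-IDF tool to hand, the idea can be sketched in a few lines. This uses one common textbook form, term frequency times log(N / document frequency); paid tools may weight things differently. The tokenizer, function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and strip basic punctuation; a stand-in for real parsing."""
    cleaned = (t.strip(".,!?()").lower() for t in text.split())
    return [t for t in cleaned if t]


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenised = [tokenize(d) for d in docs]
    n = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def missing_terms(competitor_docs, target_doc, top_n=5):
    """Terms that score well on competitor pages and never appear on the target page."""
    scores = tf_idf(competitor_docs + [target_doc])
    target_terms = set(tokenize(target_doc))
    best = {}
    for doc_scores in scores[:-1]:  # the last entry is the target page itself
        for term, score in doc_scores.items():
            if term not in target_terms:
                best[term] = max(best.get(term, 0.0), score)
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


# Made-up page texts for illustration.
competitor_docs = ["green tea antioxidants", "green tea caffeine antioxidants"]
target_doc = "green tea benefits"
print(missing_terms(competitor_docs, target_doc))
```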

Break old content habits
● FAQ on product pages
● Consolidate super-granularly targeted blog articles
● Think outside of the blog folder — the semantic
relationship can carry through to the directory order of
the website as well
● Internal linking can be a secret weapon
● Fit content to purpose: not everything needs a
3,000-word in-depth article

Measure what really
matters to the business
— traffic and revenue
from organic.

Amanda King is a human
● Over a decade in the SEO industry
● Traveled to 40+ countries
● Business- and product-focussed
● Knows CRO, Data, UX
● Always open to learning something new
● Slightly obsessed with tea
Thank you
Amanda King
t. @amandaecking
i. @floq.co / @amandaecking
w. floq.co
