This document summarizes Amanda King's presentation on the new content SEO at the Sydney SEO Conference. It discusses how Google has moved beyond keywords and now understands content semantically through natural language processing and systems like BERT. It also explains how Google analyzes content through parsing, entity detection, and understanding relationships to score and rank pages. The presentation recommends doing a full content inventory to identify entities, related terms, and differences from top ranking pages to update content accordingly.
How Google Really Analyzes Content for Search
1. The New Content SEO
FLOQ - Amanda King
Sydney SEO Conference
14 April 2023
2. The New Content
SEO
What we’ll talk about
1. A quick refresher
2. Have keywords ever actually
been a thing Google used?
3. How Google reads content
may not be what you think
4. So what do we do about all
this?
5. Who tf am I?
5. A brief refresher on how Google crawls the Internet
It’s three separate stages: crawl,
index, serve; with sub-processes
for scoring and ranking.
Content analysis is included in the
indexing engine, content relevancy
is in the serving engine.
While this is an old patent (2011) the
fundamentals still apply for this
reminder.
Source: https://patents.google.com/patent/US8572075B1/, retrieved 22 Mar 2023
https://developers.google.com/search/docs/fundamentals/how-search-works
6. The search engine ranking engine works
in systems
● Query Deserves Freshness is a system
● Helpful Content is a system
● MUM & BERT are systems
○ “Bidirectional Encoder Representations from
Transformers (BERT) is an AI system Google uses
that allows us to understand how combinations of
words express different meanings and intent.”
https://developers.google.com/search/docs/appearance/ranking-systems-guide
10. Queries very quickly
become entities
“[...]identifying queries in query data;
determining, in each of the queries,
(i) an entity-descriptive portion that
refers to an entity and (ii) a suffix;
determining a count of a number of
times the one or more queries were
submitted”
- patent granted in 2015, submitted in
2012
Source: https://patents.google.com/patent/US9047278B1/en ; https://patents.google.com/patent/US20150161127A1/
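To make the quoted claim concrete, here is a minimal sketch of splitting queries into an entity-descriptive portion and a suffix, then counting submissions. The entity list, the function names and the prefix-matching rule are all assumptions for illustration; the patent does not spell out an implementation at this level.

```python
from collections import Counter

# Hypothetical entity list; a real system would draw on a far larger entity store.
KNOWN_ENTITIES = ["sydney opera house", "sydney", "earl grey tea"]


def split_query(query, entities=KNOWN_ENTITIES):
    """Return (entity, suffix) if the query starts with a known entity, else None."""
    q = " ".join(query.lower().split())
    # Try longer entities first so "sydney opera house" wins over "sydney".
    for entity in sorted(entities, key=len, reverse=True):
        if q == entity:
            return entity, ""
        if q.startswith(entity + " "):
            return entity, q[len(entity) + 1:]
    return None


def suffix_counts(queries):
    """Count how many times each (entity, suffix) pair was submitted."""
    counts = Counter()
    for query in queries:
        parts = split_query(query)
        if parts is not None:
            counts[parts] += 1
    return counts


queries = [
    "sydney opera house tickets",
    "Sydney Opera House tickets",
    "sydney weather",
    "earl grey tea caffeine",
    "cheap flights",  # no known entity, so it is skipped
]
counts = suffix_counts(queries)
```

Here `counts` maps pairs such as `("sydney opera house", "tickets")` to how often they were submitted, which is the kind of tally the quote describes.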
11. Google acknowledges query-only based
matching is pretty terrible.
“Direct “Boolean” matching of query terms has well known limitations,
and in particular does not identify documents that do not have the query
terms, but have related words [...]The problem here is that conventional
systems index documents based on individual terms, rather than on
concepts. Concepts are often expressed in phrases [...] Accordingly,
there is a need for an information retrieval system and methodology that
can comprehensively identify phrases in a large scale corpus, index
documents according to phrases, search and rank documents in
accordance with their phrases, and provide additional clustering and
descriptive information about the documents. [...]”
- Information retrieval system for archiving multiple document
versions, granted 2017 (link)
12. So it decided to make its search engine
concept- and phrase-based.
“The system is adapted to identify phrases that have
sufficiently frequent and/or distinguished usage in the
document collection to indicate that they are “valid” or “good”
phrases [...]The system is further adapted to identify phrases
that are related to each other, based on a phrase's ability to
predict the presence of other phrases in a document.”
- Information retrieval system for archiving multiple
document versions, granted 2017 (link)
13. “Rather than simply
searching for content that
matches individual words,
BERT comprehends how a
combination of words
expresses a complex idea.”
Source: https://blog.google/products/search/how-ai-powers-great-search-results/
14. MUM takes this a step further
● About 1,000 times more powerful than BERT
● Trained across 75 languages for greater context
● Recognises this across different types of media (video,
text, etc)
https://blog.google/products/search/introducing-mum/
17. BERT is a technique for
pre-training natural
language processing models. So
how does natural language
processing work, once it
has a corpus of data?
Source: https://blog.google/products/search/search-language-understanding-bert/
19. So the broad-strokes steps in the
indexation process are
1. Parsing: tokenisation, parts of speech, stemming
(for Google, lemmatization)
2. Topic modelling: entity detection, relation detection
3. Understanding
4. On to the next engine: ranking
21. How natural language processing usually works: tokenization and subwords
Source: https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html
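For a feel of what subword tokenisation does, here is a minimal sketch of the classic greedy longest-match-first WordPiece approach. The toy vocabulary and the function name are assumptions, and this is the simple textbook version, not the faster algorithm described in the linked post.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"token", "##isation", "##s", "un", "##related", "run", "##ning", "[UNK]"}


def wordpiece(word, vocab=VOCAB):
    """Split one word into subword pieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("tokenisation"))
print(wordpiece("running"))
```

With this vocabulary, "tokenisation" splits into `token` and `##isation`, so a word the model has never seen whole can still be represented by pieces it knows.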
22. This gets broken down even
further
● N-grams: important to find the
primary concepts of the
sentence by identifying and
excluding stop words
● “Running” “runs” “ran” = same
base: “run”
https://patents.google.com/patent/US8423350B1/
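A rough sketch of those two ideas together: drop stop words, map word forms to a shared base, then build n-grams from what is left. The stop-word list and the lemma table are tiny hand-made assumptions; real systems use much larger resources.

```python
import re

# Small illustrative lists; both are assumptions, not Google's actual data.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on"}
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def normalise(text):
    """Lowercase, tokenise, drop stop words and map each token to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = normalise("She runs in the park and ran a marathon")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

After normalising, "runs" and "ran" both become "run", so the bigrams describe concepts rather than surface word forms.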
23. Google does a lot of things when detecting
entities and relationships
● Identifying aspects to define entities based on popularity
and diversity, granted in 2011 (link)
● Finding the entity associated with a query before returning
a result, using input from human quality raters to confirm
objective fact associated with an entity, granted in 2015
(link)
● Understanding the context of the query, entity and related
answer you’re searching for, granted in 2019 (link)
● Aims to understand user generated content signals in
relation to a webpage, granted in 2022 (link)
24. Google does a lot of things when detecting
entities and relationships
● Understanding the best way to present an entity in a
results page, granted in 2016 (link)
● Managing and identifying disambiguation in entities,
granted in 2016 (link)
● Build entities through co-occurring “methodology based
on phrases” and store lower information gain
documents in a secondary index, granted in 2020 (link)
● Understanding context from previous query results and
behaviour, granted in 2016 (link)
25. Step 2
Scoring
In their own description of their
ranking & scoring engine, Google
offers 5 buckets:
● Meaning
● Relevance
● Quality
● Usability
● Context
26. Scoring is all those 200+ factors we talk
about…
Google has cited everything from internal links, external links, pogo sticking, “user
behaviour”, proximity of the query terms to each other, context, attributes, and more
Just a few of the patents related to scoring:
● Evaluating quality based on neighbor features (link)
● Entity confidence (link)
● Search operation adjustment and re-scoring (link)
● Evaluating website properties by partitioning user feedback (link)
● Providing result-based query suggestions (link)
● Multi-process scoring (link)
● Block spam blog posts with “low link-based score” (link)
27. It actually looks like
they have a
classification engine
for entities as well
This patent was filed in 2010,
granted in 2014. Likely a basis
for the Knowledge Graph.
(US8838587B1)
https://patents.google.com/patent/US8838587B1/en
28. “...link structure may be
unavailable, unreliable, or
limited in scope, thus,
limiting the value of using
PageRank in ascertaining
the relative quality of some
documents.” (circa 2005)
https://patents.google.com/patent/US7962462B1/en
29. There’s more than one document scoring function; they are weighted, and have been since the beginning
30. How Google ranks content
● Based on historical behaviour from similar searches in
aggregate (application)
● Based on external links (link)
● Based on your own previous searches (link)
● Based on whether or not it should directly provide the answer via
Knowledge Graph (link)
● Phrase- and entity-based co-occurrence threshold
scores (link)
● Understanding intent based on contextual information
(link)
31. Helpful Content Update & Information
Gain Score (granted Jun 2022)
● The information gain score might be personal to you
and the results you’ve already seen
● Featured snippets may be different from one search to
another based on the information gain score of your
second search
● Pre-training a ML model on a first set of data shown to
users in aggregate, getting an information gain score,
and using that to generate new results in SERPs.
https://patents.google.com/patent/US20200349181A1/en
32. What is “information gain”?
“Information gain, as the ratio of actual co-occurrence rate to
expected co-occurrence rate, is one such prediction
measure. Two phrases are related where the prediction
measure exceeds a predetermined threshold. In that case,
the second phrase has significant information gain with
respect to the first phrase.”
- Phrase-based searching in an information retrieval
system, granted 2009 (link)
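The ratio in that quote can be sketched on a toy corpus. This assumes the expected co-occurrence rate is what you would get if the two phrases appeared independently (the product of their individual document rates), and it uses plain substring matching; the corpus, the function names and the threshold value are illustrative assumptions.

```python
def information_gain(docs, phrase_a, phrase_b):
    """Ratio of actual to expected co-occurrence rate for two phrases."""
    n = len(docs)
    has_a = [phrase_a in d for d in docs]
    has_b = [phrase_b in d for d in docs]
    # Actual rate: share of documents containing both phrases.
    actual = sum(a and b for a, b in zip(has_a, has_b)) / n
    # Expected rate, assuming the phrases occur independently of each other.
    expected = (sum(has_a) / n) * (sum(has_b) / n)
    if expected == 0:
        return 0.0
    return actual / expected


def is_related(docs, phrase_a, phrase_b, threshold=1.5):
    """Treat phrases as related when the gain exceeds a chosen threshold (assumed value)."""
    return information_gain(docs, phrase_a, phrase_b) > threshold


docs = [
    "earl grey tea with bergamot oil",
    "bergamot gives earl grey tea its flavour",
    "green tea brewing guide",
    "sydney seo conference agenda",
]
print(information_gain(docs, "earl grey", "bergamot"))
print(information_gain(docs, "earl grey", "tea"))
```

In this corpus "bergamot" always shows up alongside "earl grey", so the pair scores well above 1.0, while the more generic "tea" scores closer to 1.0. A ratio near 1.0 means the phrases co-occur about as often as chance would predict.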
34. If information gain is such a
strong concept in how
Google chooses which
content to show, why
do so few folks talk about it?
https://patents.google.com/patent/US7962462B1/en
36. When is the last time
you’ve done a full
content inventory?
37. What I mean when I say content inventory
https://www.portent.com/onetrick
38. Redo keyword research and overlay
entities
● Pull content for at least the top 10 search results
ranking for your target keyword
● Dump them into Diffbot (https://demo.nl.diffbot.com/) or
the Natural Language AI demo
(https://cloud.google.com/natural-language)
● Note the entities and salience
● Run your target page
● Understand the differences
● Update your content accordingly
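Once you have noted entities and salience from one of those tools, the "understand the differences" step can be sketched like this. The input is assumed to be hand-copied entity-to-salience numbers (no API call is made here), and the function names, sample values and the 0.05 cut-off are all hypothetical.

```python
from collections import defaultdict


def average_salience(pages):
    """Average each entity's salience across pages; absent entities count as 0."""
    totals = defaultdict(float)
    for entities in pages:
        for name, salience in entities.items():
            totals[name.lower()] += salience
    return {name: total / len(pages) for name, total in totals.items()}


def entity_gaps(competitor_pages, target_entities, min_salience=0.05):
    """List entities prominent on ranking pages but absent from the target page."""
    target = {name.lower() for name in target_entities}
    averages = average_salience(competitor_pages)
    gaps = [
        (name, round(salience, 3))
        for name, salience in averages.items()
        if salience >= min_salience and name not in target
    ]
    # Most prominent missing entities first.
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Hypothetical numbers noted down from the entity tool for two ranking pages.
competitors = [
    {"Earl Grey": 0.5, "bergamot": 0.25, "caffeine": 0.25},
    {"Earl Grey": 0.5, "bergamot": 0.25, "Twinings": 0.02},
]
target_page = {"earl grey": 0.7, "teapot": 0.3}
print(entity_gaps(competitors, target_page))
```

The output is a short list of entities the ranking pages treat as prominent and your page never mentions, which is a starting point for the content update rather than a checklist to stuff in.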
39. Start with keyword research, find
co-occurring terms
● Pull content for at least the top 10 search results
ranking for your target keyword
● Look at TF-IDF calculators to reverse engineer the topic
correlation (Ryte has a paid one)
● Note the terms included
● Run your target page
● Understand the differences
● Update your content accordingly
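If you would rather not rely on a paid calculator, the same comparison can be roughed out by hand. This is a minimal sketch using one common TF-IDF formula (term frequency times log of N over document frequency); the function names and sample texts are assumptions, and commercial tools will weight things differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    token_lists = [tokenize(d) for d in documents]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(documents)
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def missing_terms(competitor_texts, target_text, top_n=5):
    """Terms that score on the ranking pages but never appear on the target page."""
    scores = tf_idf(competitor_texts + [target_text])
    target_terms = set(tokenize(target_text))
    totals = Counter()
    for doc_scores in scores[:-1]:  # the last entry is the target page itself
        for term, score in doc_scores.items():
            if term not in target_terms:
                totals[term] += score
    return [term for term, _ in totals.most_common(top_n)]


competitors = [
    "earl grey tea bergamot flavour",
    "earl grey tea bergamot caffeine",
]
print(missing_terms(competitors, "earl grey tea teapot", top_n=3))
```

Terms that appear in every document score zero under this formula, so the list surfaces distinguishing terms rather than the shared basics.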
40. Break old content habits
● FAQ on product pages
● Consolidate super-granularly targeted blog articles
● Think outside of the blog folder — the semantic
relationship can carry through to the directory order of
the website as well
● Internal linking can be a secret weapon
● Fit content to purpose: not everything needs a 3,000
word in-depth article
43. Amanda King is a human
● Over a decade in the
SEO industry
● Traveled to 40+
countries
● Business- and
product-focussed
● Knows CRO, Data,
UX
● Always open to
learning something
new
● Slightly obsessed
with tea
This is a lot of information and I don’t have all the answers - there’s a lot of patents and patent diving I’ve done, so if things get dry, I apologise. You can do a shot for every time I say “system” or “entity”.
Google is vector based: If search x goes to document a, and document a also contains term b, term b will be added to a list of associated topics for search x.
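That association rule can be sketched in a few lines. This reads "search x goes to document a" as a logged pairing of a query with a document, which is an interpretation rather than something the patent text here confirms; the log format, the names and the sample data are assumptions.

```python
from collections import defaultdict


def associated_topics(query_log, documents):
    """Map each query to terms found in the documents it led to.

    query_log: iterable of (query, doc_id) pairs.
    documents: dict of doc_id -> set of terms in that document.
    """
    topics = defaultdict(set)
    for query, doc_id in query_log:
        query_terms = set(query.lower().split())
        # Every term in the document that is not already in the query
        # becomes an associated topic for that query.
        topics[query] |= documents.get(doc_id, set()) - query_terms
    return dict(topics)


documents = {
    "a": {"earl", "grey", "bergamot", "citrus"},
    "b": {"earl", "grey", "caffeine"},
}
query_log = [("earl grey", "a"), ("earl grey", "b")]
print(associated_topics(query_log, documents))
```

Here the query "earl grey" picks up "bergamot", "citrus" and "caffeine" as associated topics, even though none of them appear in the query itself.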
Originally applied for in 2005, granted in 2010: https://patents.google.com/patent/US7702618B1/en (Google really started to become popular in 2000)
Discussing how they would build their knowledge graph, essentially
Indexing system: 1) identification of phrases and related phrases, 2) indexing of documents with respect to phrases, 3) generation and maintenance of a phrase-based taxonomy.
co-occurrence matrix for the good phrases is maintained
third stage of the indexing operation is to prune the good phrase list using a predictive measure derived from the co-occurrence matrix
Unlike existing systems which use predetermined or hand selected phrases, the good phrase list reflects phrases that actual are being used in the corpus. Further, since the above process of crawling and indexing is repeated periodically as new documents are added to the document collection, the indexing system automatically detects new phrases as they enter the lexicon
The next step is to determine which related phrases together form a cluster of related phrases. A cluster is a set of related phrases in which each phrase has high information gain with respect to at least one other phrase. In one embodiment, clusters are identified as follows.
“ First, rather than a strictly—and often arbitrarily—defined hierarchy of topics and concepts, this approach recognizes that topics, as indicated by related phrases, form a complex graph of relationships, where some phrases are related to many other phrases, and some phrases have a more limited scope, and where the relationships can be mutual (each phrase predicts the other phrase) or one-directional (one phrase predicts the other, but not vice versa). The result is that clusters can be characterized “local” to each good phrase, and some clusters will then overlap by having one or more common related phrases.”
“The indexing of documents by phrases and use of the clustering information provides yet another advantage of the indexing system, which is the ability to determine the topics that a document is about based on the related phrase information.”
There’s also PaLM, CALM and LaMDA (one Google engineer even claimed LaMDA was sentient)
This is where content analysis is included
BERT comes in during the topic modelling phase, it’s not the entirety of the indexation process.
Define corpus - the documents on the internet they can crawl
Remember natural language processing is not unique to Google. There are entire fields dedicated to it, it’s an entire branch of AI and computational linguistics.
The semantic distance between words can be estimated as the number of vertices that connect the two words.
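As a rough illustration, semantic distance on a word graph can be estimated with a breadth-first search. This sketch counts the links on the shortest path between two words, which is one common convention; the graph itself is a made-up example.

```python
from collections import deque

# Hypothetical word graph: each word maps to the words it is directly linked to.
GRAPH = {
    "tea": {"earl grey", "drink"},
    "earl grey": {"tea", "bergamot"},
    "bergamot": {"earl grey", "citrus"},
    "citrus": {"bergamot"},
    "drink": {"tea"},
}


def semantic_distance(graph, start, goal):
    """Number of links on the shortest path between two words, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbour in graph.get(word, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


print(semantic_distance(GRAPH, "tea", "citrus"))
```

In this graph "tea" is one hop from "earl grey" but three hops from "citrus", so "earl grey" counts as the semantically closer word.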
Tokenisation is essentially converting a sentence into “tokens” to turn an unstructured string into elements that can be understood by machine learning. Google has found shortcuts in the WordPiece tokenisation used by BERT, through matching and skipping, allowing text to be tokenised about 5x faster than with previous tokenisers.
Popularity score - search history frequency, click through rate, dwell time; diversity score is based on how similar the unranked document is to already ranked documents.
Based on historical behaviour from similar searches in aggregate (application)
“The system may also comprise a profile database that stores profiles associated with specific remote devices for use by the results ranker in ordering the categories. In addition, the system may comprise a relevance filter that stores data about other search queries received from other remote devices, the data including distributions of previously determined correlations between the other search queries and one or more different categories of information.”
Based on your own previous searches (link)
How quickly you went from choosing one result to another
Whether or not you go back to the same source multiple times over time
Whether you choose a particular result more than the general population
Your declared demographics
Your declared location (link)
If you’ve made a bunch of the same types of searches (weather in britain, weather in spain), “sibling scores” (link)
Whether or not it should directly provide the answer via Knowledge Graph (link)
Whether or not it should have a zero result with a quick fact (link)
Whether or not text or another presentation of information makes sense (link)
Whether or not to return a “card”, like for movies showing at a particular theatre (link)
Raising the threshold over 1.0 serves to reduce the possibility that two otherwise unrelated phrases co-occur more than randomly predicted
Don’t have the answer for you there, I just like posing rhetorical questions.
This process is manual, but hopefully before the end of the financial year I’ll have a more automated process you can steal
What is entity salience? Entity salience refers to the prominence of an entity within the content. Entity research and entity salience tell you what the pages that are ranking talk about; co-occurring terms tell you what Google expects folks to talk about. Sometimes there’s a gap.
Google uses TF-IDF to assign terms to an entity, amongst many other things. https://patents.google.com/patent/US8589399B1/en
So why don’t we use TF-IDF to reverse engineer that?
This isn’t about keyword density
Adding FAQs (ongoing): leading indicators are strong, with product pages that have FAQs tracking about 83% better YoY in organic traffic than the overall product category (-1.7% v -10% YoY)
Blog consolidation: redirected about 60% of blog content - maintained traffic parity with overall organic traffic to the website: win for the business (less overhead)
Thinking outside the blog folder: Optus — 24% uplift in conversion when content was a part of the user journey