Machine Learning Shaping Google and Technical SEO

JR Oakes | @jroakes #TechSEOBoost
Fun with Machines. How Machine
Learning is Shaping Google and
Technical SEO

About Me
• Studied Industrial Design at NCSU
• Worked as an architectural glass
artist for 10 years.
• Was Lead Developer and then
Director of Strategy for medium-
sized agency with 100+ clients
worldwide.
• Work as Director, Technical SEO
for Adapt.

I have a problem with tf-idf

About TF-IDF
TF-IDF is very hand-wavy and
sounds very fancy, but is not the
magic elixir to DOMINATING ON
GOOGLE.

About TF-IDF
It is not even the best information retrieval (IR)
algorithm.
BM25 also accounts for document length and, in its
various iterations, for other factors such as
term-frequency saturation.
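To make the difference concrete, here is a minimal sketch that scores one term with plain TF-IDF and with the textbook Okapi BM25 formula. The toy documents are invented, and `k1=1.2` and `b=0.75` are commonly used defaults, not values taken from any search engine discussed here.

```python
import math

def tf_idf(term, doc, docs):
    """Plain TF-IDF: raw term count times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def bm25(term, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 for a single term (k1 and b are assumed defaults)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Documents longer than average are penalised, shorter ones boosted.
    length_norm = 1 - b + b * len(doc) / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Invented toy corpus: both "tea" documents mention the term exactly once.
short_doc = "green tea benefits".split()
long_doc = ("green tea " + "filler " * 20).split()
other_doc = "coffee brewing guide".split()
docs = [short_doc, long_doc, other_doc]

tfidf_short, tfidf_long = tf_idf("tea", short_doc, docs), tf_idf("tea", long_doc, docs)
bm25_short, bm25_long = bm25("tea", short_doc, docs), bm25("tea", long_doc, docs)
```

With raw counts, TF-IDF gives both documents the same score for "tea", while BM25 ranks the short, focused document above the padded one.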

About TF-IDF
https://wikimedia-research.github.io/Discovery-Search-Test-BM25/
https://wikimedia-research.github.io/Discovery-Search-Test-InterleavedLTR/

About TF-IDF
The Search Platform Team has been working on improving search on
Wikimedia projects with machine learning. Machine learned-ranking (MLR)
enables us to rank relevance of pages using a model trained on implicit and
explicit judgements. In the first test of the learning-to-rank (LTR) project, we
evaluated the performance of a click-based model on users searching English
Wikipedia. We found that users were slightly more likely to engage with MLR-
provided results than with BM25 results (assessed via the clickthrough rate
and a preference statistic). We also found that users with machine learning-
ranked results were statistically significantly more likely to click on the first
search result first than users with BM25-ranked results, which indicates that
we are onto something. The next step for us is to evaluate the model’s
performance on Wikipedia in other languages.

About TF-IDF
Wikimedia Research released their first model on GitHub last month.
MjoLniR – our Python and Spark-based library for handling the
backend data processing for Machine Learned Ranking at
Wikimedia.
https://github.com/wikimedia/search-MjoLniR/tree/master/mjolnir

About TF-IDF
We are WAY beyond TF-IDF. TF-IDF seems to work because it causes you to
look for related phrases, but it is not a very good relevance metric. It is a
keyword frequency metric.

How is Google Using Machine
Learning?

Was Larry Kim right?

CTR As A Ranking Factor

Potentially:
• Clicks – “For our click model we use a generalization of the Position-Based
Model (PBM) [9], at the core of which lies an examination hypothesis,
stating that in order to be clicked a document has to be examined and
attractive.”
• Attention – What if users get the information they need directly from
the SERP (answer boxes) without a click? How do we know they were
satisfied?
• Satisfaction – “While looking at the reasons specified by the raters we
found out that 42% of the raters who said that they would click through on
a SERP indicated that their goal was ‘to confirm information already
present in the summary.’” So additional clicks don’t necessarily mean a
poor initial result.
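The examination hypothesis quoted above can be sketched in a few lines: a click requires that the result is both examined and attractive. The examination probabilities below are hypothetical numbers chosen for illustration; they are not taken from the cited paper or from Google.

```python
# Hypothetical probability that a user examines each SERP position.
examination = {1: 0.9, 2: 0.6, 3: 0.4}

def click_probability(rank, attractiveness, examination_by_rank):
    """Position-based model: P(click) = P(examined at rank) * P(attractive)."""
    return examination_by_rank[rank] * attractiveness

def estimate_attractiveness(clicks, impressions, rank, examination_by_rank):
    """Invert the model: divide observed CTR by the examination probability."""
    ctr = clicks / impressions
    return min(1.0, ctr / examination_by_rank[rank])

# A result at position 3 with 120 clicks on 1,000 impressions.
p_click_top = click_probability(1, 0.5, examination)
attractiveness_pos3 = estimate_attractiveness(120, 1000, 3, examination)
```

The point of the inversion is that a 12% CTR at position 3 can indicate a more attractive result than the raw number suggests, because fewer users ever look that far down.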

Machine Learning in its simplest form takes:
1. Input features
2. An algorithm that processes the features (most often) in a linear, non-
linear, or tree-based way to make a prediction.
3. And an evaluation metric that compares the prediction to your “ground
truth” data.
It is technically possible that CTR and/or Quality Rater data provide the
ground truth.
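The three ingredients can be shown with a toy example. This is a minimal sketch using invented data: the two features (a click-through rate and a rater score) and the labels are hypothetical, and logistic regression is just one simple choice of algorithm.

```python
import math

# 1. Input features: hypothetical [click-through rate, rater score] per result.
features = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
# "Ground truth" labels: 1 = relevant, 0 = not relevant (invented).
ground_truth = [0, 0, 1, 1]

def predict(weights, bias, x):
    """2. The algorithm: a linear combination squashed to a probability."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + math.exp(-z))

def train(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights with plain stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def accuracy(weights, bias, features, labels):
    """3. The evaluation metric: share of predictions matching ground truth."""
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y)
               for x, y in zip(features, labels))
    return hits / len(labels)

weights, bias = train(features, ground_truth)
score = accuracy(weights, bias, features, ground_truth)
```

Swap the labels for click data or rater judgements and the same loop applies; what changes is the ground truth the model is rewarded for matching.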

The problem is:
We don’t have the ground truth, we don’t know the features, and we sure as
hell have no idea what is in here:

We know that it probably depends on:
• Click-through-rate
• Context models
• Ground-truth quality (Quality Rater Guidelines)
• And other standard factors.

Storytelling

Storytelling
Using Generative Adversarial Networks to train machines how to see the
storylines in news events.
https://www.ijcai.org/proceedings/2017/0554.pdf

LSTMs

LSTMs
We would also guess that LSTMs (with attention) play some role in RankBrain,
given their state-of-the-art ability to pick up referential information in text
well beyond traditional bag-of-words (BOW) models.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

What should we focus on?

Query Disambiguation

Very little information in the query and
a lot of information in the possible
results.

Google tries to give us a nudge.

What a strong hint to
consider when thinking about
what needs to be included
on a page discussing:
Lipton Tea
Also a very strong hint at
potential navigation.

AT&T does an amazing job
at this.

Semantic Relevance

Semantic Relevance
Bill Slawski (as always) is spot on.

Semantic Relevance
Going back to the patent from Google
in 2014 (“Integrated external related
phrase information into a phrase-based
indexing information retrieval
system”), we see that there is a
marked gain in the significance of
phrases on a page based on
additional semantically related
qualifying phrases.

Semantic Relevance
There are many ways to handle this on a
page level.

Semantic Relevance
But, this really starts much sooner by
trying to discover content / intent
categories that your site is relevant
for to even start the process of
building out relevant content
categories for your visitors.
https://anaconda.org/jroakes/cluster-share/notebook

Semantic Relevance
The prior notebook ingests your
keywords, models them to vector
space, and then runs k-means to
group the keywords into relevance
clusters.
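As a rough illustration of that pipeline, here is a minimal sketch with no external libraries. It is not the notebook's code: the bag-of-words vectors stand in for whatever vector model the notebook uses, the keywords are invented, and the evenly spaced starting centroids are a simplification to keep the run deterministic.

```python
def vectorize(keywords):
    """Map each keyword to a word-count vector (a stand-in for embeddings)."""
    vocab = sorted({w for kw in keywords for w in kw.split()})
    return [[float(kw.split().count(w)) for w in vocab] for kw in keywords]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=10):
    """Plain k-means; returns one cluster label per input vector."""
    # Simplified, deterministic start: evenly spaced input vectors.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid wins.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

keywords = [
    "green tea benefits", "black tea benefits", "iced tea recipe",
    "cell phone plans", "unlimited phone plans", "family cell phone plans",
]
labels = k_means(vectorize(keywords), k=2)
```

Here the tea keywords and the phone-plan keywords end up in separate clusters, which is the kind of content / intent grouping the notebook is after.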

Semantic Relevance
Note this goes well beyond term-
frequency.

Semantic Relevance
Skip-gram models capture the
probability of co-occurrence across
large corpora, which is much closer
to what Google does than simple
tf-idf.
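For intuition, here is a minimal sketch of the raw material a skip-gram model learns from: (center word, context word) pairs drawn from a sliding window. It covers only pair generation and counting, not the neural model itself, and the two-line corpus and window size are invented for illustration.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Invented toy corpus.
corpus = ["lipton green tea bags", "lipton iced tea mix"]
cooccurrence = Counter(
    pair for line in corpus for pair in skipgram_pairs(line.split())
)
```

Words that keep showing up in each other's windows (here "lipton" and "tea") end up close together in vector space, regardless of how often either appears on a single page.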

We should also care about click
satisfaction.

Click Satisfaction

Click Satisfaction
Work hard to ensure that your pages get the clicks. H/T to @fighto for the
excellent article here:
https://searchengineland.com/alert-abnormal-organic-ctr-detected-automatic-detection-poorly-performing-meta-data-280290
https://anaconda.org/jroakes/ctr_anamolies_share/notebook
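One simple way to frame the problem, sketched below, is to compare each page's CTR with the median CTR of pages ranking at the same position and flag the ones that fall well short. This is an assumption-laden illustration, not the method from the article or the notebook: the rows, the `ratio` threshold, and the grouping by rounded position are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def flag_low_ctr(rows, ratio=0.5):
    """rows: (url, avg_position, clicks, impressions) tuples.

    Returns URLs whose CTR is below `ratio` times the median CTR of
    pages sharing the same rounded average position.
    """
    by_position = defaultdict(list)
    for url, position, clicks, impressions in rows:
        by_position[round(position)].append((url, clicks / impressions))
    flagged = []
    for group in by_position.values():
        expected = median(ctr for _, ctr in group)
        flagged += [url for url, ctr in group if ctr < ratio * expected]
    return flagged

# Invented search-performance rows.
rows = [
    ("/a", 1.2, 300, 1000),
    ("/b", 0.9, 280, 1000),
    ("/c", 1.4, 90, 1000),
    ("/d", 3.1, 80, 1000),
    ("/e", 2.8, 70, 1000),
]
underperformers = flag_low_ctr(rows)
```

Pages flagged this way are candidates for a title and meta description rewrite. The sketch assumes every row has at least one impression.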

We should also care about content
deduplication.

Content Deduplication
https://anaconda.org/jroakes/duplicate_detection_with_shingling_share/notebook
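The core idea behind shingling can be sketched briefly: break each page into overlapping word n-grams (shingles) and measure how many the two pages share. This is a minimal illustration, not the notebook's code; the shingle size of 3 and the sample page texts are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + size])
            for i in range(max(1, len(words) - size + 1))}

def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented page texts.
page_a = "our store sells loose leaf green tea online"
page_b = "our store sells loose leaf black tea online"
page_c = "compare unlimited family phone plans today"

near_duplicate = jaccard(shingles(page_a), shingles(page_b))
unrelated = jaccard(shingles(page_a), shingles(page_c))
```

Pairs scoring above a threshold you choose are candidates for consolidation or canonicalization. On large sites, comparing every pair directly gets expensive, which is where techniques such as MinHash are typically brought in.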

Wrapping Up
It is very difficult to gain intuition into how Google works based solely on external
data. The reality is that context, machine learning, and click data allow for the
building of models that humans cannot easily understand.
We wanted to move the conversation away from simplistic keyword mechanisms
and towards an understanding that semantics and context are much more
valuable to ranking.
