@basgr from @peakaceag1
Advanced data-driven technical SEO
Merging your logfiles, GA, GSC & web crawl data for better SEO insights
Bastian Grimm, Peak Ace AG | @basgr
SMX London 2019
@basgr from @peakaceag2
And why are log files important for your SEO work?
Why should you care?
pa.ag@peakaceag @basgr from @peakaceag3
I am a big fan of the various crawling tools, but…
It’s only the access log files that demonstrate
how a search engine’s crawler is behaving on
your site; all crawling tools are simply trying to
simulate their behaviour!
@basgr from @peakaceag4
You need to see which pages are being prioritised by the search
engines and should therefore be considered the most important
1. Understand crawl priorities
@basgr from @peakaceag5
Google may reduce its crawling behaviour/frequency & eventually rank
you lower if you are constantly serving a large number of errors
2. Prevent reduced crawling
@basgr from @peakaceag6
It’s essential to identify any crawl shortcomings
(such as hierarchy or internal link structure)
with potential site-wide implications
3. Understand global issues
@basgr from @peakaceag7
You need to ensure that Google crawls everything important:
primarily ranking-relevant content, but also fresh & older items
4. Ensure proper crawling
@basgr from @peakaceag8
It’s important to ensure that any gained link equity will
always be passed using proper links and/or redirects
5. Ensure proper linking
@basgr from @peakaceag9
Keep in mind, details depend on the individual setup!
The characteristics of a log file
@basgr from @peakaceag10
…depending on your webserver (Apache, nginx, IIS, etc.), caching
and its configuration. Make sure to understand your setup first!
Content & structure can vary…
pa.ag@peakaceag @basgr from @peakaceag11
What does a log file usually look like?
1. Server IP/host name
2. Timestamp (date & time)
3. Method (GET/POST/HEAD)
4. Request URL
5. HTTP status code
6. Size in bytes
7. Referrer
8. User-agent
188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "http://www.wikipedia.org/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
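To make the format above concrete, here is a minimal Python sketch that parses one such line into the fields listed; the regex assumes exactly the combined-style layout shown above, so adjust it for your own server's log format.

```python
import re

# Pattern for the sample layout above: host, timestamp, request, status,
# size, referrer, user-agent. Real log formats vary by server configuration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>GET|POST|HEAD) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Return the named fields as a dict, or an empty dict if no match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else {}

sample = (
    '188.65.114.xxx [21/May/2019:02:00:00 -0100] '
    '"GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 '
    '"http://www.wikipedia.org/" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
fields = parse_log_line(sample)
print(fields["status"], fields["url"])
```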
@basgr from @peakaceag12
Log file data can be quite overwhelming because you can do so
many different things; make sure you’ve got
your questions prepared!
You need to ask the right questions!
pa.ag@peakaceag @basgr from @peakaceag13
Log file data can differ from, e.g., Google Analytics data
While log files are direct, server-side pieces of information, Google Analytics uses
client-side code. As the data sets come from two different sources, they can differ!
The configuration within Google Analytics (e.g. filters) also leads to data differences
when compared to the log files!
@basgr from @peakaceag14
Be cautious when requesting log files from your clients
Frequently asked questions
@basgr from @peakaceag15
We only care about crawlers such as Google and Bing; no need for any
user data (operating system, browser, phone number, usernames, etc.)
1. Personal information in logs?
@basgr from @peakaceag16
If you are running a cache server and/or a CDN which
creates logs elsewhere, we will also need these logs
2. Separate multi-location logs?
@basgr from @peakaceag17
There are different ways you could approach this:
Log file auditing tools
pa.ag@peakaceag @basgr from @peakaceag18
There are different ways you could approach this:
pa.ag@peakaceag @basgr from @peakaceag19
Do-it-yourself solution based on Excel
You’d have to manually build filtering, cross-references, etc. – it just doesn’t scale!
pa.ag@peakaceag @basgr from @peakaceag20
Screaming Frog Log File Analyser
Beginner-level, desktop-based log file auditing with pre-defined reports.
@basgr from @peakaceag21
No sharing capabilities, log files need to be manually up/downloaded,
which is usually problematic for larger files, etc.
Desktop solutions are limited
pa.ag@peakaceag @basgr from @peakaceag22
Splunk or Sumo Logic: proprietary, paid software solutions
Enterprise tools such as Splunk usually come with a hefty (volume-based) price tag.
In all fairness though: these solutions offer features way beyond log file monitoring!
Image sources: https://pa.ag/2srgTZu (splunk) & https://pa.ag/2JcuiLt (sumologic)
pa.ag@peakaceag @basgr from @peakaceag23
The Elastic Stack (ELK): Elasticsearch, Logstash & Kibana
Elasticsearch: search & analytics engine, Logstash: server-side data processing
pipeline, Kibana: data visualisation (charts, graphs, etc.) – all open source.
Image source: https://pa.ag/2JbFUhP
pa.ag@peakaceag @basgr from @peakaceag24
Other SaaS solutions: logrunner.io, logz.io (ELK) & Loggly
logrunner.io in particular has a strong focus on SEO-based auditing (dashboards etc.).
pa.ag@peakaceag @basgr from @peakaceag25
crawlOPTIMIZER: SaaS log file auditing, made in Vienna
Top USP: dedicated evaluations of BRPs (Business Relevant Pages).
@basgr from @peakaceag26
No messing around with exports, up/downloads, easy sharing
capabilities and the ability to deal with massive volumes, etc.
The beauty of SaaS: almost real time
@basgr from @peakaceag27
For an easy start: trend monitoring (over time) & gathering insights
Let’s have a look at some data
pa.ag@peakaceag @basgr from @peakaceag28
Most obvious approach: spotting anomalies vs. time frame
Tip: this is why it makes a lot of sense to check your log files regularly (e.g. daily).
This looks unusual; take it
as a starting point for
further investigation.
pa.ag@peakaceag @basgr from @peakaceag29
User crawling frequencies over time
Understanding patterns and irregularities can be very helpful - always look at the crawl
behaviour of individual user-agents over time.
@basgr from @peakaceag30
Use log files to look for spam bots or scrapers to block!
What other "bots" access your site?
pa.ag@peakaceag @basgr from @peakaceag31
Not everyone is who they claim to be!
The easiest way to detect if Googlebot really is Googlebot: run a reverse DNS lookup.
Bingbot can also be verified via *.search.msn.com.
Source: https://pa.ag/2JqOk8d
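The two-step check described above can be sketched with the Python standard library: reverse-resolve the requesting IP, check that the host sits under googlebot.com or google.com, then forward-resolve that host and confirm it maps back to the same IP. This sketch needs live DNS access; Bing's bots can be verified the same way against *.search.msn.com.

```python
import socket

def verify_googlebot(ip: str) -> bool:
    """Reverse DNS lookup, domain check, then forward-confirm the host."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record: cannot be a verified Googlebot
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
    return ip in forward_ips
```

Run this only on IPs whose user-agent claims to be Googlebot; anything failing the check is a candidate for blocking.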
pa.ag@peakaceag @basgr from @peakaceag32
What are the most crawled Googlebot pages?
Also, verify if they coincide with your domains’ most important ones.
Ask yourself: are these really
your most valuable pages?
pa.ag@peakaceag @basgr from @peakaceag33
Breakdown of crawl requests & status codes per directory
You’d easily see if one of your main directories encountered crawling/response issues.
Tip: establish this on a regular basis to ensure continued performance of top directories.
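A per-directory breakdown like this can be sketched in a few lines of Python; the (path, status) tuples below stand in for parsed log entries and are made-up examples.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def requests_per_directory(entries):
    """entries: iterable of (url_path, status_code) tuples from parsed logs.
    Returns {top-level directory: Counter({status: count})}."""
    breakdown = defaultdict(Counter)
    for path, status in entries:
        path = urlsplit(path).path            # drop any query string
        parts = [p for p in path.split("/") if p]
        directory = "/" + parts[0] + "/" if parts else "/"
        breakdown[directory][status] += 1
    return dict(breakdown)

crawl_log = [
    ("/products/shoes/", 200),
    ("/products/bags/?sort=price", 200),
    ("/products/old-item/", 404),
    ("/blog/seo-tips/", 200),
]
print(requests_per_directory(crawl_log))
# {'/products/': Counter({200: 2, 404: 1}), '/blog/': Counter({200: 1})}
```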
@basgr from @peakaceag34
And respective actions based on those findings
Advanced auditing for SEO
@basgr from @peakaceag35
1. Redirects
pa.ag@peakaceag @basgr from @peakaceag36
Identify any kind of "wrong" redirect: 302/304/307/308
Action: change to 301 (except geo redirects); also watch out for redirect chains!
Investigate further to
see what’s in there
@basgr from @peakaceag37
2. Crawl errors
pa.ag@peakaceag @basgr from @peakaceag38
4xx client errors: too many are a sign of poor site health
Action: recover (200), redirect (301) or kill off entirely (410)
pa.ag@peakaceag @basgr from @peakaceag39
Googlebot can't log in… (403: Forbidden)
If it's linked, Google will try to crawl it; they are greedy!
pa.ag@peakaceag @basgr from @peakaceag40
5xx server errors: usually infrastructure-related
Action: watch closely and/or talk to IT (server availability, high load, etc.)
Check consistency; what
happens when re-trying?
@basgr from @peakaceag41
3. Crawl priority
pa.ag@peakaceag @basgr from @peakaceag42
Understanding the most/least crawled URLs and folders
Action: highly crawled pages/folders could be used for additional internal linking
(e.g. add link hubs); low-crawled areas need to be linked more prominently.
Can be used for additional, internal linking (improve
discovery of other content)
Clearly weak, either irrelevant (remove) or requires
more attention
@basgr from @peakaceag43
4. Last crawled
pa.ag@peakaceag @basgr from @peakaceag44
Investigate if (new) URLs have been crawled at all
Action: if relevant URLs haven’t been discovered/crawled at all, your internal linking is
probably too weak. Consider XML sitemaps, better/more prominent linking, etc.
If these are important URLs,
you might have a problem!
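A quick way to sketch this check: diff the URLs listed in your XML sitemaps against the URLs Googlebot actually requested in the logs (both lists below are made-up examples).

```python
def never_crawled(sitemap_urls, crawled_urls):
    """URLs in your XML sitemaps that never appear in Googlebot log
    entries - candidates for stronger internal linking."""
    return sorted(set(sitemap_urls) - set(crawled_urls))

sitemap = ["/new-category/", "/new-category/item-1/", "/about/"]
crawled = ["/about/", "/blog/"]
print(never_crawled(sitemap, crawled))
# ['/new-category/', '/new-category/item-1/']
```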
@basgr from @peakaceag45
5. Crawl waste
pa.ag@peakaceag @basgr from @peakaceag46
I'm sure you've all seen this?
Source: https://pa.ag/2LUnt2R
pa.ag@peakaceag @basgr from @peakaceag47
This is what the Google Webmaster Central blog says:
Source: https://pa.ag/2HhsYoz
Wasting server resources on pages […] will
drain crawl activity from pages that do actually
have value, which may cause a significant
delay in discovering great content on a site.
pa.ag@peakaceag @basgr from @peakaceag48
If you have ever had to deal with sites like these…
Properly dealing with >30,000,000 crawlable URLs (due to parameter usage) certainly
makes a difference in organic performance!
pa.ag@peakaceag @basgr from @peakaceag49
URL parameters cause most problems
(Combined) URL parameters often generate millions of unnecessary URLs, especially for
large domains, which Googlebot diligently crawls (once found).
pa.ag@peakaceag @basgr from @peakaceag50
URL parameter behaviour over time
Constantly be on the lookout for new parameters as well as significantly increased
crawling for known parameters.
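A simple sketch for this kind of parameter monitoring: count parameter names across the crawled URLs in one log period, then compare snapshots between periods to spot new or rapidly growing parameters (the example URLs are hypothetical).

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def parameter_frequency(requested_urls):
    """Count how often each URL parameter name appears across crawled URLs."""
    counts = Counter()
    for url in requested_urls:
        for name, _ in parse_qsl(urlsplit(url).query):
            counts[name] += 1
    return counts

urls = [
    "/shoes/?color=red&size=42",
    "/shoes/?color=blue",
    "/shoes/?sessionid=abc123",
]
print(parameter_frequency(urls).most_common())
# [('color', 2), ('size', 1), ('sessionid', 1)]
```

A session-ID parameter showing up here, as in the last example URL, is a classic source of crawl waste.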
pa.ag@peakaceag @basgr from @peakaceag51
A brief overview: #SMXInsights
01 No one-size-fits-all solution: log file size, quantity & availability are all decisive with regards to tool selection.
02 Preparation is key: concrete questions help to generate efficient analysis.
03 Crawl data only: be precise with your requests (to the IT department); you just want to know what the search engines are doing!
04 Use reverse DNS: not every crawler is who they pretend to be; do not "blindly" trust the user-agent string.
05 URL parameters: these are almost always the biggest problem (combinations, order, consistency); audit them first.
@basgr from @peakaceag52
Oh yeah, there’s one more thing …
@basgr from @peakaceag53
I want: no IT involvement, unlimited scalability, flexible reporting, multiple
(API) data sources and ease of use!
There's got to be another way!
@basgr from @peakaceag54
(And everyone at #SMX gets this as a gift - for free!)
We've thought of something:
pa.ag@peakaceag @basgr from @peakaceag55
Say hello to the Peak Ace log file auditing stack
Log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and
visualised in Data Studio via the BigQuery Connector.
[Diagram: numbered flow in which log files plus the GA API v4, GSC API v3 and DeepCrawl API feed in (partly via Google Apps Script); data is imported to Google Cloud Storage, transmitted through Google Dataprep to Google BigQuery, and displayed in Google Data Studio]
@basgr from @peakaceag56
Individual reports, tailored to your needs
And what do the results look like?
pa.ag@peakaceag @basgr from @peakaceag57
pa.ag@peakaceag @basgr from @peakaceag58
pa.ag@peakaceag @basgr from @peakaceag59
@basgr from @peakaceag60
Connect and conquer…
How does it work?
pa.ag@peakaceag @basgr from @peakaceag61
#1 Log file data from web servers, CDN, cache, etc.
How often do bots actually crawl? What do they crawl and when?
Source: https://pa.ag/2zs9lcY
Goal: improve site architecture by
analysing real bot crawling data.
▪ Amount of crawls/requests by bot type
▪ Identification of crawling patterns
▪ Overview of errors
▪ 3xx
▪ 4xx
▪ 5xx
Log files >> Google Cloud Storage: import as text files (exclude IP addresses!)
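One way to honour the "exclude IP addresses!" requirement before uploading: mask the last octet of every IPv4 address in each log line. A minimal sketch; a real pipeline should also cover IPv6 and any other personal data your logs might contain.

```python
import re

# Four dot-separated octet groups; the first three are kept, the last masked.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.\d{1,3}\b")

def mask_ips(line: str) -> str:
    """Replace the last octet of any IPv4 address so no complete client IP
    reaches Cloud Storage."""
    return IPV4.sub(r"\1.\2.\3.xxx", line)

print(mask_ips('188.65.114.42 [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512'))
# 188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512
```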
@basgr from @peakaceag62
15 TB (in a single file) to be pushed into BigQuery
Size is absolutely NOT an issue
@basgr from @peakaceag63
nginx / Apache / etc. >> fluentd >> BigQuery
Stand-alone files are messy, agreed.
pa.ag@peakaceag @basgr from @peakaceag64
#2 Google Analytics API
Enrich reports with traffic, engagement, behavioural and page speed data
Goal: compare crawling behaviour with user & loading
time data.
URL-based data on important engagement metrics:
▪ Sessions
▪ Users
▪ Bounce rate
▪ Session duration
▪ Avg. time on page
▪ Avg. server response time
▪ Avg. page load time
▪ …
Google Analytics
Reporting API v4
pa.ag@peakaceag @basgr from @peakaceag65
#3 Google Search Console API
Organic search performance data directly from Google
Goal: compare crawling behaviour with organic click
data & e.g. retrieve reported crawling errors.
Organic click data
▪ Clicks
▪ Impressions
▪ Device
▪ …
URL-based server response data
▪ Status code
Google Search
Console API v3
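As a hedged sketch, a Search Analytics request body for the GSC API v3 could look like this; the commented-out call assumes an authorised google-api-python-client `service` object, which needs OAuth credentials set up separately.

```python
def build_gsc_query(start_date: str, end_date: str, row_limit: int = 25000) -> dict:
    """Build a Search Analytics query body: URL-level clicks/impressions
    broken down by device for the given date range."""
    return {
        "startDate": start_date,           # ISO format: YYYY-MM-DD
        "endDate": end_date,
        "dimensions": ["page", "device"],
        "rowLimit": row_limit,             # the API caps each response at 25,000 rows
    }

body = build_gsc_query("2019-05-01", "2019-05-31")
# response = service.searchanalytics().query(
#     siteUrl="https://www.example.com/", body=body).execute()
# for row in response.get("rows", []):
#     print(row["keys"], row["clicks"], row["impressions"])
print(body["dimensions"])
```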
pa.ag@peakaceag @basgr from @peakaceag66
#4 DeepCrawl API
Website architecture, status codes, indexing directives, etc.
Goal: capture indexing directives, response codes and
more.
DeepCrawl
API
pa.ag@peakaceag @basgr from @peakaceag67
#5 Google Apps Script for GA, GSC & DeepCrawl
API access: capture multiple dimensions and metrics from GA, retrieve GSC crawl and
search analysis data, and pull DeepCrawl crawl & analysis data
Source: https://pa.ag/2OWnjJa
Goal: send data (via/from the respective API) to BigQuery
and store the data there.
Google Apps Script
pa.ag@peakaceag @basgr from @peakaceag68
#6 Google Cloud Dataprep
Clean and process the data. Afterwards, combine these various sources with
several joins so that they're ready for visualisation.
Source: https://pa.ag/2Q6rEde
Goal: combine data from log files, GSC, GA & DeepCrawl
within/by using processing flows.
Dataprep: "Excel with super rocket fuel"
▪ Amazing RegEx support
▪ Select data, receive automated
proposals for processing
▪ Join data sources by e.g. full
inner/outer join, left/right outer join…
@basgr from @peakaceag69
And use Google Data Studio to visualise:
Save everything to BigQuery
pa.ag@peakaceag @basgr from @peakaceag70
pa.ag@peakaceag @basgr from @peakaceag71
pa.ag@peakaceag @basgr from @peakaceag72
pa.ag@peakaceag @basgr from @peakaceag73
pa.ag@peakaceag @basgr from @peakaceag74
Log file auditing is not a project, but a process!
Integrate log file auditing into your regular
SEO workflow; one-off audits are good to
begin with, but they really become invaluable
if you combine them with web crawl data and
perform them on an on-going basis.
pa.ag@peakaceag @basgr from @peakaceag75
bg@pa.ag
Slides? No problem:
https://pa.ag/smxl19logs
You want our log file setup (for free)?
e-mail us > hi@pa.ag
Bastian Grimm
twitter.com/peakaceag
facebook.com/peakaceag
www.pa.ag
ALWAYS LOOKING FOR TALENT! CHECK OUT JOBS.PA.AG
Advanced data-driven technical SEO - SMX London 2019

  • 1. @basgr from @peakaceag1 Bastian Grimm, Peak Ace AG | @basgr Merging your logfiles, GA, GSC & web crawl data for better SEO insights Advanced data-driven technical SEO
  • 2. @basgr from @peakaceag2 And why are log files important for your SEO work? Why should you care?
  • 3. pa.ag@peakaceag @basgr from @peakaceag3 I am a big fan of the various crawling tools, but… It’s only the access log files that demonstrate how a search engine’s crawler is behaving on your site; all crawling tools are simply trying to simulate their behaviour!
  • 4. @basgr from @peakaceag4 You need to see which pages are being prioritised by the search engines and should therefore be considered the most important 1. Understand crawl priorities
  • 5. @basgr from @peakaceag5 Google may reduce its crawling behaviour/frequency & eventually rank you lower if you are constantly serving a large number of errors 2. Prevent reduced crawling
  • 6. @basgr from @peakaceag6 It’s essential to identify any crawl shortcomings (such as hierarchy or internal link structure) with potential site-wide implications 3. Understand global issues
  • 7. @basgr from @peakaceag7 You need to ensure that Google crawls everything important: primarily ranking, relevant content, but also fresh & older items 4. Ensure proper crawling
  • 8. @basgr from @peakaceag8 It’s important to ensure that any gained link equity will always be passed using proper links and/or redirects 5. Ensure proper linking
  • 9. @basgr from @peakaceag9 Keep in mind, details depend on the individual setup! The characteristics of a log file
  • 10. @basgr from @peakaceag10 …depending on your webserver (Apache, nginx, IIS, etc.), caching and its configuration. Make sure to understand your setup first! Content & structure can vary…
  • 11. pa.ag@peakaceag @basgr from @peakaceag11 What does a log file usually look like? 1 Server IP/host name, 2 timestamp (date & time), 3 method (GET/POST/HEAD), 4 request URL, 5 HTTP status code, 6 size in bytes, 7 referrer, 8 user-agent. For example: 188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "http://www.wikipedia.org/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
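A minimal sketch of how such a line can be parsed programmatically. The regex and field names are my own and assume the common Apache/nginx "combined" log format (which may also carry identd/user fields between the IP and the timestamp); adapt both to your server's actual configuration:

```python
import re

# Regex for one access-log line; the optional group covers the identd/user
# fields of the standard "combined" format. Named groups mirror the eight
# fields from the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?:\S+ \S+ )?\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if the
    line does not match the expected format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('188.65.114.xxx [21/May/2019:02:00:00 -0100] '
        '"GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 '
        '"http://www.wikipedia.org/" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
fields = parse_log_line(line)
```

Lines that do not match (internal health checks, malformed requests) come back as None, so they can be counted separately rather than silently dropped.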
  • 12. @basgr from @peakaceag12 Log file data can be quite overwhelming because you can do so many different things; make sure you’ve got your questions prepared! You need to ask the right questions!
  • 13. pa.ag@peakaceag @basgr from @peakaceag13 Log file data can differ from, e.g., Google Analytics data While log files are direct, server-side pieces of information, Google Analytics uses client-side code. As the data sets come from two different sources, they can differ! The configuration within Google Analytics (i.e. filters) also leads to data differences when compared to the log files.
  • 14. @basgr from @peakaceag14 Be cautious when requesting log files from your clients Frequently asked questions
  • 15. @basgr from @peakaceag15 We only care about crawlers such as Google and Bing; no need for any user data (operating system, browser, phone number, usernames, etc.) 1. Personal information in logs?
  • 16. @basgr from @peakaceag16 If you are running a cache server and/or a CDN which creates logs elsewhere, we will also need these logs 2. Separate multi-location logs?
  • 17. @basgr from @peakaceag17 There are different ways you could approach this: Log file auditing tools
  • 18. pa.ag@peakaceag @basgr from @peakaceag18 There are different ways you could approach this:
  • 19. pa.ag@peakaceag @basgr from @peakaceag19 Do-it-yourself solution based on Excel You’d have to manually build filtering, cross-references, etc. – it just doesn’t scale!
  • 20. pa.ag@peakaceag @basgr from @peakaceag20 Screaming Frog Log File Analyser Beginner-level, desktop-based log file auditing with pre-defined reports.
  • 21. @basgr from @peakaceag21 No sharing capabilities, log files need to be manually up/downloaded, which is usually problematic for larger files, etc. Desktop solutions are limited
  • 22. pa.ag@peakaceag @basgr from @peakaceag22 Splunk or Sumo Logic: proprietary, paid software solutions Enterprise tools such as Splunk usually come with a hefty (volume-based) price tag. In all fairness though: these solutions offer features way beyond log file monitoring! Image sources: https://pa.ag/2srgTZu (splunk) & https://pa.ag/2JcuiLt (sumologic)
  • 23. pa.ag@peakaceag @basgr from @peakaceag23 The Elastic Stack (ELK): Elasticsearch, Logstash & Kibana Elasticsearch: search & analytics engine, Logstash: server-side data processing pipeline, Kibana: data visualisation (charts, graphs, etc.) – all open source. Image source: https://pa.ag/2JbFUhP
  • 24. pa.ag@peakaceag @basgr from @peakaceag24 Other SaaS solutions: logrunner.io, logz.io (ELK) & Loggly logrunner.io in particular has a strong focus on SEO-based auditing (dashboards, etc.).
  • 25. pa.ag@peakaceag @basgr from @peakaceag25 crawlOPTIMIZER: SaaS log file auditing, made in Vienna Its top USP: BRPs (business-relevant pages) with dedicated evaluations of these pages.
  • 26. @basgr from @peakaceag26 No messing around with exports, up/downloads, easy sharing capabilities and the ability to deal with massive volumes, etc. The beauty of SaaS: almost real time
  • 27. @basgr from @peakaceag27 For an easy start: trend monitoring (over time) & gathering insights Let’s have a look at some data
  • 28. pa.ag@peakaceag @basgr from @peakaceag28 Most obvious approach: spotting anomalies vs. time frame Tip: this is why it makes a lot of sense to check your log files regularly (e.g. daily). This looks unusual; take it as a starting point for further investigation.
  • 29. pa.ag@peakaceag @basgr from @peakaceag29 User-agent crawling frequencies over time Understanding patterns and irregularities can be very helpful - always look at the crawl behaviour of individual user agents over time.
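Such a trend report boils down to counting requests per day and per bot family. A toy sketch (the `entries` records, the `classify_bot` helper and its substring matching are my own simplifications; real bots should additionally be verified, see the reverse-DNS slide):

```python
from collections import Counter

def classify_bot(user_agent):
    # Naive user-agent classification for illustration only; the string
    # can be spoofed, so treat this as a first-pass grouping.
    for name in ("Googlebot", "bingbot", "YandexBot"):
        if name.lower() in user_agent.lower():
            return name
    return "other"

def crawl_trend(entries):
    """Count requests per (day, bot family) so spikes and drops stand out."""
    counts = Counter()
    for e in entries:
        day = e["timestamp"].split(":", 1)[0]   # "21/May/2019:02:00:00 -0100" -> "21/May/2019"
        counts[(day, classify_bot(e["user_agent"]))] += 1
    return counts

entries = [
    {"timestamp": "21/May/2019:02:00:00 -0100", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"timestamp": "21/May/2019:02:01:00 -0100", "user_agent": "Mozilla/5.0 (compatible; bingbot/2.0)"},
    {"timestamp": "22/May/2019:09:00:00 -0100", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
]
trend = crawl_trend(entries)
```

Plotting these counts per day is what makes anomalies like the one on the previous slide visible at a glance.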
  • 30. @basgr from @peakaceag30 Use log files to look for spam bots or scrapers to block! What other ”bots“ access your site?
  • 31. pa.ag@peakaceag @basgr from @peakaceag31 Not everyone is who they claim to be! The easiest way to detect if Googlebot really is Googlebot: run a reverse DNS lookup. Bingbot can also be verified via *.search.msn.com. Source: https://pa.ag/2JqOk8d
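A sketch of that verification, assuming DNS access at runtime (function names are my own): do the reverse (PTR) lookup, check the hostname suffix, then forward-confirm that the hostname resolves back to the original IP, since anyone who controls their own PTR record could otherwise fake the name.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(hostname):
    """Pure check: does a reverse-DNS hostname belong to Google?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse lookup, then forward-confirm the hostname resolves to the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]       # reverse DNS (PTR record)
    except socket.herror:
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

For Bingbot, the same approach applies with a `.search.msn.com` suffix instead.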
  • 32. pa.ag@peakaceag @basgr from @peakaceag32 What are the most crawled Googlebot pages? Also, verify whether they coincide with your domain's most important ones. Check whether these really are your most valuable pages.
  • 33. pa.ag@peakaceag @basgr from @peakaceag33 Breakdown of crawl requests & status codes per directory You’d easily see if one of your main directories encountered crawling/response issues. Tip: establish this on a regular basis to ensure continued performance of top directories.
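A minimal version of that per-directory breakdown, working on already-parsed log records (the record keys and helper name are my own, matching the parser sketch earlier in the deck):

```python
from collections import defaultdict, Counter

def status_by_directory(entries):
    """Count crawl requests per top-level directory and status-code class."""
    report = defaultdict(Counter)
    for e in entries:
        path = e["url"].split("?", 1)[0]             # drop the query string
        parts = path.strip("/").split("/")
        directory = "/" + parts[0] + "/" if parts[0] else "/"
        status_class = e["status"][0] + "xx"         # "200" -> "2xx"
        report[directory][status_class] += 1
    return report

entries = [
    {"url": "/resources/whitepapers/", "status": "200"},
    {"url": "/resources/old-page/", "status": "404"},
    {"url": "/blog/post-1/?utm_source=x", "status": "200"},
]
report = status_by_directory(entries)
```

A directory whose 4xx/5xx share suddenly grows is exactly the kind of signal this report is meant to surface.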
  • 34. @basgr from @peakaceag34 And respective actions based on those findings Advanced auditing for SEO
  • 36. pa.ag@peakaceag @basgr from @peakaceag36 Identify any kind of "wrong" redirect: 302/304/307/308 Action: change to 301 (except geo redirects); also watch out for redirect chains! Investigate further to see what's in there
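Both checks can be automated once redirects are extracted from the logs. A sketch (the `redirects` mapping and function name are my own): flag everything that is not a 301, and follow each target to spot chains, i.e. redirects whose target is itself a redirect.

```python
def find_redirect_issues(redirects):
    """redirects maps source URL -> (status_code, target URL).
    Returns non-301 redirects and any redirect chains found."""
    wrong_status = [src for src, (status, _) in redirects.items() if status != 301]
    chains = []
    for src, (_, target) in redirects.items():
        hops = [src]
        # Follow targets until we leave the redirect map; the `not in hops`
        # guard stops infinite loops on circular redirects.
        while target in redirects and target not in hops:
            hops.append(target)
            target = redirects[target][1]
        hops.append(target)
        if len(hops) > 2:                # more than source -> final = a chain
            chains.append(hops)
    return wrong_status, chains

redirects = {
    "/old-a": (302, "/old-b"),
    "/old-b": (301, "/final"),
    "/promo": (307, "/landing"),
}
wrong, chains = find_redirect_issues(redirects)
```

Chains matter because every extra hop costs crawl budget and can dilute passed link equity.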
  • 38. pa.ag@peakaceag @basgr from @peakaceag38 4xx client errors: too many are a sign of poor site health Action: recover (200), redirect (301) or kill off entirely (410)
  • 39. pa.ag@peakaceag @basgr from @peakaceag39 Googlebot can't log in… (403: forbidden) If it's linked, Google will try to crawl it – they are greedy!
  • 40. pa.ag@peakaceag @basgr from @peakaceag40 5xx server errors: usually infrastructure-related Action: watch closely and/or talk to IT (server availability, high load, etc.) Check consistency; what happens when re-trying?
  • 41. @basgr from @peakaceag41 3. Crawl priority
  • 42. pa.ag@peakaceag @basgr from @peakaceag42 Understanding the most/least crawled URLs and folders Action: highly crawled pages/folders could be used e.g. for additional internal linking (add link hubs); less-crawled areas need to be linked more prominently. Can be used for additional, internal linking (improve discovery of other content) Clearly weak: either irrelevant (remove) or requires more attention
  • 44. pa.ag@peakaceag @basgr from @peakaceag44 Investigate if (new) URLs have been crawled at all Action: if relevant URLs haven’t been discovered/crawled at all, your internal linking is probably too weak. Consider XML sitemaps, better/more prominent linking, etc. If these are important URLs, you might have a problem!
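This check is a simple set difference between the URLs you expect to be crawled (e.g. from your XML sitemaps or a site crawl) and the URLs actually requested by Googlebot in the logs. A sketch with hypothetical URLs:

```python
def uncrawled_urls(expected_urls, crawled_urls):
    """Return expected URLs that never showed up in the bot's log entries."""
    return sorted(set(expected_urls) - set(crawled_urls))

sitemap_urls = ["/new-category/", "/new-product-1/", "/new-product-2/"]
crawled_by_googlebot = ["/new-category/", "/new-product-1/", "/legacy-page/"]
missing = uncrawled_urls(sitemap_urls, crawled_by_googlebot)
```

Normalising both sides first (trailing slashes, casing, stripped tracking parameters) avoids false positives from trivially different URL spellings.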
  • 46. pa.ag@peakaceag @basgr from @peakaceag46 I‘m sure you‘ve all seen this? Source: https://pa.ag/2LUnt2R
  • 47. pa.ag@peakaceag @basgr from @peakaceag47 This is what the Google Webmaster Central blog says: Source: https://pa.ag/2HhsYoz Wasting server resources on pages […] will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.
  • 48. pa.ag@peakaceag @basgr from @peakaceag48 If you have ever had to deal with sites like these… Properly dealing with >30,000,000 crawlable URLs (due to parameter usage) certainly makes a difference in organic performance!
  • 49. pa.ag@peakaceag @basgr from @peakaceag49 URL parameters cause most problems (Combined) URL parameters often generate millions of unnecessary URLs, especially for large domains, which Googlebot diligently crawls (once found).
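To quantify this, count how often each parameter and, more importantly, each parameter combination appears in crawled URLs; combinations are what typically explode the crawl space. A sketch using the standard library (function name and sample URLs are my own; sorting the keys makes `?color=…&size=…` and `?size=…&color=…` count as the same combination):

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def parameter_stats(urls):
    """Count single URL parameters and sorted parameter combinations."""
    single, combos = Counter(), Counter()
    for url in urls:
        params = sorted(k for k, _ in parse_qsl(urlsplit(url).query))
        for p in params:
            single[p] += 1
        if params:
            combos[tuple(params)] += 1
    return single, combos

urls = [
    "/shoes/?color=red&size=42",
    "/shoes/?size=42&color=blue",   # same combination, different order
    "/shoes/?sort=price",
]
single, combos = parameter_stats(urls)
```

Run over a full day of Googlebot requests, the `combos` counter usually makes it obvious which parameter combinations to tackle first.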
  • 50. pa.ag@peakaceag @basgr from @peakaceag50 URL parameter behaviour over time Constantly be on the lookout for new parameters as well as significantly increased crawling for known parameters.
  • 51. pa.ag@peakaceag @basgr from @peakaceag51 A brief overview: #SMXInsights 01 No one-size-fits-all solution Log file size, quantity & availability are all decisive with regards to tool selection. 02 Preparation is key Concrete questions help to generate an efficient analysis. 03 Crawl data only Be precise with your requests (to the IT department) - you just want to know what the search engines are doing! 04 Reverse DNS use Not every crawler is who they pretend to be - do not "blindly" trust the user-agent string. 05 URL parameters These are almost always the biggest problem (combinations, order, consistency) - audit them first.
  • 52. @basgr from @peakaceag52 Oh yeah, there’s one more thing …
  • 53. @basgr from @peakaceag53 I want: no IT involvement, unlimited scalability, flexible reporting, multiple (API) data sources and ease of use! There's got to be another way!
  • 54. @basgr from @peakaceag54 (And everyone at #SMX gets this as a gift - for free!) We've thought of something:
  • 55. pa.ag@peakaceag @basgr from @peakaceag55 Say hello to the Peak Ace log file auditing stack Log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and visualised in Data Studio via the BigQuery Connector. [Diagram: log files plus the GA API v4, GSC API v3 & DeepCrawl API feed into BigQuery via Google Apps Script; Google Dataprep transforms the data; Google Data Studio displays it.]
  • 56. @basgr from @peakaceag56 Individual reports, tailored to your needs And what do the results look like?
  • 60. @basgr from @peakaceag60 Connect and conquer… How does it work?
  • 61. pa.ag@peakaceag @basgr from @peakaceag61 #1 Log file data from web servers, CDN, cache, etc. How often do bots actually crawl? What do they crawl and when? Source: https://pa.ag/2zs9lcY Goal: improve site architecture by analysing real bot crawling data. ▪ Number of crawls/requests by bot type ▪ Identification of crawling patterns ▪ Overview of errors ▪ 3xx ▪ 4xx ▪ 5xx Log files → Google Cloud Storage: import as text files (exclude IP addresses!)
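The "exclude IP addresses" step can be automated before upload. One possible approach, assuming IPv4-only logs (the salted-hash scheme, function name and salt are my own; for pure bot analysis you could also drop the IPs entirely):

```python
import hashlib
import re

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")  # IPv4 only in this sketch

def anonymise_line(line, salt="change-me"):
    """Replace every IPv4 address with a salted hash prefix, so lines can
    still be grouped per client without storing personal data."""
    def _hash(match):
        digest = hashlib.sha256((salt + match.group(0)).encode()).hexdigest()
        return "ip-" + digest[:12]
    return IP_RE.sub(_hash, line)

line = '188.65.114.13 - - [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512'
clean = anonymise_line(line)
```

Hashing with a secret salt keeps per-client grouping possible while making the raw address unrecoverable from the stored files; whether hashing alone is sufficient under your applicable privacy rules is something to clear with legal first.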
  • 62. @basgr from @peakaceag62 15 TB (per single file) can be pushed into BigQuery Size is absolutely NOT an issue
  • 63. @basgr from @peakaceag63 nginx / Apache / etc. >> fluentd >> BigQuery Stand-alone files are messy, agreed.
  • 64. pa.ag@peakaceag @basgr from @peakaceag64 #2 Google Analytics API Enrich reports with traffic, engagement, behavioural and page speed data Goal: compare crawling behaviour with user & loading time data. URL-based data on important engagement metrics: ▪ Sessions ▪ Users ▪ Bounce rate ▪ Session duration ▪ Avg. time on page ▪ Avg. server response time ▪ Avg. page load time ▪ … Google Analytics Reporting API v4
  • 65. pa.ag@peakaceag @basgr from @peakaceag65 #3 Google Search Console API Organic search performance data directly from Google Goal: compare crawling behaviour with organic click data & e.g. retrieve reported crawling errors. Organic click data ▪ Clicks ▪ Impressions ▪ Device ▪ … URL-based server response data ▪ Status code Google Search Console API v3
  • 66. pa.ag@peakaceag @basgr from @peakaceag66 #4 DeepCrawl API Website architecture, status codes, indexing directives, etc. Goal: capture indexing directives, response codes and more. DeepCrawl API
  • 67. pa.ag@peakaceag @basgr from @peakaceag67 #5 Google Apps Scripts for GA, GSC & DeepCrawl API access: capture multiple dimensions and metrics from GA, retrieve GSC crawl and search analysis data and DeepCrawl crawl & analysis data Source: https://pa.ag/2OWnjJa Goal: send data (via/from the respective API) to BigQuery and store the data there. Google Apps Script
  • 68. pa.ag@peakaceag @basgr from @peakaceag68 #6 Google Cloud Dataprep Clean and process the data. Afterwards, combine these various sources with several joins so that they're ready for visualisation. Source: https://pa.ag/2Q6rEde Goal: combine data from log files, GSC, GA & DeepCrawl within/by using processing flows. Dataprep: "Excel with super rocket fuel" ▪ Amazing RegEx support ▪ Select data, receive automated proposals for processing ▪ Join data sources by e.g. full inner/outer join, left/right outer join…
  • 69. @basgr from @peakaceag69 And use Google Data Studio to visualise: Save everything to BigQuery
  • 74. pa.ag@peakaceag @basgr from @peakaceag74 Log file auditing is not a project, but a process! Integrate log file auditing into your regular SEO workflow; one-off audits are a good start, but they become truly invaluable when you combine them with web crawl data and perform them on an ongoing basis.
  • 75. pa.ag@peakaceag @basgr from @peakaceag75 bg@pa.ag Slides? No problem: https://pa.ag/smxl19logs You want our log file setup (for free)? e-mail us > hi@pa.ag Bastian Grimm twitter.com/peakaceag facebook.com/peakaceag www.pa.ag ALWAYS LOOKING FOR TALENT! CHECK OUT JOBS.PA.AG