@basgr from @peakaceag1
Advanced data-driven technical SEO
Merging your logfiles, GA, GSC & web crawl data for better SEO insights
Bastian Grimm, Peak Ace AG | @basgr
SMX London 2019
@basgr from @peakaceag2
And why are log files important for your SEO work?
Why should you care?
pa.ag@peakaceag @basgr from @peakaceag3
I am a big fan of the various crawling tools, but…
It’s only the access log files that demonstrate
how a search engine’s crawler is behaving on
your site; all crawling tools are simply trying to
simulate their behaviour!
@basgr from @peakaceag4
You need to see which pages are being prioritised by the search
engines and should therefore be considered the most important
1. Understand crawl priorities
@basgr from @peakaceag5
Google may reduce its crawling behaviour/frequency & eventually rank
you lower if you are constantly serving a large number of errors
2. Prevent reduced crawling
@basgr from @peakaceag6
It’s essential to identify any crawl shortcomings
(such as hierarchy or internal link structure)
with potential site-wide implications
3. Understand global issues
@basgr from @peakaceag7
You need to ensure that Google crawls everything important:
primarily ranking-relevant content, but also fresh & older items
4. Ensure proper crawling
@basgr from @peakaceag8
It’s important to ensure that any gained link equity will
always be passed using proper links and/or redirects
5. Ensure proper linking
@basgr from @peakaceag9
Keep in mind, details depend on the individual setup!
The characteristics of a log file
@basgr from @peakaceag10
…depending on your webserver (Apache, nginx, IIS, etc.), caching
and its configuration. Make sure to understand your setup first!
Content & structure can vary…
pa.ag@peakaceag @basgr from @peakaceag11
What does a log file usually look like?
1. Server IP/host name
2. Timestamp (date & time)
3. Method (GET/POST/HEAD)
4. Request URL
5. HTTP status code
6. Size in bytes
7. Referrer
8. User-agent
188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "http://www.wikipedia.org/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
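To make the format above concrete, here is a minimal Python sketch that parses one such line into the fields listed; the regex assumes exactly the combined-style layout shown above, so adjust it for your own server's log format.

```python
import re

# Pattern for the sample layout above: host, timestamp, request, status,
# size, referrer, user-agent. Real log formats vary by server configuration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>GET|POST|HEAD) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Return the named fields as a dict, or an empty dict if no match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else {}

sample = (
    '188.65.114.xxx [21/May/2019:02:00:00 -0100] '
    '"GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 '
    '"http://www.wikipedia.org/" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
fields = parse_log_line(sample)
print(fields["status"], fields["url"])
```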
@basgr from @peakaceag12
Log file data can be quite overwhelming because you can do so
many different things; make sure you’ve got
your questions prepared!
You need to ask the right questions!
pa.ag@peakaceag @basgr from @peakaceag13
Log file data can differ from, e.g., Google Analytics data
While log files are direct, server-side pieces of information, Google Analytics uses
client-side code. As the data sets come from two different sources, they can differ!
The configuration within Google Analytics (e.g. filters) also leads to data differences
when compared to the log files!
@basgr from @peakaceag14
Be cautious when requesting log files from your clients
Frequently asked questions
@basgr from @peakaceag15
We only care about crawlers such as Google and Bing; no need for any
user data (operating system, browser, phone number, usernames, etc.)
1. Personal information in logs?
@basgr from @peakaceag16
If you are running a cache server and/or a CDN which
creates logs elsewhere, we will also need these logs
2. Separate multi-location logs?
@basgr from @peakaceag17
There are different ways you could approach this:
Log file auditing tools
pa.ag@peakaceag @basgr from @peakaceag18
There are different ways you could approach this:
pa.ag@peakaceag @basgr from @peakaceag19
Do-it-yourself solution based on Excel
You’d have to manually build filtering, cross-references, etc. – it just doesn’t scale!
pa.ag@peakaceag @basgr from @peakaceag20
Screaming Frog Log File Analyser
Beginner-level, desktop-based log file auditing with pre-defined reports.
@basgr from @peakaceag21
No sharing capabilities, log files need to be manually up/downloaded,
which is usually problematic for larger files, etc.
Desktop solutions are limited
pa.ag@peakaceag @basgr from @peakaceag22
Splunk or Sumo Logic: proprietary, paid software solutions
Enterprise tools such as Splunk usually come with a hefty (volume-based) price tag.
In all fairness though: these solutions offer features way beyond log file monitoring!
Image sources: https://pa.ag/2srgTZu (splunk) & https://pa.ag/2JcuiLt (sumologic)
pa.ag@peakaceag @basgr from @peakaceag23
The Elastic Stack (ELK): Elasticsearch, Logstash & Kibana
Elasticsearch: search & analytics engine, Logstash: server-side data processing
pipeline, Kibana: data visualisation (charts, graphs, etc.) – all open source.
Image source: https://pa.ag/2JbFUhP
pa.ag@peakaceag @basgr from @peakaceag24
Other SaaS solutions: logrunner.io, logz.io (ELK) & Loggly
logrunner.io in particular has a strong focus on SEO-based auditing (dashboards etc.).
pa.ag@peakaceag @basgr from @peakaceag25
crawlOPTIMIZER: SaaS log file auditing, made in Vienna
Top USP: dedicated evaluations of BRPs (Business Relevant Pages).
@basgr from @peakaceag26
No messing around with exports, up/downloads, easy sharing
capabilities and the ability to deal with massive volumes, etc.
The beauty of SaaS: almost real time
@basgr from @peakaceag27
For an easy start: trend monitoring (over time) & gathering insights
Let’s have a look at some data
pa.ag@peakaceag @basgr from @peakaceag28
Most obvious approach: spotting anomalies vs. time frame
Tip: this is why it makes a lot of sense to check your log files regularly (e.g. daily).
This looks unusual; take it
as a starting point for
further investigation.
pa.ag@peakaceag @basgr from @peakaceag29
User crawling frequencies over time
Understanding patterns and irregularities can be very helpful - always look at the crawl
behaviour of individual user-agents over time.
@basgr from @peakaceag30
Use log files to look for spam bots or scrapers to block!
What other "bots" access your site?
pa.ag@peakaceag @basgr from @peakaceag31
Not everyone is who they claim to be!
The easiest way to detect if Googlebot really is Googlebot: run a reverse DNS lookup.
Bingbot can also be verified via *.search.msn.com.
Source: https://pa.ag/2JqOk8d
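The two-step check described above can be sketched with the Python standard library: reverse-resolve the requesting IP, check that the host sits under googlebot.com or google.com, then forward-resolve that host and confirm it maps back to the same IP. This sketch needs live DNS access; Bing's bots can be verified the same way against *.search.msn.com.

```python
import socket

def verify_googlebot(ip: str) -> bool:
    """Reverse DNS lookup, domain check, then forward-confirm the host."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record: cannot be a verified Googlebot
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
    return ip in forward_ips
```

Run this only on IPs whose user-agent claims to be Googlebot; anything failing the check is a candidate for blocking.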
pa.ag@peakaceag @basgr from @peakaceag32
What are the most crawled Googlebot pages?
Also, verify if they coincide with your domains’ most important ones.
Ask yourself: are these really
your most valuable pages?
pa.ag@peakaceag @basgr from @peakaceag33
Breakdown of crawl requests & status codes per directory
You’d easily see if one of your main directories encountered crawling/response issues.
Tip: establish this on a regular basis to ensure continued performance of top directories.
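A per-directory breakdown like this can be sketched in a few lines of Python; the (path, status) tuples below stand in for parsed log entries and are made-up examples.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def requests_per_directory(entries):
    """entries: iterable of (url_path, status_code) tuples from parsed logs.
    Returns {top-level directory: Counter({status: count})}."""
    breakdown = defaultdict(Counter)
    for path, status in entries:
        path = urlsplit(path).path            # drop any query string
        parts = [p for p in path.split("/") if p]
        directory = "/" + parts[0] + "/" if parts else "/"
        breakdown[directory][status] += 1
    return dict(breakdown)

crawl_log = [
    ("/products/shoes/", 200),
    ("/products/bags/?sort=price", 200),
    ("/products/old-item/", 404),
    ("/blog/seo-tips/", 200),
]
print(requests_per_directory(crawl_log))
# {'/products/': Counter({200: 2, 404: 1}), '/blog/': Counter({200: 1})}
```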
@basgr from @peakaceag34
And respective actions based on those findings
Advanced auditing for SEO
@basgr from @peakaceag35
1. Redirects
pa.ag@peakaceag @basgr from @peakaceag36
Identify any kind of "wrong" redirect: 302/304/307/308
Action: change to 301 (except geo redirects); also watch out for redirect chains!
Investigate further to
see what’s in there
@basgr from @peakaceag37
2. Crawl errors
pa.ag@peakaceag @basgr from @peakaceag38
4xx client errors: too many are a sign of poor site health
Action: recover (200), redirect (301) or kill off entirely (410)
pa.ag@peakaceag @basgr from @peakaceag39
Googlebot can't log in… (403: Forbidden)
If it's linked, Google will try to crawl it; they are greedy!
pa.ag@peakaceag @basgr from @peakaceag40
5xx server errors: usually infrastructure-related
Action: watch closely and/or talk to IT (server availability, high load, etc.)
Check consistency; what
happens when re-trying?
@basgr from @peakaceag41
3. Crawl priority
pa.ag@peakaceag @basgr from @peakaceag42
Understanding the most/least crawled URLs and folders
Action: highly crawled pages/folders could be used for additional internal linking
(e.g. add link hubs); low-crawled areas need to be linked more prominently.
Can be used for additional, internal linking (improve
discovery of other content)
Clearly weak, either irrelevant (remove) or requires
more attention
@basgr from @peakaceag43
4. Last crawled
pa.ag@peakaceag @basgr from @peakaceag44
Investigate if (new) URLs have been crawled at all
Action: if relevant URLs haven’t been discovered/crawled at all, your internal linking is
probably too weak. Consider XML sitemaps, better/more prominent linking, etc.
If these are important URLs,
you might have a problem!
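A quick way to sketch this check: diff the URLs listed in your XML sitemaps against the URLs Googlebot actually requested in the logs (both lists below are made-up examples).

```python
def never_crawled(sitemap_urls, crawled_urls):
    """URLs in your XML sitemaps that never appear in Googlebot log
    entries - candidates for stronger internal linking."""
    return sorted(set(sitemap_urls) - set(crawled_urls))

sitemap = ["/new-category/", "/new-category/item-1/", "/about/"]
crawled = ["/about/", "/blog/"]
print(never_crawled(sitemap, crawled))
# ['/new-category/', '/new-category/item-1/']
```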
@basgr from @peakaceag45
5. Crawl waste
pa.ag@peakaceag @basgr from @peakaceag46
I'm sure you've all seen this?
Source: https://pa.ag/2LUnt2R
pa.ag@peakaceag @basgr from @peakaceag47
This is what the Google Webmaster Central blog says:
Source: https://pa.ag/2HhsYoz
Wasting server resources on pages […] will
drain crawl activity from pages that do actually
have value, which may cause a significant
delay in discovering great content on a site.
pa.ag@peakaceag @basgr from @peakaceag48
If you have ever had to deal with sites like these…
Properly dealing with >30,000,000 crawlable URLs (due to parameter usage) certainly
makes a difference in organic performance!
pa.ag@peakaceag @basgr from @peakaceag49
URL parameters cause most problems
(Combined) URL parameters often generate millions of unnecessary URLs, especially for
large domains, which Googlebot diligently crawls (once found).
pa.ag@peakaceag @basgr from @peakaceag50
URL parameter behaviour over time
Constantly be on the lookout for new parameters as well as significantly increased
crawling for known parameters.
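A simple sketch for this kind of parameter monitoring: count parameter names across the crawled URLs in one log period, then compare snapshots between periods to spot new or rapidly growing parameters (the example URLs are hypothetical).

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def parameter_frequency(requested_urls):
    """Count how often each URL parameter name appears across crawled URLs."""
    counts = Counter()
    for url in requested_urls:
        for name, _ in parse_qsl(urlsplit(url).query):
            counts[name] += 1
    return counts

urls = [
    "/shoes/?color=red&size=42",
    "/shoes/?color=blue",
    "/shoes/?sessionid=abc123",
]
print(parameter_frequency(urls).most_common())
# [('color', 2), ('size', 1), ('sessionid', 1)]
```

A session-ID parameter showing up here, as in the last example URL, is a classic source of crawl waste.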
pa.ag@peakaceag @basgr from @peakaceag51
A brief overview: #SMXInsights
01 No one-size-fits-all solution: log file size, quantity & availability are all decisive with regards to tool selection.
02 Preparation is key: concrete questions help to generate efficient analysis.
03 Crawl data only: be precise with your requests (to the IT department); you just want to know what the search engines are doing!
04 Use reverse DNS: not every crawler is who they pretend to be; do not "blindly" trust the user-agent string.
05 URL parameters: these are almost always the biggest problem (combinations, order, consistency); audit them first.
@basgr from @peakaceag52
Oh yeah, there’s one more thing …
@basgr from @peakaceag53
I want: no IT involvement, unlimited scalability, flexible reporting, multiple
(API) data sources and ease of use!
There's got to be another way!
@basgr from @peakaceag54
(And everyone at #SMX gets this as a gift - for free!)
We've thought of something:
pa.ag@peakaceag @basgr from @peakaceag55
Say hello to the Peak Ace log file auditing stack
Log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and
visualised in Data Studio via the BigQuery Connector.
[Diagram: numbered flow in which log files plus the GA API v4, GSC API v3 and DeepCrawl API feed in (partly via Google Apps Script); data is imported to Google Cloud Storage, transmitted through Google Dataprep to Google BigQuery, and displayed in Google Data Studio]
@basgr from @peakaceag56
Individual reports, tailored to your needs
And what do the results look like?
pa.ag@peakaceag @basgr from @peakaceag57
pa.ag@peakaceag @basgr from @peakaceag58
pa.ag@peakaceag @basgr from @peakaceag59
@basgr from @peakaceag60
Connect and conquer…
How does it work?
pa.ag@peakaceag @basgr from @peakaceag61
#1 Log file data from web servers, CDN, cache, etc.
How often do bots actually crawl? What do they crawl and when?
Source: https://pa.ag/2zs9lcY
Goal: improve site architecture by
analysing real bot crawling data.
▪ Amount of crawls/requests by bot type
▪ Identification of crawling patterns
▪ Overview of errors
▪ 3xx
▪ 4xx
▪ 5xx
Log files >> Google Cloud Storage: import as text files (exclude IP addresses!)
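One way to honour the "exclude IP addresses!" requirement before uploading: mask the last octet of every IPv4 address in each log line. A minimal sketch; a real pipeline should also cover IPv6 and any other personal data your logs might contain.

```python
import re

# Four dot-separated octet groups; the first three are kept, the last masked.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.\d{1,3}\b")

def mask_ips(line: str) -> str:
    """Replace the last octet of any IPv4 address so no complete client IP
    reaches Cloud Storage."""
    return IPV4.sub(r"\1.\2.\3.xxx", line)

print(mask_ips('188.65.114.42 [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512'))
# 188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512
```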
@basgr from @peakaceag62
15 TB (in a single file) to be pushed into BigQuery
Size is absolutely NOT an issue
@basgr from @peakaceag63
nginx / Apache / etc. >> fluentd >> BigQuery
Stand-alone files are messy, agreed.
pa.ag@peakaceag @basgr from @peakaceag64
#2 Google Analytics API
Enrich reports with traffic, engagement, behavioural and page speed data
Goal: compare crawling behaviour with user & loading
time data.
URL-based data on important engagement metrics:
▪ Sessions
▪ Users
▪ Bounce rate
▪ Session duration
▪ Avg. time on page
▪ Avg. server response time
▪ Avg. page load time
▪ …
Google Analytics
Reporting API v4
pa.ag@peakaceag @basgr from @peakaceag65
#3 Google Search Console API
Organic search performance data directly from Google
Goal: compare crawling behaviour with organic click
data & e.g. retrieve reported crawling errors.
Organic click data
▪ Clicks
▪ Impressions
▪ Device
▪ …
URL-based server response data
▪ Status code
Google Search
Console API v3
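As a hedged sketch, a Search Analytics request body for the GSC API v3 could look like this; the commented-out call assumes an authorised google-api-python-client `service` object, which needs OAuth credentials set up separately.

```python
def build_gsc_query(start_date: str, end_date: str, row_limit: int = 25000) -> dict:
    """Build a Search Analytics query body: URL-level clicks/impressions
    broken down by device for the given date range."""
    return {
        "startDate": start_date,           # ISO format: YYYY-MM-DD
        "endDate": end_date,
        "dimensions": ["page", "device"],
        "rowLimit": row_limit,             # the API caps each response at 25,000 rows
    }

body = build_gsc_query("2019-05-01", "2019-05-31")
# response = service.searchanalytics().query(
#     siteUrl="https://www.example.com/", body=body).execute()
# for row in response.get("rows", []):
#     print(row["keys"], row["clicks"], row["impressions"])
print(body["dimensions"])
```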
pa.ag@peakaceag @basgr from @peakaceag66
#4 DeepCrawl API
Website architecture, status codes, indexing directives, etc.
Goal: capture indexing directives, response codes and
more.
DeepCrawl
API
pa.ag@peakaceag @basgr from @peakaceag67
#5 Google Apps Script for GA, GSC & DeepCrawl
API access: capture multiple dimensions and metrics from GA, retrieve GSC crawl and
search analysis data, and pull DeepCrawl crawl & analysis data
Source: https://pa.ag/2OWnjJa
Goal: send data (via/from the respective API) to BigQuery
and store the data there.
Google Apps Script
pa.ag@peakaceag @basgr from @peakaceag68
#6 Google Cloud Dataprep
Clean and process the data. Afterwards, combine these various sources with
several joins so that they're ready for visualisation.
Source: https://pa.ag/2Q6rEde
Goal: combine data from log files, GSC, GA & DeepCrawl
within/by using processing flows.
Dataprep: "Excel with super rocket fuel"
▪ Amazing RegEx support
▪ Select data, receive automated
proposals for processing
▪ Join data sources by e.g. full
inner/outer join, left/right outer join…
@basgr from @peakaceag69
And use Google Data Studio to visualise:
Save everything to BigQuery
pa.ag@peakaceag @basgr from @peakaceag70
pa.ag@peakaceag @basgr from @peakaceag71
pa.ag@peakaceag @basgr from @peakaceag72
pa.ag@peakaceag @basgr from @peakaceag73
pa.ag@peakaceag @basgr from @peakaceag74
Log file auditing is not a project, but a process!
Integrate log file auditing into your regular
SEO workflow; one-off audits are good to
begin with, but they really become invaluable
if you combine them with web crawl data and
perform them on an on-going basis.
pa.ag@peakaceag @basgr from @peakaceag75
bg@pa.ag
Slides? No problem:
https://pa.ag/smxl19logs
You want our log file setup (for free)?
e-mail us > hi@pa.ag
Bastian Grimm
twitter.com/peakaceag
facebook.com/peakaceag
www.pa.ag
ALWAYS LOOKING FOR TALENT! CHECK OUT JOBS.PA.AG
Advanced data-driven technical SEO - SMX London 2019

  • 1. @basgr from @peakaceag1 Bastian Grimm, Peak Ace AG | @basgr Merging your logfiles, GA, GSC & web crawl data for better SEO insights Advanced data-driven technical SEO
  • 2. @basgr from @peakaceag2 And why are log files important for your SEO work? Why should you care?
  • 3. pa.ag@peakaceag @basgr from @peakaceag3 I am a big fan of the various crawling tools, but… It’s only the access log files that demonstrate how a search engine’s crawler is behaving on your site; all crawling tools are simply trying to simulate their behaviour!
  • 4. @basgr from @peakaceag4 You need to see which pages are being prioritised by the search engines and should therefore be considered the most important 1. Understand crawl priorities
  • 5. @basgr from @peakaceag5 Google may reduce its crawling behaviour/frequency & eventually rank you lower if you are constantly serving a large number of errors 2. Prevent reduced crawling
  • 6. @basgr from @peakaceag6 It’s essential to identify any crawl shortcomings (such as hierarchy or internal link structure) with potential site-wide implications 3. Understand global issues
  • 7. @basgr from @peakaceag7 You need to ensure that Google crawls everything important: primarily ranking, relevant content, but also fresh & older items 4. Ensure proper crawling
  • 8. @basgr from @peakaceag8 It’s important to ensure that any gained link equity will always be passed using proper links and/or redirects 5. Ensure proper linking
  • 9. @basgr from @peakaceag9 Keep in mind, details depend on the individual setup! The characteristics of a log file
  • 10. @basgr from @peakaceag10 …depending on your webserver (Apache, nginx, IIS, etc.), caching and its configuration. Make sure to understand your setup first! Content & structure can vary…
  • 11. pa.ag@peakaceag @basgr from @peakaceag11 What does a log file usually look like? 1 Server IP/host name, 2 timestamp (date & time), 3 method (GET/POST/HEAD), 4 request URL, 5 HTTP status code, 6 size in bytes, 7 referrer, 8 user-agent. For example: 188.65.114.xxx [21/May/2019:02:00:00 -0100] "GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "http://www.wikipedia.org/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
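A minimal sketch of how such a line can be parsed programmatically. The regex and field names are my own and assume the common Apache/nginx "combined" log format (which may also carry identd/user fields between the IP and the timestamp); adapt both to your server's actual configuration:

```python
import re

# Regex for one access-log line; the optional group covers the identd/user
# fields of the standard "combined" format. Named groups mirror the eight
# fields from the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?:\S+ \S+ )?\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if the
    line does not match the expected format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('188.65.114.xxx [21/May/2019:02:00:00 -0100] '
        '"GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 '
        '"http://www.wikipedia.org/" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
fields = parse_log_line(line)
```

Lines that do not match (internal health checks, malformed requests) come back as None, so they can be counted separately rather than silently dropped.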
  • 12. @basgr from @peakaceag12 Log file data can be quite overwhelming because you can do so many different things; make sure you’ve got your questions prepared! You need to ask the right questions!
  • 13. pa.ag@peakaceag @basgr from @peakaceag13 Log file data can differ from, e.g., Google Analytics data While log files are direct, server-side pieces of information, Google Analytics uses client-side code. As the data sets come from two different sources, they can differ! The configuration within Google Analytics (i.e. filters) also leads to data differences when compared to the log files.
  • 14. @basgr from @peakaceag14 Be cautious when requesting log files from your clients Frequently asked questions
  • 15. @basgr from @peakaceag15 We only care about crawlers such as Google and Bing; no need for any user data (operating system, browser, phone number, usernames, etc.) 1. Personal information in logs?
  • 16. @basgr from @peakaceag16 If you are running a cache server and/or a CDN which creates logs elsewhere, we will also need these logs 2. Separate multi-location logs?
  • 17. @basgr from @peakaceag17 There are different ways you could approach this: Log file auditing tools
  • 18. pa.ag@peakaceag @basgr from @peakaceag18 There are different ways you could approach this:
  • 19. pa.ag@peakaceag @basgr from @peakaceag19 Do-it-yourself solution based on Excel You’d have to manually build filtering, cross-references, etc. – it just doesn’t scale!
  • 20. pa.ag@peakaceag @basgr from @peakaceag20 Screaming Frog Log File Analyser Beginner-level, desktop-based log file auditing with pre-defined reports.
  • 21. @basgr from @peakaceag21 No sharing capabilities, log files need to be manually up/downloaded, which is usually problematic for larger files, etc. Desktop solutions are limited
  • 22. pa.ag@peakaceag @basgr from @peakaceag22 Splunk or Sumo Logic: proprietary, paid software solutions Enterprise tools such as Splunk usually come with a hefty (volume-based) price tag. In all fairness though: these solutions offer features way beyond log file monitoring! Image sources: https://pa.ag/2srgTZu (splunk) & https://pa.ag/2JcuiLt (sumologic)
  • 23. pa.ag@peakaceag @basgr from @peakaceag23 The Elastic Stack (ELK): Elasticsearch, Logstash & Kibana Elasticsearch: search & analytics engine, Logstash: server-side data processing pipeline, Kibana: data visualisation (charts, graphs, etc.) – all open source. Image source: https://pa.ag/2JbFUhP
  • 24. pa.ag@peakaceag @basgr from @peakaceag24 Other SaaS solutions: logrunner.io, logz.io (ELK) & Loggly logrunner.io in particular has a strong focus on SEO-based auditing (dashboards, etc.).
  • 25. pa.ag@peakaceag @basgr from @peakaceag25 crawlOPTIMIZER: SaaS log file auditing, made in Vienna Its top USP: BRPs (business-relevant pages) with dedicated evaluations of these pages.
  • 26. @basgr from @peakaceag26 No messing around with exports, up/downloads, easy sharing capabilities and the ability to deal with massive volumes, etc. The beauty of SaaS: almost real time
  • 27. @basgr from @peakaceag27 For an easy start: trend monitoring (over time) & gathering insights Let’s have a look at some data
  • 28. pa.ag@peakaceag @basgr from @peakaceag28 Most obvious approach: spotting anomalies vs. time frame Tip: this is why it makes a lot of sense to check your log files regularly (e.g. daily). This looks unusual; take it as a starting point for further investigation.
  • 29. pa.ag@peakaceag @basgr from @peakaceag29 User-agent crawling frequencies over time Understanding patterns and irregularities can be very helpful - always look at the crawl behaviour of individual user agents over time.
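Such a trend report boils down to counting requests per day and per bot family. A toy sketch (the `entries` records, the `classify_bot` helper and its substring matching are my own simplifications; real bots should additionally be verified, see the reverse-DNS slide):

```python
from collections import Counter

def classify_bot(user_agent):
    # Naive user-agent classification for illustration only; the string
    # can be spoofed, so treat this as a first-pass grouping.
    for name in ("Googlebot", "bingbot", "YandexBot"):
        if name.lower() in user_agent.lower():
            return name
    return "other"

def crawl_trend(entries):
    """Count requests per (day, bot family) so spikes and drops stand out."""
    counts = Counter()
    for e in entries:
        day = e["timestamp"].split(":", 1)[0]   # "21/May/2019:02:00:00 -0100" -> "21/May/2019"
        counts[(day, classify_bot(e["user_agent"]))] += 1
    return counts

entries = [
    {"timestamp": "21/May/2019:02:00:00 -0100", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"timestamp": "21/May/2019:02:01:00 -0100", "user_agent": "Mozilla/5.0 (compatible; bingbot/2.0)"},
    {"timestamp": "22/May/2019:09:00:00 -0100", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
]
trend = crawl_trend(entries)
```

Plotting these counts per day is what makes anomalies like the one on the previous slide visible at a glance.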
  • 30. @basgr from @peakaceag30 Use log files to look for spam bots or scrapers to block! What other ”bots“ access your site?
  • 31. pa.ag@peakaceag @basgr from @peakaceag31 Not everyone is who they claim to be! The easiest way to detect if Googlebot really is Googlebot: run a reverse DNS lookup. Bingbot can also be verified via *.search.msn.com. Source: https://pa.ag/2JqOk8d
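A sketch of that verification, assuming DNS access at runtime (function names are my own): do the reverse (PTR) lookup, check the hostname suffix, then forward-confirm that the hostname resolves back to the original IP, since anyone who controls their own PTR record could otherwise fake the name.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(hostname):
    """Pure check: does a reverse-DNS hostname belong to Google?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse lookup, then forward-confirm the hostname resolves to the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]       # reverse DNS (PTR record)
    except socket.herror:
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

For Bingbot, the same approach applies with a `.search.msn.com` suffix instead.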
  • 32. pa.ag@peakaceag @basgr from @peakaceag32 What are the most crawled Googlebot pages? Also, verify whether they coincide with your domain's most important ones. Check whether these really are your most valuable pages.
  • 33. pa.ag@peakaceag @basgr from @peakaceag33 Breakdown of crawl requests & status codes per directory You’d easily see if one of your main directories encountered crawling/response issues. Tip: establish this on a regular basis to ensure continued performance of top directories.
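A minimal version of that per-directory breakdown, working on already-parsed log records (the record keys and helper name are my own, matching the parser sketch earlier in the deck):

```python
from collections import defaultdict, Counter

def status_by_directory(entries):
    """Count crawl requests per top-level directory and status-code class."""
    report = defaultdict(Counter)
    for e in entries:
        path = e["url"].split("?", 1)[0]             # drop the query string
        parts = path.strip("/").split("/")
        directory = "/" + parts[0] + "/" if parts[0] else "/"
        status_class = e["status"][0] + "xx"         # "200" -> "2xx"
        report[directory][status_class] += 1
    return report

entries = [
    {"url": "/resources/whitepapers/", "status": "200"},
    {"url": "/resources/old-page/", "status": "404"},
    {"url": "/blog/post-1/?utm_source=x", "status": "200"},
]
report = status_by_directory(entries)
```

A directory whose 4xx/5xx share suddenly grows is exactly the kind of signal this report is meant to surface.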
  • 34. @basgr from @peakaceag34 And respective actions based on those findings Advanced auditing for SEO
  • 36. pa.ag@peakaceag @basgr from @peakaceag36 Identify any kind of "wrong" redirect: 302/304/307/308 Action: change to 301 (except geo redirects); also watch out for redirect chains! Investigate further to see what's in there
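Both checks can be automated once redirects are extracted from the logs. A sketch (the `redirects` mapping and function name are my own): flag everything that is not a 301, and follow each target to spot chains, i.e. redirects whose target is itself a redirect.

```python
def find_redirect_issues(redirects):
    """redirects maps source URL -> (status_code, target URL).
    Returns non-301 redirects and any redirect chains found."""
    wrong_status = [src for src, (status, _) in redirects.items() if status != 301]
    chains = []
    for src, (_, target) in redirects.items():
        hops = [src]
        # Follow targets until we leave the redirect map; the `not in hops`
        # guard stops infinite loops on circular redirects.
        while target in redirects and target not in hops:
            hops.append(target)
            target = redirects[target][1]
        hops.append(target)
        if len(hops) > 2:                # more than source -> final = a chain
            chains.append(hops)
    return wrong_status, chains

redirects = {
    "/old-a": (302, "/old-b"),
    "/old-b": (301, "/final"),
    "/promo": (307, "/landing"),
}
wrong, chains = find_redirect_issues(redirects)
```

Chains matter because every extra hop costs crawl budget and can dilute passed link equity.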
  • 38. pa.ag@peakaceag @basgr from @peakaceag38 4xx client errors: too many are a sign of poor site health Action: recover (200), redirect (301) or kill off entirely (410)
  • 39. pa.ag@peakaceag @basgr from @peakaceag39 Googlebot can't log in… (403: forbidden) If it's linked, Google will try to crawl it – they are greedy!
  • 40. pa.ag@peakaceag @basgr from @peakaceag40 5xx server errors: usually infrastructure-related Action: watch closely and/or talk to IT (server availability, high load, etc.) Check consistency; what happens when re-trying?
  • 41. @basgr from @peakaceag41 3. Crawl priority
  • 42. pa.ag@peakaceag @basgr from @peakaceag42 Understanding the most/least crawled URLs and folders Action: highly crawled pages/folders could be used e.g. for additional internal linking (add link hubs); less-crawled areas need to be linked more prominently. Can be used for additional, internal linking (improve discovery of other content) Clearly weak: either irrelevant (remove) or requires more attention
  • 44. pa.ag@peakaceag @basgr from @peakaceag44 Investigate if (new) URLs have been crawled at all Action: if relevant URLs haven’t been discovered/crawled at all, your internal linking is probably too weak. Consider XML sitemaps, better/more prominent linking, etc. If these are important URLs, you might have a problem!
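This check is a simple set difference between the URLs you expect to be crawled (e.g. from your XML sitemaps or a site crawl) and the URLs actually requested by Googlebot in the logs. A sketch with hypothetical URLs:

```python
def uncrawled_urls(expected_urls, crawled_urls):
    """Return expected URLs that never showed up in the bot's log entries."""
    return sorted(set(expected_urls) - set(crawled_urls))

sitemap_urls = ["/new-category/", "/new-product-1/", "/new-product-2/"]
crawled_by_googlebot = ["/new-category/", "/new-product-1/", "/legacy-page/"]
missing = uncrawled_urls(sitemap_urls, crawled_by_googlebot)
```

Normalising both sides first (trailing slashes, casing, stripped tracking parameters) avoids false positives from trivially different URL spellings.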
  • 46. pa.ag@peakaceag @basgr from @peakaceag46 I‘m sure you‘ve all seen this? Source: https://pa.ag/2LUnt2R
  • 47. pa.ag@peakaceag @basgr from @peakaceag47 This is what the Google Webmaster Central blog says: Source: https://pa.ag/2HhsYoz Wasting server resources on pages […] will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.
  • 48. pa.ag@peakaceag @basgr from @peakaceag48 If you have ever had to deal with sites like these… Properly dealing with >30,000,000 crawlable URLs (due to parameter usage) certainly makes a difference in organic performance!
  • 49. pa.ag@peakaceag @basgr from @peakaceag49 URL parameters cause most problems (Combined) URL parameters often generate millions of unnecessary URLs, especially for large domains, which Googlebot diligently crawls (once found).
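To quantify this, count how often each parameter and, more importantly, each parameter combination appears in crawled URLs; combinations are what typically explode the crawl space. A sketch using the standard library (function name and sample URLs are my own; sorting the keys makes `?color=…&size=…` and `?size=…&color=…` count as the same combination):

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def parameter_stats(urls):
    """Count single URL parameters and sorted parameter combinations."""
    single, combos = Counter(), Counter()
    for url in urls:
        params = sorted(k for k, _ in parse_qsl(urlsplit(url).query))
        for p in params:
            single[p] += 1
        if params:
            combos[tuple(params)] += 1
    return single, combos

urls = [
    "/shoes/?color=red&size=42",
    "/shoes/?size=42&color=blue",   # same combination, different order
    "/shoes/?sort=price",
]
single, combos = parameter_stats(urls)
```

Run over a full day of Googlebot requests, the `combos` counter usually makes it obvious which parameter combinations to tackle first.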
  • 50. pa.ag@peakaceag @basgr from @peakaceag50 URL parameter behaviour over time Constantly be on the lookout for new parameters as well as significantly increased crawling for known parameters.
  • 51. pa.ag@peakaceag @basgr from @peakaceag51 A brief overview: #SMXInsights 01 No one-size-fits-all solution Log file size, quantity & availability are all decisive with regards to tool selection. 02 Preparation is key Concrete questions help to generate an efficient analysis. 03 Crawl data only Be precise with your requests (to the IT department) - you just want to know what the search engines are doing! 04 Reverse DNS use Not every crawler is who they pretend to be - do not "blindly" trust the user-agent string. 05 URL parameters These are almost always the biggest problem (combinations, order, consistency) - audit them first.
  • 52. @basgr from @peakaceag52 Oh yeah, there’s one more thing …
  • 53. @basgr from @peakaceag53 I want: no IT involvement, unlimited scalability, flexible reporting, multiple (API) data sources and ease of use! There's got to be another way!
  • 54. @basgr from @peakaceag54 (And everyone at #SMX gets this as a gift - for free!) We've thought of something:
  • 55. pa.ag@peakaceag @basgr from @peakaceag55 Say hello to the Peak Ace log file auditing stack Log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and visualised in Data Studio via the BigQuery Connector. [Diagram: log files plus the GA API v4, GSC API v3 & DeepCrawl API feed into BigQuery via Google Apps Script; Google Dataprep transforms the data; Google Data Studio displays it.]
  • 56. @basgr from @peakaceag56 Individual reports, tailored to your needs And what do the results look like?
  • 60. @basgr from @peakaceag60 Connect and conquer… How does it work?
  • 61. pa.ag@peakaceag @basgr from @peakaceag61 #1 Log file data from web servers, CDN, cache, etc. How often do bots actually crawl? What do they crawl and when? Source: https://pa.ag/2zs9lcY Goal: improve site architecture by analysing real bot crawling data. ▪ Number of crawls/requests by bot type ▪ Identification of crawling patterns ▪ Overview of errors ▪ 3xx ▪ 4xx ▪ 5xx Log files → Google Cloud Storage: import as text files (exclude IP addresses!)
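The "exclude IP addresses" step can be automated before upload. One possible approach, assuming IPv4-only logs (the salted-hash scheme, function name and salt are my own; for pure bot analysis you could also drop the IPs entirely):

```python
import hashlib
import re

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")  # IPv4 only in this sketch

def anonymise_line(line, salt="change-me"):
    """Replace every IPv4 address with a salted hash prefix, so lines can
    still be grouped per client without storing personal data."""
    def _hash(match):
        digest = hashlib.sha256((salt + match.group(0)).encode()).hexdigest()
        return "ip-" + digest[:12]
    return IP_RE.sub(_hash, line)

line = '188.65.114.13 - - [21/May/2019:02:00:00 -0100] "GET / HTTP/1.1" 200 512'
clean = anonymise_line(line)
```

Hashing with a secret salt keeps per-client grouping possible while making the raw address unrecoverable from the stored files; whether hashing alone is sufficient under your applicable privacy rules is something to clear with legal first.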
  • 62. @basgr from @peakaceag62 15 TB (per single file) can be pushed into BigQuery Size is absolutely NOT an issue
  • 63. @basgr from @peakaceag63 nginx / Apache / etc. >> fluentd >> BigQuery Stand-alone files are messy, agreed.
  • 64. pa.ag@peakaceag @basgr from @peakaceag64 #2 Google Analytics API Enrich reports with traffic, engagement, behavioural and page speed data Goal: compare crawling behaviour with user & loading time data. URL-based data on important engagement metrics: ▪ Sessions ▪ Users ▪ Bounce rate ▪ Session duration ▪ Avg. time on page ▪ Avg. server response time ▪ Avg. page load time ▪ … Google Analytics Reporting API v4
  • 65. pa.ag@peakaceag @basgr from @peakaceag65 #3 Google Search Console API Organic search performance data directly from Google Goal: compare crawling behaviour with organic click data & e.g. retrieve reported crawling errors. Organic click data ▪ Clicks ▪ Impressions ▪ Device ▪ … URL-based server response data ▪ Status code Google Search Console API v3
  • 66. pa.ag@peakaceag @basgr from @peakaceag66 #4 DeepCrawl API Website architecture, status codes, indexing directives, etc. Goal: capture indexing directives, response codes and more. DeepCrawl API
  • 67. pa.ag@peakaceag @basgr from @peakaceag67 #5 Google Apps Scripts for GA, GSC & DeepCrawl API access: capture multiple dimensions and metrics from GA, retrieve GSC crawl and search analysis data and DeepCrawl crawl & analysis data Source: https://pa.ag/2OWnjJa Goal: send data (via/from the respective API) to BigQuery and store the data there. Google Apps Script
  • 68. pa.ag@peakaceag @basgr from @peakaceag68 #6 Google Cloud Dataprep Clean and process the data. Afterwards, combine these various sources with several joins so that they're ready for visualisation. Source: https://pa.ag/2Q6rEde Goal: combine data from log files, GSC, GA & DeepCrawl within/by using processing flows. Dataprep: "Excel with super rocket fuel" ▪ Amazing RegEx support ▪ Select data, receive automated proposals for processing ▪ Join data sources by e.g. full inner/outer join, left/right outer join…
  • 69. @basgr from @peakaceag69 And use Google Data Studio to visualise: Save everything to BigQuery
  • 74. pa.ag@peakaceag @basgr from @peakaceag74 Log file auditing is not a project, but a process! Integrate log file auditing into your regular SEO workflow; one-off audits are a good start, but they become truly invaluable when you combine them with web crawl data and perform them on an ongoing basis.
  • 75. pa.ag@peakaceag @basgr from @peakaceag75 bg@pa.ag Slides? No problem: https://pa.ag/smxl19logs You want our log file setup (for free)? e-mail us > hi@pa.ag Bastian Grimm twitter.com/peakaceag facebook.com/peakaceag www.pa.ag ALWAYS LOOKING FOR TALENT! CHECK OUT JOBS.PA.AG