Advanced data-driven technical SEO - SMX London 2019


My deck from SMX London 2019 on merging logfiles with data from GA, GSC and web crawling for better SEO insights.

Published in: Internet

  1. 1. @basgr from @peakaceag1 Bastian Grimm, Peak Ace AG | @basgr Merging your logfiles, GA, GSC & web crawl data for better SEO insights Advanced data-driven technical SEO
  2. 2. @basgr from @peakaceag2 And why are log files important for your SEO work? Why should you care?
  3. 3. @basgr from @peakaceag3 I am a big fan of the various crawling tools, but… It’s only the access log files that demonstrate how a search engine’s crawler is behaving on your site; all crawling tools are simply trying to simulate their behaviour!
  4. 4. @basgr from @peakaceag4 You need to see which pages are being prioritised by the search engines and should therefore be considered the most important 1. Understand crawl priorities
  5. 5. @basgr from @peakaceag5 Google may reduce its crawling behaviour/frequency & eventually rank you lower if you are constantly providing a large amount of errors 2. Prevent reduced crawling
  6. 6. @basgr from @peakaceag6 It’s essential to identify any crawl shortcomings (such as hierarchy or internal link structure) with potential site-wide implications 3. Understand global issues
  7. 7. @basgr from @peakaceag7 You need to ensure that Google crawls everything important: primarily ranking, relevant content, but also fresh & older items 4. Ensure proper crawling
  8. 8. @basgr from @peakaceag8 It’s important to ensure that any gained link equity will always be passed using proper links and/or redirects 5. Ensure proper linking
  9. 9. @basgr from @peakaceag9 Keep in mind, details depend on the individual setup! The characteristics of a log file
  10. 10. @basgr from @peakaceag10 …depending on your webserver (Apache, nginx, IIS, etc.), caching and its configuration. Make sure to understand your setup first! Content & structure can vary…
  11. 11. @basgr from @peakaceag11 What does a log file usually look like? (1) Server IP/host name, (2) Timestamp (date & time), (3) Method (GET/POST/HEAD), (4) Request URL, (5) HTTP status code, (6) Size in bytes, (7) Referrer, (8) User-agent. Example: [21/May/2019:02:00:00 -0100] "GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
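
To make the eight fields tangible, here is a minimal sketch of a parser. It assumes the common Apache/nginx "combined" log format; as the slides stress, your own setup may differ. The sample line is hypothetical: the IP address, the "-" referrer and the full user-agent string are illustrative values, not taken from the deck.

```python
import re
from datetime import datetime

# Assumption: "combined" log format; adjust the pattern to your server config.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical sample line (IP, referrer and user-agent are illustrative).
SAMPLE = (
    '66.249.66.1 - - [21/May/2019:02:00:00 -0100] '
    '"GET /resources/whitepapers/seo-whitepaper/ HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

entry = parse_line(SAMPLE)
print(entry["method"], entry["url"], entry["status"], entry["user_agent"])
```

Lines that do not match return None, so unexpected formats (e.g. from a CDN) surface early instead of being silently misread.
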
  12. 12. @basgr from @peakaceag12 Log file data can be quite overwhelming because you can do so many different things; make sure you’ve got your questions prepared! You need to ask the right questions!
  13. 13. @basgr from @peakaceag13 Log file data can differ from e.g. Google Analytics data: while log files are direct, server-side records, Google Analytics relies on client-side code. As the data sets come from two different sources, they can differ! The configuration within Google Analytics (e.g. filters) also leads to differences when compared to the log files.
  14. 14. @basgr from @peakaceag14 Be cautious when requesting log files from your clients Frequently asked questions
  15. 15. @basgr from @peakaceag15 We only care about crawlers such as Google and Bing; no need for any user data (operating system, browser, phone number, usernames, etc.) 1. Personal information in logs?
  16. 16. @basgr from @peakaceag16 If you are running a cache server and/or a CDN which creates logs elsewhere, we will also need these logs 2. Separate multi-location logs?
  17. 17. @basgr from @peakaceag17 There are different ways you could approach this: Log file auditing tools
  18. 18. @basgr from @peakaceag18 There are different ways you could approach this:
  19. 19. @basgr from @peakaceag19 Do-it-yourself solution based on Excel You’d have to manually build filtering, cross-references, etc. – it just doesn’t scale!
  20. 20. @basgr from @peakaceag20 Screaming Frog Log File Analyser Beginners’ level, desktop-based log file auditing with pre-defined reports.
  21. 21. @basgr from @peakaceag21 No sharing capabilities, log files need to be manually up/downloaded, which is usually problematic for larger files, etc. Desktop solutions are limited
  22. 22. @basgr from @peakaceag22 Splunk or Sumo Logic: proprietary, paid software solutions Enterprise tools such as Splunk usually come with a hefty (volume-based) price tag. In all fairness though: these solutions offer features way beyond log file monitoring! Image sources: (splunk) & (sumologic)
  23. 23. @basgr from @peakaceag23 The Elastic Stack (ELK): Elasticsearch, Logstash & Kibana Elasticsearch: search & analytics engine, Logstash: server-side data processing pipeline, Kibana: data visualisation (charts, graphs, etc.) – all open source. Image source:
  24. 24. @basgr from @peakaceag24 Other SaaS solutions:, (ELK) & Loggly Especially, which has a strong focus on SEO-based auditing (dashboards etc.).
  25. 25. @basgr from @peakaceag25 crawlOPTIMIZER: SaaS Logfile Auditing, made in Vienna BRPs (Business Relevant Pages) with dedicated evaluations of these as top USP.
  26. 26. @basgr from @peakaceag26 No messing around with exports, up/downloads, easy sharing capabilities and the ability to deal with massive volumes, etc. The beauty of SaaS: almost real time
  27. 27. @basgr from @peakaceag27 For an easy start: trend monitoring (over time) & gathering insights Let’s have a look at some data
  28. 28. @basgr from @peakaceag28 Most obvious approach: spotting anomalies vs. time frame Tip: this is why it makes a lot of sense to check your log files regularly (e.g. daily). This looks unusual; take it as a starting point for further investigation.
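
Spotting anomalies does not need a dashboard to get started. The sketch below flags days whose request count is far off the median; the counts, the dates and the factor of 2 are hypothetical choices for illustration only.

```python
from statistics import median


def flag_anomalies(counts, factor=2.0):
    """Return the days whose count is more than `factor` times above or below the median."""
    baseline = median(counts.values())
    return sorted(
        day for day, n in counts.items()
        if n > baseline * factor or n < baseline / factor
    )


# Hypothetical Googlebot requests per day.
counts = {
    "2019-05-17": 1040, "2019-05-18": 980, "2019-05-19": 1010,
    "2019-05-20": 995, "2019-05-21": 3120, "2019-05-22": 1025,
}

anomalies = flag_anomalies(counts)
print(anomalies)
```

A flagged day is only a starting point for further investigation, not a diagnosis.
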
  29. 29. @basgr from @peakaceag29 User crawling frequencies over time Understanding patterns and irregularities can be very helpful - always look at the crawl behaviour of individual users over time.
  30. 30. @basgr from @peakaceag30 Use log files to look for spam bots or scrapers to block! What other ”bots“ access your site?
  31. 31. @basgr from @peakaceag31 Not everyone is who they claim to be! The easiest way to detect if Googlebot really is Googlebot: run a reverse DNS lookup. Bingbot can also be verified via * Source:
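
A minimal sketch of that check: reverse DNS lookup, hostname check, then a forward lookup that must return the original IP. The two hostname suffixes are the ones Google documents for its crawlers; the function name and the injectable resolver arguments are assumptions made so the example can run without network access. Note that `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Hostname suffixes Google documents for Googlebot.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse DNS lookup, then a forward lookup that must contain the same IP."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Offline demo with stub resolvers (hypothetical data, no DNS traffic).
def stub_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])


def stub_forward(hostname):
    return (hostname, [], ["66.249.66.1"])


print(is_verified_googlebot("66.249.66.1", stub_reverse, stub_forward))
```

In real use, call `is_verified_googlebot(ip)` without the stubs and cache the result per IP, since DNS lookups are slow at log-file scale.
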
  32. 32. @basgr from @peakaceag32 Which pages does Googlebot crawl most? Understand whether these are really your most valuable pages, and verify that they coincide with your domain’s most important ones.
  33. 33. @basgr from @peakaceag33 Breakdown of crawl requests & status codes per directory You’d easily see if one of your main directories encountered crawling/response issues. Tip: establish this on a regular basis to ensure continued performance of top directories.
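
A rough sketch of such a breakdown, assuming the log lines have already been reduced to (URL, status code) pairs. The helper names and the sample requests are hypothetical; "directory" here simply means the first path segment.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def top_directory(url):
    """Return the first path segment as '/segment/', or '/' for root-level URLs."""
    path = urlsplit(url).path
    parts = path.strip("/").split("/")
    # A lone segment without a trailing slash is treated as a file in the root.
    if len(parts) == 1 and not path.endswith("/"):
        return "/"
    return "/" + parts[0] + "/" if parts[0] else "/"


def status_by_directory(requests):
    """Count status classes (2xx, 3xx, ...) per top-level directory."""
    breakdown = defaultdict(Counter)
    for url, status in requests:
        breakdown[top_directory(url)][f"{status // 100}xx"] += 1
    return breakdown


# Hypothetical crawler requests: (URL, status code).
requests = [
    ("/blog/post-1/", 200), ("/blog/post-2/", 404),
    ("/shop/item?id=3", 200), ("/shop/old/", 301),
    ("/shop/cart/", 500), ("/", 200),
]

breakdown = status_by_directory(requests)
for directory, classes in sorted(breakdown.items()):
    print(directory, dict(classes))
```
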
  34. 34. @basgr from @peakaceag34 And respective actions based on those findings Advanced auditing for SEO
  35. 35. @basgr from @peakaceag35 1. Redirects
  36. 36. @basgr from @peakaceag36 Identify any kind of “wrong” redirect: 302/307/308 (304 also shows up among the 3xx codes, but it means “Not Modified” and is not a redirect) Action: change to 301 (except geo redirects); also watch out for redirect chains! Investigate further to see what’s in there
  37. 37. @basgr from @peakaceag37 2. Crawl errors
  38. 38. @basgr from @peakaceag38 4xx client errors: too many are a sign of poor site health Action: recover (200), redirect (301) or kill off entirely (410)
  39. 39. @basgr from @peakaceag39 Googlebot can’t log in… (403: Forbidden) If it’s linked, Google will try to crawl it – they are greedy!
  40. 40. @basgr from @peakaceag40 5xx server errors: usually infrastructure-related Action: watch closely and/or talk to IT (server availability, high load, etc.) Check consistency; what happens when re-trying?
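
The redirect and crawl error actions above can be turned into a simple triage list. This is only a sketch: the function names and sample data are hypothetical, and two choices are assumptions on my side, namely treating an existing 410 as intentional and leaving 304 (Not Modified, a cache response) alone.

```python
from collections import defaultdict


def suggested_action(status):
    """Map a status code to the follow-up named on the slides (None = nothing to do)."""
    if status in (302, 307, 308):
        return "redirect: change to 301 unless intentional (e.g. geo redirect)"
    if status == 410:
        return None  # assumption: a 410 was set on purpose
    if 400 <= status <= 499:
        return "4xx: recover (200), redirect (301) or kill off (410)"
    if 500 <= status <= 599:
        return "5xx: watch closely, re-test and talk to IT"
    return None  # 200, 301, 304 (Not Modified) and similar


def triage(requests):
    """Group crawled URLs by suggested action."""
    todo = defaultdict(set)
    for url, status in requests:
        action = suggested_action(status)
        if action:
            todo[action].add(url)
    return {action: sorted(urls) for action, urls in todo.items()}


# Hypothetical crawler requests: (URL, status code).
requests = [
    ("/old-page/", 302), ("/old-page/", 302), ("/missing/", 404),
    ("/account/", 403), ("/api/report", 503), ("/home/", 200),
    ("/gone/", 410), ("/logo.png", 304),
]

for action, urls in triage(requests).items():
    print(action, urls)
```
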
  41. 41. @basgr from @peakaceag41 3. Crawl priority
  42. 42. @basgr from @peakaceag42 Understanding the most/least crawled URLs and folders Action: highly crawled pages/folders can be used for additional internal linking (add link hubs) to improve discovery of other content; rarely crawled areas are clearly weak, so they are either irrelevant (remove) or need more attention (link them more prominently).
  43. 43. @basgr from @peakaceag43 4. Last crawled
  44. 44. @basgr from @peakaceag44 Investigate if (new) URLs have been crawled at all Action: if relevant URLs haven’t been discovered/crawled at all, your internal linking is probably too weak. Consider XML sitemaps, better/more prominent linking, etc. If these are important URLs, you might have a problem!
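
One way to check this, sketched under the assumption that you have a list of URLs that should be crawled (e.g. from an XML sitemap) and (URL, date) pairs of Googlebot hits from the logs. All names and data below are hypothetical.

```python
from datetime import date


def last_crawled(hits):
    """Return the most recent crawl date per URL."""
    latest = {}
    for url, day in hits:
        if url not in latest or day > latest[url]:
            latest[url] = day
    return latest


def never_crawled(expected_urls, latest):
    """Return expected URLs that do not appear in the crawl data at all."""
    return sorted(url for url in set(expected_urls) if url not in latest)


# Hypothetical data: sitemap URLs and Googlebot hits taken from the logs.
sitemap_urls = ["/products/a/", "/products/b/", "/products/new-c/"]
hits = [
    ("/products/a/", date(2019, 5, 20)),
    ("/products/a/", date(2019, 5, 21)),
    ("/products/b/", date(2019, 4, 2)),
]

latest = last_crawled(hits)
print(never_crawled(sitemap_urls, latest))
print(latest["/products/a/"])
```

URLs with a very old last crawl date (like the second product here) deserve the same attention as URLs that were never crawled.
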
  45. 45. @basgr from @peakaceag45 5. Crawl waste
  46. 46. @basgr from @peakaceag46 I‘m sure you‘ve all seen this? Source:
  47. 47. @basgr from @peakaceag47 This is what the Google Webmaster Central blog says: Source: Wasting server resources on pages […] will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.
  48. 48. @basgr from @peakaceag48 If you have ever had to deal with sites like these… Properly dealing with >30,000,000 crawlable URLs (due to parameter usage) certainly makes a difference in organic performance!
  49. 49. @basgr from @peakaceag49 URL parameters cause most problems (Combined) URL parameters often generate millions of unnecessary URLs, especially for large domains, which Googlebot diligently crawls (once found).
  50. 50. @basgr from @peakaceag50 URL parameter behaviour over time Constantly be on the lookout for new parameters as well as significantly increased crawling for known parameters.
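
A small sketch of that lookout: count how often each parameter name is crawled, then compare two periods. The function names and the sample URLs are hypothetical; only the standard library is used.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_counts(urls):
    """Count how often each query parameter name occurs in the crawled URLs."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        for key, _value in parse_qsl(query, keep_blank_values=True):
            counts[key] += 1
    return counts


def new_parameters(previous, current):
    """Return parameter names seen in the current period but not in the previous one."""
    return sorted(set(current) - set(previous))


# Hypothetical crawled URLs for two consecutive weeks.
last_week = ["/shop/?sort=price", "/shop/?sort=name&page=2"]
this_week = [
    "/shop/?sort=price&color=red",
    "/shop/?sessionid=abc",
    "/shop/?page=3&sort=price",
]

before = parameter_counts(last_week)
after = parameter_counts(this_week)
print(new_parameters(before, after))
print(after.most_common())
```

Comparing the counts of known parameters between periods in the same way shows significantly increased crawling.
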
  51. 51. @basgr from @peakaceag51 A brief overview: #SMXInsights 01 No one-size-fits-all solution: log file size, quantity & availability are all decisive with regard to tool selection. 02 Preparation is key: concrete questions help to generate efficient analysis. 03 Crawl data only: be precise with your requests (to the IT department); you just want to know what the search engines are doing! 04 Reverse DNS use: not every crawler is who they pretend to be, so do not "blindly" trust the user-agent string. 05 URL parameters: these are almost always the biggest problem (combinations, order, consistency), so audit them first.
  52. 52. @basgr from @peakaceag52 Oh yeah, there’s one more thing …
  53. 53. @basgr from @peakaceag53 I want: no IT involvement, unlimited scalability, flexible reporting, multiple (API) data sources and ease of use! There's got to be another way!
  54. 54. @basgr from @peakaceag54 (And everyone at #SMX gets this as a gift - for free!) We've thought of something:
  55. 55. @basgr from @peakaceag55 Say hello to the Peak Ace log file auditing stack: log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and visualised in Data Studio via the BigQuery Connector. [Diagram: components shown are log files, GA (GA API v4), GSC (GSC API v3), DeepCrawl API, Google Apps Script, Google Dataprep, Google BigQuery and Google Data Studio; stages are labelled Import, Data transmission and Display data.]
  56. 56. @basgr from @peakaceag56 Individual reports, tailored to your needs And what do the results look like?
  60. 60. @basgr from @peakaceag60 Connect and conquer… How does it work?
  61. 61. @basgr from @peakaceag61 #1 Log file data from web servers, CDN, cache, etc. How often do bots actually crawl? What do they crawl and when? Goal: improve site architecture by analysing real bot crawling data. ▪ Amount of crawls/requests by bot type ▪ Identification of crawling patterns ▪ Overview of errors (3xx, 4xx, 5xx). Log files → Google Cloud Storage: import as text files (exclude IP addresses!)
  62. 62. @basgr from @peakaceag62 Size is absolutely NOT an issue: 15TB (per single file) can be pushed into BigQuery
  63. 63. @basgr from @peakaceag63 Stand-alone files are messy, agreed: nginx / Apache / etc. >> fluentd >> BigQuery
  64. 64. @basgr from @peakaceag64 #2 Google Analytics API Enrich reports with traffic, engagement, behavioural and page speed data Goal: compare crawling behaviour with user & loading time data. URL-based data on important engagement metrics: ▪ Sessions ▪ Users ▪ Bounce rate ▪ Session duration ▪ Avg. time on page ▪ Avg. server response time ▪ Avg. page load time ▪ … Google Analytics Reporting API v4
  65. 65. @basgr from @peakaceag65 #3 Google Search Console API Organic search performance data directly from Google Goal: compare crawling behaviour with organic click data & e.g. retrieve reported crawling errors. Organic click data ▪ Clicks ▪ Impressions ▪ Device ▪ … URL-based server response data ▪ Status code Google Search Console API v3
  66. 66. @basgr from @peakaceag66 #4 DeepCrawl API Website architecture, status codes, indexing directives, etc. Goal: capture indexing directives, response codes and more. DeepCrawl API
  67. 67. @basgr from @peakaceag67 #5 Google Apps Scripts for GA, GSC & DeepCrawl API access: capture multiple dimensions and metrics from GA, retrieve GSC crawl and search analysis data and DeepCrawl crawl & analysis data Source: Goal: send data (via/from the respective API) to BigQuery and store the data there. Google Apps Script
  68. 68. @basgr from @peakaceag68 #6 Google Cloud Dataprep Clean and process the data. Afterwards, combine these various sources with several joins so that they‘re ready for visualisation. Source: Goal: combine data from log files, GSC, GA & DeepCrawl within/by using processing flows. Dataprep: “Excel with super rocket fuel“ ▪ Amazing RegEx support ▪ Select data, receive automated proposals for processing ▪ Join data sources by e.g. full inner/outer join, left/right outer join… Google Apps Script
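
The deck does this join in Dataprep. Purely to illustrate the idea of a left outer join on the URL, here is a plain-Python sketch; it is not how the Peak Ace stack is implemented, and all names and numbers are hypothetical.

```python
def join_by_url(crawls, sessions, clicks):
    """Left outer join on URL: keep every crawled URL, add GA and GSC values if present."""
    rows = []
    for url in sorted(crawls):
        rows.append({
            "url": url,
            "googlebot_hits": crawls[url],
            "ga_sessions": sessions.get(url),  # None when GA has no row for the URL
            "gsc_clicks": clicks.get(url),     # None when GSC has no row for the URL
        })
    return rows


# Hypothetical per-URL data from log files, Google Analytics and Search Console.
crawls = {"/blog/": 120, "/shop/?sort=price": 950, "/about/": 4}
sessions = {"/blog/": 3400, "/about/": 210}
clicks = {"/blog/": 1800}

rows = join_by_url(crawls, sessions, clicks)

# URLs that are crawled a lot but show up in neither GA nor GSC.
suspects = [
    row["url"] for row in rows
    if row["googlebot_hits"] > 100
    and row["ga_sessions"] is None
    and row["gsc_clicks"] is None
]
print(suspects)
```

Keeping missing values as None rather than 0 makes it clear whether a URL had no traffic or simply no data in that source.
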
  69. 69. @basgr from @peakaceag69 And use Google Data Studio to visualise: Save everything to BigQuery
  74. 74. @basgr from @peakaceag74 Log file auditing is not a project, but a process! Integrate log file auditing into your regular SEO workflow; one-off audits are good to begin with, but they really become invaluable if you combine them with web crawl data and perform them on an on-going basis.
  75. 75. @basgr from @peakaceag75 Slides? No problem: You want our log file setup (for free)? e-mail us > Bastian Grimm