
SearchLove London 2016 | Dom Woodman | How to Get Insight From Your Logs

4,225 views

In the SEO industry, we obsess on everything Google says, from John Mueller dropping a hint in a Webmaster Hangout, to the ranking data we spend £1000s to gather. Yet we ignore the data Google throws at us every day, the crawling data. For the longest time, site crawls, traffic data, and rankings have been the pillars of SEO data gathering. Log files should join them as something everyone is doing. We'll go through how to get everything set-up, look at some of the tools to make it easy and repeatable and go through the kinds of analysis you can do to get insights from the data.

Published in: Marketing

  1. 2009
  2. God it’s bad.
  3. -$1.5 Billion
  4. Why hasn’t Google seen the changes on my page?
  5. How should I prioritise errors in Search Console?
  6. Are my canonicals being respected?
  7. Does Google think this page is important?
  8. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  9-16. What does a log look like? 123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" One field is highlighted per slide: IP address, timestamp, request type, requested page (the homepage), protocol, status code, size of the page (in bytes), user agent.
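The log line dissected across these slides can be pulled apart programmatically. A minimal sketch in Python, using a regex for this combined-log-style format (the pattern and field names are my own, not from the talk):

```python
import re

# Regex for the combined-log-style line shown on the slides.
# Field names are illustrative, not from the deck.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] '
        '"GET /my_homepage HTTP/1.1" 200 2262 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
fields = match.groupdict()
print(fields["ip"])      # 123.65.150.10
print(fields["path"])    # /my_homepage
print(fields["status"])  # 200
```

Once each field has a name, filtering by status code or user agent becomes a one-liner rather than string surgery.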
  17. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  18. 5 things: 1 2 3 4 5
  19. 1. Diagnose crawling & indexation issues
  20-21. Five folders Googlebot crawled the most (number of requests)
  22. % of organic sessions vs. % of crawl budget
  23. 2. Prioritisation
  24. example.com/article
  25. Prioritizing 1 (Full vs. Print)
  26. example.com/article/full
  27. example.com/article/print
  28. Prioritizing 2
  29. example.com/article/pdf
  30. Prioritizing 3
  31. Prioritizing 1 (Full vs. Print)
  32. 3. Spot bugs & view site health
  33. Delayed errors with a limit of 1000
  34. 4. How important does Google think parts of your site are?
  35. My SEO was as bad as my design
  36. But at least my hair was better
  37. teflsearch.com
  38. teflsearch.com/job-results
  39. teflsearch.com/job-results/country/china
  40. teflsearch.com/jobadvert3455
  41. Average number of times Googlebot crawled a template
  42-43. 1. teflsearch.com 2. teflsearch.com/job-results 3. teflsearch.com/job-results/country/china 4. teflsearch.com/job-advert3455
  44. teflsearch.com/job-results
  45. Average number of times Googlebot crawled a template: 35%
  46. 5. How fresh does it think your content is?
  47. bit.ly/moz-fresh
  48. Average number of times a page template is crawled by Googlebot
  49. ● Improve our internal linking ● Build trust with last modified date in sitemap
  50. Recap: 1 2 3 4 5
  51. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  52. Talk to a developer and ask for information
  53. Are all the logs in one place?
  54. Email for a developer:
      Hi {x},
      I’m {x} from {y} and we’ve been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!). What we’d ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they’re spending their time, the status code errors they’re finding, etc. There are also some things that are really helpful for us to know when getting logs.
      Do the logs have any personal information in? We’re only concerned with the various search crawler bots like Google and Bing; we don’t need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
      Do you have any sort of caching which would create separate sets of logs? Is there anything like Varnish running on the server, or a CDN which might create logs in a different location to the rest of your server? If so then we will need those logs as well as those from the server. (Although we’re only concerned about a CDN if it’s caching pages, or serving from the same hostname; if you’re just using Cloudflare, for example, to cache external images then we don’t need it.)
      Are there any sub-parts of your site which log to a different place? Have you got anything like an embedded WordPress blog which logs to a different location? If so then we’ll need those logs as well.
      Do you log hostname? It’s really useful for us to be able to see the hostname in the logs. By default a lot of common server logging set-ups don’t log hostname, so if it’s not turned on, it would be very useful to have it turned on now for any future analysis.
      Is there anything else we should know?
      Best, {x}
  55. So we might have something that looks like this
  56. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  57-58. BigQuery
  59. Google’s online database for data analysis.
  60. What do we want from analysing our logs? 1. Ask powerful questions 2. Repeatable 3. Scalable 4. Combine with crawl data 5. Easy to set up 6. Easy to learn
  61. 9,000,000 rows of data for 2 months. 400-800 queries.
  62. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  63. Format the logs so we can import them into BigQuery. Separate the Googlebot logs from all the other logs.
  64. Screaming Frog Log Analyser / Code something
  65. Screaming Frog Log Analyser
  66. Code something
  67. bit.ly/logs-code
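The deck leaves the implementation to the Screaming Frog Log Analyser or the code at bit.ly/logs-code. As a minimal sketch of the "separate the Googlebot logs" step (function names are mine, not from the talk):

```python
def is_googlebot(log_line: str) -> bool:
    # First-pass filter on the user agent string. Caveat: user agents
    # can be spoofed, so for rigour you would also verify the client IP
    # with a reverse DNS lookup resolving to googlebot.com / google.com
    # (not shown in this sketch).
    return "Googlebot" in log_line

def split_logs(lines):
    """Separate Googlebot requests from all other requests."""
    googlebot, other = [], []
    for line in lines:
        (googlebot if is_googlebot(line) else other).append(line)
    return googlebot, other
```

The Googlebot half is what gets formatted and uploaded to BigQuery; the rest can be discarded for this analysis.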
  68. PART 1: THE WHY: What can you do with logs? PART 2: THE HOW: Getting logs, Analysing logs, Processing logs
  69. Our data in BQ
  70. We make sure we got what we wanted
  71. THE QUESTION: What is the total number of requests Googlebot makes each day to our site?
  72-81. Our first SQL query, built up a step per slide:
      SELECT timestamp FROM [mydata.log_analysis]
      SELECT DATE(timestamp) FROM [mydata.log_analysis]
      SELECT DATE(timestamp) as date FROM [mydata.log_analysis]
      SELECT DATE(timestamp) as date, count(*) FROM [mydata.log_analysis]
      SELECT DATE(timestamp) as date, count(*) FROM [mydata.log_analysis] GROUP BY date
      SELECT DATE(timestamp) as date, count(*) as number_of_requests FROM [mydata.log_analysis] GROUP BY date
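The finished query uses BigQuery's legacy SQL (hence the `[dataset.table]` brackets). The same date/count aggregation can be sanity-checked locally before spending query budget; here is a sketch on SQLite with made-up sample rows:

```python
import sqlite3

# Local stand-in for the BigQuery table from the slides.
# Table and column names mirror the deck; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_analysis (timestamp TEXT, uri TEXT)")
conn.executemany(
    "INSERT INTO log_analysis VALUES (?, ?)",
    [("2010-08-23 03:50:59", "/my_homepage"),
     ("2010-08-23 04:12:01", "/about"),
     ("2010-08-24 09:30:00", "/my_homepage")],
)

# Same shape as the final slide query: requests per day.
rows = conn.execute(
    """
    SELECT DATE(timestamp) AS date, COUNT(*) AS number_of_requests
    FROM log_analysis
    GROUP BY date
    ORDER BY date
    """
).fetchall()
print(rows)  # [('2010-08-23', 2), ('2010-08-24', 1)]
```

The output, one row per day with a request count, is exactly the series charted against Search Console crawl volume on the next slide.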
  82. Comparing logs to GSC crawl volume (number of requests)
  83. Run queries, find something weird, go look at the crawl & website
  84. Our data in BQ
  85. 1. Diagnose crawling & indexation issues
  86. 2. Prioritisation
  87. 3. Spot bugs & view site health
  88. 4. How important does Google think parts of your site are?
  89. 5. How fresh does it think your content is?
  90. 1. Diagnose crawling & indexation issues / 4. How important does Google think parts of your site are?
  91. What are the top 20 URLs crawled by Google over our logs?
  92. Login is my top crawled page, and then search?
  93. What are the top 20 page_path_1 folders crawled by Google over our logs?
  94. Location folders are taking more than 70% of my budget
  95. Getting data by the day:
      Page    Number of Googlebot requests
      page1   200,000
      page2   120,000
  96. Number of Googlebot requests day by day
  97. 3. Spot bugs & view site health
  98. How many of each status code does Google find per day over our logs?
  99. Number of Googlebot requests day by day
  100. What are the most requested 404 URLs by Googlebot over the past 30 days?
  101. Boy does it want that ad-tech snippet
  102. 5. How fresh does it think your content is?
  103. How many times on average is each page in a page template crawled per day?
  104. Average number of times a page template is crawled by Googlebot
  105-115. The full list of questions, built up one per slide:
      1. How long does it take for a page to be discovered after being published?
      2. What are the top 20 combinations of page_path_1 & page_path_2 folders crawled by Google over the time period of our logs?
      3. Which pages have requests from Googlebot, but don’t appear in our crawl?
      4. What are the top non-canonical pages being crawled?
      5. Which are the most crawled parameters on the website?
      6. How often are the most visited parameters crawled each day?
      7. Which directories have the most 301 & 404 error codes?
      8. Which pages are crawled with parameters and without parameters?
      9. Which pages are only partly downloaded?
      10. How many hits does each section get, when the sections are classified in an external dataset?
      11. What percentage of a directory was crawled over the past 30 days?
      12. What is the total number of requests across two different time periods?
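One of those questions, pages Googlebot requests that never appear in our own crawl, is a classic anti-join once log and crawl data sit in the same database. A SQLite sketch with illustrative table names and rows (not the deck's actual schema):

```python
import sqlite3

# Two toy tables: URLs seen in the server logs vs. URLs found by our
# own site crawl. Names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (uri TEXT);
    CREATE TABLE crawl (uri TEXT);
    INSERT INTO logs VALUES ('/a'), ('/b'), ('/orphaned-page');
    INSERT INTO crawl VALUES ('/a'), ('/b');
""")

# LEFT JOIN anti-join: keep log URLs with no matching crawl row.
orphans = conn.execute("""
    SELECT DISTINCT logs.uri
    FROM logs
    LEFT JOIN crawl ON logs.uri = crawl.uri
    WHERE crawl.uri IS NULL
""").fetchall()
print(orphans)  # [('/orphaned-page',)]
```

URLs surfacing here are pages Google can reach but your crawler cannot, often orphaned or externally linked pages worth investigating.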
  116. That’s a lot of questions
  117-120. bit.ly/logs-resource
  121. In Summary
  122. This is the thing you’re probably not doing
  123-124. bit.ly/logs-resource @dom_woodman
