This is the slide deck on how to perform log analysis with BigQuery. The companion guide, which has most of this information in written format, is here: https://www.distilled.net/resources/guide-to-log-analysis-with-big-query/
28. PART 1: THE WHY
What can you do with logs?
PART 2: THE HOW
Getting Logs
Analysing Logs
Processing Logs
31. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
IP Address
32. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Timestamp
33. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Request type
34. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Requested URL (here, the homepage)
35. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Protocol
36. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Status Code
37. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Size of the page (in bytes)
38. What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage
HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
User Agent
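A minimal sketch of how those fields could be pulled apart once the raw lines are in BigQuery - assuming they have been loaded into a single-column table; the table name (my_project.logs.raw_logs) and column name (line) are invented for illustration:

-- Sketch: split a combined-log-format line into the fields labelled above.
-- Table and column names are hypothetical.
SELECT
  REGEXP_EXTRACT(line, r'^(\S+)') AS ip_address,
  REGEXP_EXTRACT(line, r'\[([^\]]+)\]') AS raw_timestamp,
  REGEXP_EXTRACT(line, r'"(\S+) \S+ \S+"') AS request_type,   -- e.g. GET
  REGEXP_EXTRACT(line, r'"\S+ (\S+) \S+"') AS url_path,       -- e.g. /my_homepage
  REGEXP_EXTRACT(line, r'"\S+ \S+ (\S+)"') AS protocol,       -- e.g. HTTP/1.1
  CAST(REGEXP_EXTRACT(line, r'" (\d{3}) ') AS INT64) AS status_code,
  CAST(REGEXP_EXTRACT(line, r'" \d{3} (\d+)') AS INT64) AS response_bytes,
  REGEXP_EXTRACT(line, r'"([^"]*)"$') AS user_agent
FROM `my_project.logs.raw_logs`;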
39. PART 1: THE WHY
What can you do with logs?
PART 2: THE HOW
Getting Logs
Analysing Logs
Processing Logs
83. Hi x
I’m {x} from {y}. We’ve been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).
What we’d ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they’re spending their time, the status code errors they’re finding, etc.
There are also some things that are really helpful for us to know when getting logs.
Do the logs have any personal information in them?
We’re only concerned with the various search crawler bots like Google and Bing; we don’t need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
Do you have any sort of caching which would create separate sets of logs?
Is there anything like Varnish running on the server, or a CDN, which might create logs in a different location to the rest of your server? If so, we will need those logs as well as those from the server. (Although we’re only concerned about a CDN if it’s caching pages or serving from the same hostname; if you’re just using Cloudflare, for example, to cache external images, then we don’t need it.)
Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we’ll need those logs as well.
Do you log hostname?
It’s really useful for us to be able to see the hostname in the logs. By default, a lot of common server logging set-ups don’t log the hostname, so if it’s not turned on, it would be very useful to have it turned on now for any future analysis.
Is there anything else we should know?
Best,
{x}
Email for a developer
84. So we might have something that looks like this
85. PART 1: THE WHY
What can you do with logs?
PART 2: THE HOW
Getting Logs
Analysing Logs
Processing Logs
93. What do we want from analysing our logs?
1. Ask powerful questions
2. Repeatable
3. Scalable
4. Combine with crawl data (see the sketch below)
5. Easy to set up
6. Easy to learn
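As a hedged sketch of point 4 (combine with crawl data) - assuming the logs have already been parsed into a table with url_path and user_agent columns, and that a crawl export has been loaded alongside it; all table and column names here are invented for illustration - a single JOIN answers questions like "which pages does Googlebot request that don’t appear in our crawl?":

-- Sketch: pages Googlebot requests that are missing from a site crawl.
-- Table and column names are hypothetical.
SELECT
  logs.url_path,
  COUNT(*) AS googlebot_hits
FROM `my_project.logs.parsed_logs` AS logs
LEFT JOIN `my_project.logs.crawl_export` AS crawl
  ON logs.url_path = crawl.url_path
WHERE logs.user_agent LIKE '%Googlebot%'
  AND crawl.url_path IS NULL   -- no matching row in the crawl
GROUP BY logs.url_path
ORDER BY googlebot_hits DESC;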
145. How long does it take for a page to be discovered after being published?
146. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
147. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
148. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
149. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
150. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
151. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
152. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
Which pages are crawled with parameters and without parameters?
153. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
Which pages are crawled with parameters and without parameters?
Which pages are only partly downloaded?
How many hits does each section get, when the sections are classified in an
external dataset?
154. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
Which pages are crawled with parameters and without parameters?
Which pages are only partly downloaded?
How many hits does each section get, when the sections are classified in an
external dataset?
What percentage of a directory was crawled over the past 30 days?
155. How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders
crawled by Google over the time period of our logs?
Which pages have requests from Googlebot, which don’t appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
Which pages are crawled with parameters and without parameters?
Which pages are only partly downloaded?
How many hits does each section get, when the sections are classified in an
external dataset?
What percentage of a directory was crawled over the past 30 days?
What is the total number of requests across two different time periods?
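To show what answering one of these looks like in practice, here is a minimal sketch for the "top 20 combinations of page_path_1 & page_path_2" question - again assuming a parsed logs table with url_path and user_agent columns (names invented for illustration):

-- Sketch: top 20 first/second folder combinations crawled by Googlebot.
-- Table and column names are hypothetical.
SELECT
  REGEXP_EXTRACT(url_path, r'^/([^/]+)') AS page_path_1,
  REGEXP_EXTRACT(url_path, r'^/[^/]+/([^/]+)') AS page_path_2,
  COUNT(*) AS googlebot_hits
FROM `my_project.logs.parsed_logs`
WHERE user_agent LIKE '%Googlebot%'
GROUP BY page_path_1, page_path_2
ORDER BY googlebot_hits DESC
LIMIT 20;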
Start as an actual story
Can I have the house salad please?
Greek or lentils?
Olives or no olives?
Green or black?
Stone or no stones?
Vinaigrette?
Balsamic or Caesar?
Balsamic
Do you want rocket?
I would like a salad
Ask for PII to be removed - how many logs - the dates?
The Good
You can customize for more complicated logging formats
You can use reverse DNS lookup and ASN lookup
You can work with log datasets that are too large to download to your computer
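Reverse DNS and ASN lookups themselves happen outside BigQuery (typically while processing the logs), but once the verified IPs are loaded into a table they can be joined straight onto the logs. A minimal sketch, with all table and column names invented for illustration:

-- Sketch: hits per status code, keeping only requests from IPs that a
-- reverse DNS check (done during processing) confirmed as genuine Googlebot.
-- Table and column names are hypothetical.
SELECT
  logs.status_code,
  COUNT(*) AS verified_googlebot_hits
FROM `my_project.logs.parsed_logs` AS logs
JOIN `my_project.logs.verified_googlebot_ips` AS verified
  ON logs.ip_address = verified.ip_address
GROUP BY logs.status_code
ORDER BY verified_googlebot_hits DESC;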
This is the summation of years’ worth of work - I can’t fit it into a 40-minute presentation, so I’ve put resources here. Don’t worry if you get lost; it’s all here.