SlideShare a Scribd company logo
Prepare for walls of text.
Server Logs
After Excel Fails
@ohgm
About Me
• Former Senior Technical Consultant @ builtvisible.
• Now Freelance Technical SEO Consultant.
• @ohgm on Twitter.
• ohgm.co.uk for my webzone.
What I’d like to do today
1. Talk about access logs.
2. Show you some command line tools.
3. Show you some ways to apply these tools to
common scenarios.
4. Sit back down.
This talk is on the first significant difficulty spike in
server log analysis – having too much information.
Assumptions.
Assumptions
1. Your client is retaining their logs.
2. You don’t have access to your client’s
server.
What is an Access.log?
ohgm.co.uk 162.158.93.95 - - [11/Apr/2016:10:14:20 +0100] "GET /wmt-crawl-
representative-url-transfer-link-equity/ HTTP/1.1" 200 7976 "-" "Mozilla/5.0
(compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" 162.158.93.95
ohgm.co.uk 108.162.219.171 - - [11/Apr/2016:10:15:07 +0100] "GET /feed/ HTTP/1.1"
200 136953 "-" "Flamingo_SearchEngine (+http://www.flamingosearch.com/bot)"
108.162.219.171
ohgm.co.uk 108.162.219.176 - - [11/Apr/2016:10:22:54 +0100] "GET /wayback-
machine-seo HTTP/1.1" 200 9079 "http://www.traackr.com/" "Traackr.com"
108.162.219.176
ohgm.co.uk 173.245.55.114 - - [11/Apr/2016:10:23:35 +0100] "GET /author/ohgm/
HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" 173.245.55.114
ohgm.co.uk 173.245.55.123 - - [11/Apr/2016:10:23:42 +0100] "GET / HTTP/1.1" 200
6812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
173.245.55.123
ohgm.co.uk1 173.245.55.1232 - - [11/Apr/2016:10:23:42 +01003]
"GET4 /please5 HTTP/1.16" 2007 68128 "-9" "Mozilla/5.0
(compatible; Googlebot/2.1;
+http://www.google.com/bot.html)10" 173.245.55.12311
1. The host responding to the
request.
2. The IP that serviced the
request.
3. The date and time of the
request.
4. The HTTP method:
GET, POST, PUT, HEAD, or
DELETE.
5. The resource requested.
6. The HTTP Version
{HTTP/1.0|HTTP/1.1|HTTP/2}
7. The server response.
8. The download size.
9. The referring URL.
10. The reported User-Agent.
11. The IP that made the
request.
Configurations vary substantially.
Why SEOs Like Them.
There is a lack of overlap between server logs and
crawl simulation tools.
Access logs show what’s being accessed rather than
what’s simply accessible.
We find correlation between crawl efficiency
improvements and organic performance. Access logs
are one of the best tools for identifying crawl waste.
Why ‘Excel Fails’?
Microsoft Excel currently supports
1,048,576 rows of data.
There are no plans to increase this.
Agency Scenario
Your manager has sold a Server Log Analysis
project, requesting 1 month of access logs
from the client, a UK high street retailer.
You receive 15 access_log.gz files, totalling
17.6GB. Excel won’t open any of them. You
don’t know it yet, but they are unfiltered.
Good Luck.
We also load balance on 6 servers.
“Just use a sample.”
“How do I even get a sample?”
Command Line
Tools
Advantages of Command Line
Tools.
• They’re fast.
• They’re not in the cloud.
• The main limit is you, not a development queue.
Disadvantages of Command Line
Tools.
• They’re scary at first.
• You can delete your computer.
• Don’t delete your computer.
Installation
If you’re on Mac, you’re ready.
If you’re on Linux, you’re ready.
If you’re on Windows, you
probably aren’t ready*.
*Unless ‘Ubuntu on Windows’ becomes
part of the non-developer release.
1. Windows Update > Update Settings >
Advanced > Get Insider Preview
Builds.
2. Install Build 14316 or greater.
3. Enable ‘Windows Subsystem for Linux
(Beta)’.
4. Open cmd and type ‘bash’.
5. Type ‘y’ and hit enter at the prompt.
1. Windows Update > Update Settings >
Advanced > Get Insider Preview
Builds.
2. Install Build 14316 or greater.
3. Enable ‘Windows Subsystem for Linux
(Beta)’.
4. Open cmd and type ‘bash’.
5. Type ‘y’ and hit enter at the prompt.
No
thanks.
Install GNU ON WINDOWS
(or Cygwin) instead.
Installation
Done
Getting Started
• Navigate to the folder
containing the
downloaded files.
• Open your chosen
terminal (cmd,
terminal, or bash).
CTRL+SHIFT+Rclick inside a folder is
an alternate method.
~$ type-commands-here
Then hit enter.
~$ echo hello.
hello.
The Title of The Talk
Was a Lie and We’re
Going to Try to Use
Excel Anyway. Sorry.
Sorry about the walls of text.
Server Logs
Until Excel Fails
@ohgm
Combining
Files.
Combine Multiple Log Files
We navigate to a folder containing all our server logs,
open the terminal, and type:
~$ cat *.log >> combined.log
“Take every .log file in the folder and append each to
combined.log”
But they gave me files in lots of
different folders.
Combine Multiple Files in Multiple
Folders
~$ find . -name ‘*.log’ -exec cat {} >> combined.log ;
“Search the current folder, and all subfolders for
filenames ending with ‘.log’. Append the contents of
these files to a new file called combined.log.”
gfind in GOW
They’re compressed.
Multiple times.
Combine Multiple Files in Multiple
Folders Some of Which are compressed
~$ find . -name *.gz -exec gzip -dkr {} +
&& find . -name ‘*.log’ -exec cat {} >> combined.log ;
“Find all the files with the .gz extension beneath the current
folder.
Recursively Decompress all files. Keep the originals.
Once finished, find all the .log files, append them to a new
combined.log file.”
Preview Huge Files with less
less streams the contents of a file to the terminal
without loading the whole file into memory.
$~ less combined.log
You can use less to review access logs without
crippling your machine.
R.T.F.MREAD THE FRIENDLY MANUAL
RTFM
If at any time you get stuck:
~$ toolname --help
or
~$ man toolname
or
Google what you are trying to do.
The --help ( often -h) flag will usually give you what you need to know.
‘man’ (manual) tends to be much more in depth. Both are read from the
command line.
We now have
one large file.
UA Filtering: Googlebot
Our combined access logs are in a single file:
combined.log – 16.4GB
Too large to open in Excel.
Too large to open in Notepad.
Examining it with less?
It’s too full of filthy human data.
We need to cut it down to Googlebot.
grep is a tool that extracts lines from text files based on a
regular expression. Using grep is pretty simple:
~$ grep [options] [pattern] [file]
…
~$ grep ‘Googlebot’ combined.log
“Give me all the lines containing Googlebot in
combined.log”
Press Enter.
grep
We forgot to store it somewhere.
Filtering Scenario: Googlebot
So we’ll store this output to a new file using ‘>>’:
~$ grep ‘Googlebot’ combined.log >> googlebot.log
“Append all lines in combined.log that contain
Googlebot into a new file, googlebot.log”
Like other tools, grep has a number of optional
argument flags. The count flag ‘-c’ can provide a
useful summary for direct questions:
~$ grep [options] [pattern] [file]
…
~$ grep -c “POST /wp-login” april.log
“Show me the count of login attempts in April on
ohgm.co.uk”
grep
Filtering Scenario – Googlebot
You can’t just verify Googlebot by name.
Apparently some people aren’t honest on the internet.
IP Filtering
Filtering Scenario – IP Ranges
Start End
64.233.160.0 64.233.191.255
66.102.0.0 66.102.15.255
66.249.64.0 66.249.95.255
72.14.192.0 72.14.255.255
74.125.0.0 74.125.255.255
209.85.128.0 209.85.255.255
216.239.32.0 216.239.63.255
If we were masochistic, we could write a
regular expression to capture these all…
Filtering Scenario – IP Ranges
The -E flag lets grep use Extended Regular Expressions.
~$ grep -E "((b(64).233.(1([6-8][0-9]|9[0-
1])))|(b(66).102.([0-9]|1[0-5]))|(b(66).249.(6[4-
9]|[7-8][0-9]|9[0-5]))|(b(72).14.(1(9[2-9])|2([0-4][0-
9]|5[0-5])))|(b(74).125.(25[0-5]|2[0-4][0-9]|[01]?[0-
9][0-9]?))|(209.85.(1(2[8-9]|[3-9][0-9])|2([0-4][0-
9]|5[0-5])))|(216.239.(25[0-5]|2[0-4][0-9]|[01]?[0-
9][0-9]?))).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
GbotUA.log > GbotIP.log
This shouldn’t work, but it does*.
*WOMM
Filtering Scenario – Impostors
The -v flag inverts the grep query to find impostors:
~$ grep -vE "((b(64).233.(1([6-8][0-9]|9[0-
1])))|(b(66).102.([0-9]|1[0-5]))|(b(66).249.(6[4-9]|[7-
8][0-9]|9[0-5]))|(b(72).14.(1(9[2-9])|2([0-4][0-9]|5[0-
5])))|(b(74).125.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?))|(209.85.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-
5])))|(216.239.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?))).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" GbotUA.log >
Impostors.log
“Give me every request that claims to be Googlebot, but
doesn’t come from this IP range. Put them in an impostors
file.”
Filtering Scenario – Verifying
Googlebot
• Disclaimer: don’t blindly use awful regex (check
with Regexr) or IP ranges, especially if you’re
analysing logs for a site using IP detection for
international SEO purposes. Read more about
Googlebot’s Geo-distributed Crawling here first.
• Use the correct reverse DNS > forward DNS lookup
when it’s important to be right. This can be
automated.
Filtering Scenario – IP Ranges
Anyone cloaking today
will have a good list.
You might find them at the bar.
“I Just Want
A Sample.”
The file is still too big.
I Want A Sample
The sort and split utilities do what you’d expect:
~$ sort -R combined.log | split -l 1048576
“Randomly sort the lines in the combined.log. split
the output of this command into multiple files (up to)
1048576 lines long.”
shuf is easier, but not default OSX/GOW.
A pipe ‘|’ takes the output of one
command as the input of another.
“I Just Want it
in Excel.”
I Just Want it in Excel.
Use wc to check it has fewer than 1,048,576 rows.
~$ wc -l sample.log
“Count the number of lines in sample.log.”
The Title of The Talk
Wasn’t a Lie And We
Aren’t Going to Use Excel
And Are Going to Answer
Questions Just Using The
Command Line I Hope
That’s OK. Sorry.
Asking Useful
Questions
Formulating Questions
Work a basic hypothesis. Decides what needs to be
done if it is true, false, or indeterminate before you
get the data.
“Google is ignoring robots.txt” may not be action
guiding, whilst “Googlebot is ignoring search console
parameter restrictions” just might be.
Formulating Questions
Some things just
aren’t very useful to
know.
Example Questions
How deep is Googlebot crawling?
Where is the wasted crawl? What proportion of
requests are currently wasted?
Where is Googlebot POSTing?
What are the most popular non-200/304 resources?
How many unique resources are being crawled?
Which is the more popular form of product page?
Which sitemap pages aren’t being crawled?
Always pivot with other data.
Getting Useful
Answers
AWK
AWK is a programming language focused on text
manipulation.
We are going to use it to print some columns from
our log files. That’s it.
Logs are space separated by default.
Awk uses spaces to define column numbers.
~$ awk ‘ {print $col_number1, $col_number2}’ [file]
ohgm.co.uk1
173.245.55.1232-3-4
[11/Apr/2016:10:23:425+0100]6
"GET7/8HTTP/1.1”920010681211“”12
"Mozilla/5.013(compatible;14Googlebot/2.1;15+http://ww
w.google.com/bot.html)“16173.245.55.123
AWK
~$ awk ‘{print $8, $10}’ Googlebot.log >> Gbot_responses.txt
“Output the requested resource and server response of
Googlebot.log to Gbot_responses.txt.”
/ 200
/robots.txt 304
/robots.txt 500
/amazing-blog-post 200
/forgotten-blog-post 404
/forbidden-blog-post 403
/ 200
Tailor the command to the access log format in use.
uniq
uniq takes text as an input and returns unique lines.
uniq -c returns these lines prefixed with a count.
uniq -d returns only repeated lines.
uniq -u returns only non-repeated lines.
AWK
Like grep, awk also matches patterns, using /pattern/.
~$ awk ‘/bingbot/ {print $10}’ combined.log | uniq -c
“Look for lines containing bingbot in the unfiltered logs and
print their server response codes. Deduplicate these and
return a summary.”
216663 - 302
109232 - 200
18395 - 301
2568 - 404
2147 - 304
274 - 500
261 - 403
Example Use:
Site Migrations
Ultimate Guide to Site Migrations
Get a big list of old URLs.
301 redirect them once to the
right places.
Make sure they get crawled.
Site Migrations
“We want a list of all URLs requested by Googlebot in
our pre-migration dataset, sorted by popularity
(number of requests).”
e.g.
/ 49587
/index.html 25169
/robots.txt 23334
/home 19417
Site Migrations
~$ awk ‘/Googlebot/ {print $7}’ combined.log |
uniq -c | sort -nr >> unique_requests.txt
“Take all access log requests, and filter to Googlebot.
Extract and output the requested resources.
Deduplicate these and return a summary.
Sort these by number in descending order.”
Site Migrations – Encouraging
Crawl
“We want our migration to switch as
quickly as possible.”
Get the list of redirects (URI stems) you want Google
to crawl into a file.
grep can use this file as the match criteria (lines
matching this OR this OR this).
Site Migrations – Encouraging
Crawl
We want the URLs Google has not yet crawled.
~$ grep -f wishlist.txt postmigration.log | awk
‘/Googlebot/ {print $8}’ | uniq >> wishlist-hits.txt
“Filter the post-migration log for lines that match
wishlist.txt. Return the resources requested by Googlebot.
Deduplicate and save.”
Site Migrations – Encouraging
Crawl
~$ cat wishlist-hits.txt wishlist.txt | uniq -u >>
uncrawled.txt
“Read the contents of both files. Save wishlist
entries that don’t appear in the access logs.”
Tip: use an indexing service like linklicious to encourage
crawling the uncrawled.
Taking This
Further
Keep Learning
Unix Utilities.
Learn SQL.
Also
These Techniques Apply to Other
SEO Activities.
Enterprise Link Audits.
Enterprise Keyword Research.
Enterprise Spamming.
Thanks.
Oliver Mason
Technical SEO Consultant
Twitter: @ohgm
Email: ohgm@ohgm.co.uk
Resources
GOW: https://github.com/bmatzelle/gow
Cygwin: http://cygwin.com/install.html
Command Line Crash Course:
http://cli.learncodethehardway.org/book/
Shameless links to my own stuff:
http://ohgm.co.uk/filter-server-logs-to-googlebot/
http://ohgm.co.uk/watch-googlebot-crawling/
http://ohgm.co.uk/preserve-link-equity-with-file-aliasing/
http://ohgm.co.uk/wayback-machine-seo/
Tools Used in this Talk
grep
sort
split
shuf
find
uniq
awk
wc
cowsay

More Related Content

What's hot

Building Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET CoreBuilding Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET Core
Stormpath
 
Webinar: MongoDB Connector for Spark
Webinar: MongoDB Connector for SparkWebinar: MongoDB Connector for Spark
Webinar: MongoDB Connector for Spark
MongoDB
 
Android Lab Test : Using the network with HTTP (english)
Android Lab Test : Using the network with HTTP (english)Android Lab Test : Using the network with HTTP (english)
Android Lab Test : Using the network with HTTP (english)
Bruno Delb
 
Investigating server logs
Investigating server logsInvestigating server logs
Investigating server logs
Animesh Shaw
 
Android and REST
Android and RESTAndroid and REST
Android and REST
Roman Woźniak
 
Rest services caching
Rest services cachingRest services caching
Rest services caching
Sperasoft
 
Django rest framework
Django rest frameworkDjango rest framework
Django rest framework
Blank Chen
 
Multi-threaded web crawler in Ruby
Multi-threaded web crawler in RubyMulti-threaded web crawler in Ruby
Multi-threaded web crawler in Ruby
Polcode
 
Combining Django REST framework & Elasticsearch
Combining Django REST framework & ElasticsearchCombining Django REST framework & Elasticsearch
Combining Django REST framework & Elasticsearch
Yaroslav Muravskyi
 
Understanding and testing restful web services
Understanding and testing restful web servicesUnderstanding and testing restful web services
Understanding and testing restful web services
mwinteringham
 
Understanding REST
Understanding RESTUnderstanding REST
Understanding REST
Nitin Pande
 
Web Crawling & Crawler
Web Crawling & CrawlerWeb Crawling & Crawler
Web Crawling & Crawler
Amir Masoud Sefidian
 
The never-ending REST API design debate
The never-ending REST API design debateThe never-ending REST API design debate
The never-ending REST API design debate
Restlet
 
Coding for a wget based Web Crawler
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web Crawler
Sanchit Saini
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
Partnered Health
 
Building Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET CoreBuilding Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET Core
Nate Barbettini
 
Gitminer 2.0 - Advance Search on Github
Gitminer 2.0 - Advance Search on GithubGitminer 2.0 - Advance Search on Github
Gitminer 2.0 - Advance Search on Github
Nullbyte Security Conference
 
Controlling crawler for better Indexation and Ranking
Controlling crawler for better Indexation and RankingControlling crawler for better Indexation and Ranking
Controlling crawler for better Indexation and Ranking
Rajesh Magar
 

What's hot (20)

Building Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET CoreBuilding Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET Core
 
Webinar: MongoDB Connector for Spark
Webinar: MongoDB Connector for SparkWebinar: MongoDB Connector for Spark
Webinar: MongoDB Connector for Spark
 
Android Lab Test : Using the network with HTTP (english)
Android Lab Test : Using the network with HTTP (english)Android Lab Test : Using the network with HTTP (english)
Android Lab Test : Using the network with HTTP (english)
 
Investigating server logs
Investigating server logsInvestigating server logs
Investigating server logs
 
Android and REST
Android and RESTAndroid and REST
Android and REST
 
Google Hacking Basics
Google Hacking BasicsGoogle Hacking Basics
Google Hacking Basics
 
Rest services caching
Rest services cachingRest services caching
Rest services caching
 
Django rest framework
Django rest frameworkDjango rest framework
Django rest framework
 
Multi-threaded web crawler in Ruby
Multi-threaded web crawler in RubyMulti-threaded web crawler in Ruby
Multi-threaded web crawler in Ruby
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Combining Django REST framework & Elasticsearch
Combining Django REST framework & ElasticsearchCombining Django REST framework & Elasticsearch
Combining Django REST framework & Elasticsearch
 
Understanding and testing restful web services
Understanding and testing restful web servicesUnderstanding and testing restful web services
Understanding and testing restful web services
 
Understanding REST
Understanding RESTUnderstanding REST
Understanding REST
 
Web Crawling & Crawler
Web Crawling & CrawlerWeb Crawling & Crawler
Web Crawling & Crawler
 
The never-ending REST API design debate
The never-ending REST API design debateThe never-ending REST API design debate
The never-ending REST API design debate
 
Coding for a wget based Web Crawler
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web Crawler
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
 
Building Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET CoreBuilding Beautiful REST APIs in ASP.NET Core
Building Beautiful REST APIs in ASP.NET Core
 
Gitminer 2.0 - Advance Search on Github
Gitminer 2.0 - Advance Search on GithubGitminer 2.0 - Advance Search on Github
Gitminer 2.0 - Advance Search on Github
 
Controlling crawler for better Indexation and Ranking
Controlling crawler for better Indexation and RankingControlling crawler for better Indexation and Ranking
Controlling crawler for better Indexation and Ranking
 

Viewers also liked

Server Log Files & Technical SEO Audits: What You Need to Know
Server Log Files & Technical SEO Audits: What You Need to KnowServer Log Files & Technical SEO Audits: What You Need to Know
Server Log Files & Technical SEO Audits: What You Need to Know
Samuel Scott
 
BrightonSEO - Biddable Media Session
BrightonSEO - Biddable Media SessionBrightonSEO - Biddable Media Session
BrightonSEO - Biddable Media Session
Saija Mahon
 
7 tools to turbo boost your SEO in 2017
7 tools to turbo boost your SEO in 20177 tools to turbo boost your SEO in 2017
7 tools to turbo boost your SEO in 2017
Mike Vigar
 
In Pursuit of UGC: The Power of Genuine Reviews
In Pursuit of UGC: The Power of Genuine ReviewsIn Pursuit of UGC: The Power of Genuine Reviews
In Pursuit of UGC: The Power of Genuine Reviews
Feefo
 
Brighton SEO - What It's Like Having GA Premium
Brighton SEO - What It's Like Having GA PremiumBrighton SEO - What It's Like Having GA Premium
Brighton SEO - What It's Like Having GA Premium
Arianne Donoghue
 
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
Authoritas
 
ADN Antreprenor 2016 - Felix Tataru, GMP
ADN Antreprenor 2016 - Felix Tataru, GMPADN Antreprenor 2016 - Felix Tataru, GMP
ADN Antreprenor 2016 - Felix Tataru, GMP
Gabriel Barliga
 
Social Media & Networking - The Evolving Workforce
Social Media & Networking - The Evolving WorkforceSocial Media & Networking - The Evolving Workforce
Social Media & Networking - The Evolving WorkforceTodd Wheatland
 
Towards a Grand Unified Theory of Systems Engineering (GUTSE)
Towards a Grand Unified Theory of Systems Engineering (GUTSE)Towards a Grand Unified Theory of Systems Engineering (GUTSE)
Towards a Grand Unified Theory of Systems Engineering (GUTSE)
Joseph KAsser
 
How to Find a Good Work-Life Balance
How to Find a Good Work-Life Balance How to Find a Good Work-Life Balance
How to Find a Good Work-Life Balance
Wording Well
 
Mobile application development |#Mobileapplicationdevelopment
Mobile application development |#MobileapplicationdevelopmentMobile application development |#Mobileapplicationdevelopment
Mobile application development |#Mobileapplicationdevelopment
Mobile App Developers India
 
Seguridad de la Informacion
Seguridad de la InformacionSeguridad de la Informacion
Seguridad de la Informacion
NFAG DTLF
 
Local Government Balances Security, Flexibility and Productivity with BlackBe...
Local Government Balances Security, Flexibility and Productivity with BlackBe...Local Government Balances Security, Flexibility and Productivity with BlackBe...
Local Government Balances Security, Flexibility and Productivity with BlackBe...
BlackBerry
 
M2 t1 planificador_aamtic version final numeral 5
M2 t1 planificador_aamtic  version final numeral 5M2 t1 planificador_aamtic  version final numeral 5
M2 t1 planificador_aamtic version final numeral 5
Polo Apolo
 
Through the Lens of an iPhone: Charleston, SC
Through the Lens of an iPhone: Charleston, SCThrough the Lens of an iPhone: Charleston, SC
Through the Lens of an iPhone: Charleston, SC
Paul Brown
 
18 TIPS TO-BE FOUNDERS
18 TIPS TO-BE FOUNDERS18 TIPS TO-BE FOUNDERS
18 TIPS TO-BE FOUNDERSAndre Marquet
 
6 Healthy Mother's Day Gifts
6 Healthy Mother's Day Gifts6 Healthy Mother's Day Gifts
6 Healthy Mother's Day Gifts
Eason Chan
 
Perunapuu esittely 4
Perunapuu esittely 4Perunapuu esittely 4
Perunapuu esittely 4
Miikka Leinonen
 

Viewers also liked (20)

Server Log Files & Technical SEO Audits: What You Need to Know
Server Log Files & Technical SEO Audits: What You Need to KnowServer Log Files & Technical SEO Audits: What You Need to Know
Server Log Files & Technical SEO Audits: What You Need to Know
 
BrightonSEO - Biddable Media Session
BrightonSEO - Biddable Media SessionBrightonSEO - Biddable Media Session
BrightonSEO - Biddable Media Session
 
7 tools to turbo boost your SEO in 2017
7 tools to turbo boost your SEO in 20177 tools to turbo boost your SEO in 2017
7 tools to turbo boost your SEO in 2017
 
In Pursuit of UGC: The Power of Genuine Reviews
In Pursuit of UGC: The Power of Genuine ReviewsIn Pursuit of UGC: The Power of Genuine Reviews
In Pursuit of UGC: The Power of Genuine Reviews
 
Brighton SEO - What It's Like Having GA Premium
Brighton SEO - What It's Like Having GA PremiumBrighton SEO - What It's Like Having GA Premium
Brighton SEO - What It's Like Having GA Premium
 
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
Nichola Stott – BrightonSEO April 2016: SEO SUX: How and Why UX Must Be Front...
 
ADN Antreprenor 2016 - Felix Tataru, GMP
ADN Antreprenor 2016 - Felix Tataru, GMPADN Antreprenor 2016 - Felix Tataru, GMP
ADN Antreprenor 2016 - Felix Tataru, GMP
 
Social Media & Networking - The Evolving Workforce
Social Media & Networking - The Evolving WorkforceSocial Media & Networking - The Evolving Workforce
Social Media & Networking - The Evolving Workforce
 
Research Design
Research DesignResearch Design
Research Design
 
Towards a Grand Unified Theory of Systems Engineering (GUTSE)
Towards a Grand Unified Theory of Systems Engineering (GUTSE)Towards a Grand Unified Theory of Systems Engineering (GUTSE)
Towards a Grand Unified Theory of Systems Engineering (GUTSE)
 
SFR Certification
SFR CertificationSFR Certification
SFR Certification
 
How to Find a Good Work-Life Balance
How to Find a Good Work-Life Balance How to Find a Good Work-Life Balance
How to Find a Good Work-Life Balance
 
Mobile application development |#Mobileapplicationdevelopment
Mobile application development |#MobileapplicationdevelopmentMobile application development |#Mobileapplicationdevelopment
Mobile application development |#Mobileapplicationdevelopment
 
Seguridad de la Informacion
Seguridad de la InformacionSeguridad de la Informacion
Seguridad de la Informacion
 
Local Government Balances Security, Flexibility and Productivity with BlackBe...
Local Government Balances Security, Flexibility and Productivity with BlackBe...Local Government Balances Security, Flexibility and Productivity with BlackBe...
Local Government Balances Security, Flexibility and Productivity with BlackBe...
 
M2 t1 planificador_aamtic version final numeral 5
M2 t1 planificador_aamtic  version final numeral 5M2 t1 planificador_aamtic  version final numeral 5
M2 t1 planificador_aamtic version final numeral 5
 
Through the Lens of an iPhone: Charleston, SC
Through the Lens of an iPhone: Charleston, SCThrough the Lens of an iPhone: Charleston, SC
Through the Lens of an iPhone: Charleston, SC
 
18 TIPS TO-BE FOUNDERS
18 TIPS TO-BE FOUNDERS18 TIPS TO-BE FOUNDERS
18 TIPS TO-BE FOUNDERS
 
6 Healthy Mother's Day Gifts
6 Healthy Mother's Day Gifts6 Healthy Mother's Day Gifts
6 Healthy Mother's Day Gifts
 
Perunapuu esittely 4
Perunapuu esittely 4Perunapuu esittely 4
Perunapuu esittely 4
 

Similar to Server Logs: After Excel Fails

Logstash
LogstashLogstash
Logstash
琛琳 饶
 
Linux intermediate level
Linux intermediate levelLinux intermediate level
Linux intermediate level
Madhavendra Dutt
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
jnewmanux
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
Kyle Banerjee
 
Go. Why it goes
Go. Why it goesGo. Why it goes
Go. Why it goes
Sergey Pichkurov
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
javier ramirez
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
Amazon Web Services Korea
 
The Ring programming language version 1.5.3 book - Part 7 of 184
The Ring programming language version 1.5.3 book - Part 7 of 184The Ring programming language version 1.5.3 book - Part 7 of 184
The Ring programming language version 1.5.3 book - Part 7 of 184
Mahmoud Samir Fayed
 
Catalyst - refactor large apps with it and have fun!
Catalyst - refactor large apps with it and have fun!Catalyst - refactor large apps with it and have fun!
Catalyst - refactor large apps with it and have fun!
mold
 
Boulder_OneStop_presentation
Boulder_OneStop_presentationBoulder_OneStop_presentation
Boulder_OneStop_presentationThomas Jaensch
 
Drupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp NorthDrupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp North
Philip Norton
 
Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4
Ilya Haykinson
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010Clay Helberg
 
The Ring programming language version 1.5.4 book - Part 7 of 185
The Ring programming language version 1.5.4 book - Part 7 of 185The Ring programming language version 1.5.4 book - Part 7 of 185
The Ring programming language version 1.5.4 book - Part 7 of 185
Mahmoud Samir Fayed
 
The Ring programming language version 1.5.1 book - Part 6 of 180
The Ring programming language version 1.5.1 book - Part 6 of 180The Ring programming language version 1.5.1 book - Part 6 of 180
The Ring programming language version 1.5.1 book - Part 6 of 180
Mahmoud Samir Fayed
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
Kyle Banerjee
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
Andres Baravalle
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
All Things Open
 

Similar to Server Logs: After Excel Fails (20)

Logstash
LogstashLogstash
Logstash
 
Linux intermediate level
Linux intermediate levelLinux intermediate level
Linux intermediate level
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Go. Why it goes
Go. Why it goesGo. Why it goes
Go. Why it goes
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
The Ring programming language version 1.5.3 book - Part 7 of 184
The Ring programming language version 1.5.3 book - Part 7 of 184The Ring programming language version 1.5.3 book - Part 7 of 184
The Ring programming language version 1.5.3 book - Part 7 of 184
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
Catalyst - refactor large apps with it and have fun!
Catalyst - refactor large apps with it and have fun!Catalyst - refactor large apps with it and have fun!
Catalyst - refactor large apps with it and have fun!
 
Boulder_OneStop_presentation
Boulder_OneStop_presentationBoulder_OneStop_presentation
Boulder_OneStop_presentation
 
Drupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp NorthDrupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp North
 
Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010
 
The Ring programming language version 1.5.4 book - Part 7 of 185
The Ring programming language version 1.5.4 book - Part 7 of 185The Ring programming language version 1.5.4 book - Part 7 of 185
The Ring programming language version 1.5.4 book - Part 7 of 185
 
The Ring programming language version 1.5.1 book - Part 6 of 180
The Ring programming language version 1.5.1 book - Part 6 of 180The Ring programming language version 1.5.1 book - Part 6 of 180
The Ring programming language version 1.5.1 book - Part 6 of 180
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 

Recently uploaded

Winning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis YuWinning local SEO in the Age of AI - Dennis Yu
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User JourneysMastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
Search Engine Journal
 
How to Run Landing Page Tests On and Off Paid Social Platforms
How to Run Landing Page Tests On and Off Paid Social PlatformsHow to Run Landing Page Tests On and Off Paid Social Platforms
How to Run Landing Page Tests On and Off Paid Social Platforms
VWO
 
Digital Marketing Training In Bangalore
Digital Marketing Training In BangaloreDigital Marketing Training In Bangalore
Digital Marketing Training In Bangalore
syedasifsyed46
 
10 Videos Any Business Can Make Right Now! - Shelly Nathan
10 Videos Any Business Can Make Right Now! - Shelly Nathan10 Videos Any Business Can Make Right Now! - Shelly Nathan
10 Videos Any Business Can Make Right Now! - Shelly Nathan
DigiMarCon - Digital Marketing, Media and Advertising Conferences & Exhibitions
 
Winning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis YuWinning local SEO in the Age of AI - Dennis Yu
Marketing as a Primary Revenue Driver - Lee Levitt
Marketing as a Primary Revenue Driver - Lee LevittMarketing as a Primary Revenue Driver - Lee Levitt
FullSail: HOF - Presentation Phil Pallen
FullSail: HOF - Presentation Phil PallenFullSail: HOF - Presentation Phil Pallen
FullSail: HOF - Presentation Phil Pallen
travisomalana
 
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdfOffissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
offisadizayn
 
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
DigiMarCon - Digital Marketing, Media and Advertising Conferences & Exhibitions
 
Core Web Vitals SEO Workshop - improve your performance [pdf]
Core Web Vitals SEO Workshop - improve your performance [pdf]Core Web Vitals SEO Workshop - improve your performance [pdf]
Core Web Vitals SEO Workshop - improve your performance [pdf]
Peter Mead
 
Top 3 Ways to Align Sales and Marketing Teams for Rapid Growth
Top 3 Ways to Align Sales and Marketing Teams for Rapid GrowthTop 3 Ways to Align Sales and Marketing Teams for Rapid Growth
Top 3 Ways to Align Sales and Marketing Teams for Rapid Growth
Demandbase
 
De-risk Your Digital Evolution - Hannah Grap
De-risk Your Digital Evolution - Hannah GrapDe-risk Your Digital Evolution - Hannah Grap
A Guide to UK Top Search Engine Optimization
A Guide to UK Top Search Engine OptimizationA Guide to UK Top Search Engine Optimization
A Guide to UK Top Search Engine Optimization
Brand Highlighters
 
15 ideas and frameworks on the art of storytelling
15 ideas and frameworks on the art of storytelling15 ideas and frameworks on the art of storytelling
15 ideas and frameworks on the art of storytelling
Aatir Abdul Rauf
 
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
levuag
 
Coca Cola Branding Strategy and strategic marketing plan
Coca Cola Branding Strategy and strategic marketing planCoca Cola Branding Strategy and strategic marketing plan
Coca Cola Branding Strategy and strategic marketing plan
Maswer Ali
 
Monthly Social Media News Update May 2024
Monthly Social Media News Update May 2024Monthly Social Media News Update May 2024
Monthly Social Media News Update May 2024
Andy Lambert
 
ThinkNow 2024 Consumer Financial Wellness Report
ThinkNow 2024 Consumer Financial Wellness ReportThinkNow 2024 Consumer Financial Wellness Report
ThinkNow 2024 Consumer Financial Wellness Report
ThinkNow
 
The New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
The New Era Of SEO - How AI Has Changed SEO Forever - Danny LeibrandtThe New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
The New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
DigiMarCon - Digital Marketing, Media and Advertising Conferences & Exhibitions
 

Recently uploaded (20)

Winning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis YuWinning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis Yu
 
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User JourneysMastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
Mastering Multi-Touchpoint Content Strategy: Navigate Fragmented User Journeys
 
How to Run Landing Page Tests On and Off Paid Social Platforms
How to Run Landing Page Tests On and Off Paid Social PlatformsHow to Run Landing Page Tests On and Off Paid Social Platforms
How to Run Landing Page Tests On and Off Paid Social Platforms
 
Digital Marketing Training In Bangalore
Digital Marketing Training In BangaloreDigital Marketing Training In Bangalore
Digital Marketing Training In Bangalore
 
10 Videos Any Business Can Make Right Now! - Shelly Nathan
10 Videos Any Business Can Make Right Now! - Shelly Nathan10 Videos Any Business Can Make Right Now! - Shelly Nathan
10 Videos Any Business Can Make Right Now! - Shelly Nathan
 
Winning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis YuWinning local SEO in the Age of AI - Dennis Yu
Winning local SEO in the Age of AI - Dennis Yu
 
Marketing as a Primary Revenue Driver - Lee Levitt
Marketing as a Primary Revenue Driver - Lee LevittMarketing as a Primary Revenue Driver - Lee Levitt
Marketing as a Primary Revenue Driver - Lee Levitt
 
FullSail: HOF - Presentation Phil Pallen
FullSail: HOF - Presentation Phil PallenFullSail: HOF - Presentation Phil Pallen
FullSail: HOF - Presentation Phil Pallen
 
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdfOffissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
Offissa Dizayn - Otel, Kafe, Restoran Kataloqu_240603_011042.pdf
 
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
Unknown to Unforgettable - The Art and Science to Being Irresistible on Camer...
 
Core Web Vitals SEO Workshop - improve your performance [pdf]
Core Web Vitals SEO Workshop - improve your performance [pdf]Core Web Vitals SEO Workshop - improve your performance [pdf]
Core Web Vitals SEO Workshop - improve your performance [pdf]
 
Top 3 Ways to Align Sales and Marketing Teams for Rapid Growth
Top 3 Ways to Align Sales and Marketing Teams for Rapid GrowthTop 3 Ways to Align Sales and Marketing Teams for Rapid Growth
Top 3 Ways to Align Sales and Marketing Teams for Rapid Growth
 
De-risk Your Digital Evolution - Hannah Grap
De-risk Your Digital Evolution - Hannah GrapDe-risk Your Digital Evolution - Hannah Grap
De-risk Your Digital Evolution - Hannah Grap
 
A Guide to UK Top Search Engine Optimization
A Guide to UK Top Search Engine OptimizationA Guide to UK Top Search Engine Optimization
A Guide to UK Top Search Engine Optimization
 
15 ideas and frameworks on the art of storytelling
15 ideas and frameworks on the art of storytelling15 ideas and frameworks on the art of storytelling
15 ideas and frameworks on the art of storytelling
 
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
20221005110010_633d63baa84f6_learn___week_3_ch._5.pdf
 
Coca Cola Branding Strategy and strategic marketing plan
Coca Cola Branding Strategy and strategic marketing planCoca Cola Branding Strategy and strategic marketing plan
Coca Cola Branding Strategy and strategic marketing plan
 
Monthly Social Media News Update May 2024
Monthly Social Media News Update May 2024Monthly Social Media News Update May 2024
Monthly Social Media News Update May 2024
 
ThinkNow 2024 Consumer Financial Wellness Report
ThinkNow 2024 Consumer Financial Wellness ReportThinkNow 2024 Consumer Financial Wellness Report
ThinkNow 2024 Consumer Financial Wellness Report
 
The New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
The New Era Of SEO - How AI Has Changed SEO Forever - Danny LeibrandtThe New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
The New Era Of SEO - How AI Has Changed SEO Forever - Danny Leibrandt
 

Server Logs: After Excel Fails

  • 1. Prepare for walls of text. Server Logs After Excel Fails @ohgm
  • 2. About Me • Former Senior Technical Consultant @ builtvisible. • Now Freelance Technical SEO Consultant. • @ohgm on Twitter. • ohgm.co.uk for my webzone.
  • 3. What I’d like to do today 1. Talk about access logs. 2. Show you some command line tools. 3. Show you some ways to apply these tools to common scenarios. 4. Sit back down. This talk is on the first significant difficulty spike in server log analysis – having too much information.
  • 5. Assumptions 1. Your client is retaining their logs. 2. You don’t have access to your client’s server.
  • 6. What is an Access.log?
  • 7.
  • 8. ohgm.co.uk 162.158.93.95 - - [11/Apr/2016:10:14:20 +0100] "GET /wmt-crawl- representative-url-transfer-link-equity/ HTTP/1.1" 200 7976 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" 162.158.93.95 ohgm.co.uk 108.162.219.171 - - [11/Apr/2016:10:15:07 +0100] "GET /feed/ HTTP/1.1" 200 136953 "-" "Flamingo_SearchEngine (+http://www.flamingosearch.com/bot)" 108.162.219.171 ohgm.co.uk 108.162.219.176 - - [11/Apr/2016:10:22:54 +0100] "GET /wayback- machine-seo HTTP/1.1" 200 9079 "http://www.traackr.com/" "Traackr.com" 108.162.219.176 ohgm.co.uk 173.245.55.114 - - [11/Apr/2016:10:23:35 +0100] "GET /author/ohgm/ HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 173.245.55.114 ohgm.co.uk 173.245.55.123 - - [11/Apr/2016:10:23:42 +0100] "GET / HTTP/1.1" 200 6812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 173.245.55.123
  • 9. ohgm.co.uk1 173.245.55.1232 - - [11/Apr/2016:10:23:42 +01003] "GET4 /please5 HTTP/1.16" 2007 68128 "-9" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)10" 173.245.55.12311 1. The host responding to the request. 2. The IP that serviced the request. 3. The date and time of the request. 4. The HTTP method: GET, POST, PUT, HEAD, or DELETE. 5. The resource requested. 6. The HTTP Version {HTTP/1.0|HTTP/1.1|HTTP/2} 7. The server response. 8. The download size. 9. The referring URL. 10. The reported User-Agent. 11. The IP that made the request. Configurations vary substantially.
  • 10. Why SEOs Like Them. There is a lack of overlap between server logs and crawl simulation tools. Access logs show what’s being accessed rather than what’s simply accessible. We find correlation between crawl efficiency improvements and organic performance. Access logs are one of the best tools for identifying crawl waste.
  • 11. Why ‘Excel Fails’? Microsoft Excel currently supports 1,048,576 rows of data. There are no plans to increase this.
  • 12. Agency Scenario Your manager has sold a Server Log Analysis project, requesting 1 month of access logs from the client, a UK high street retailer. You receive 15 access_log.gz files, totalling 17.6GB. Excel won’t open any of them. You don’t know it yet, but they are unfiltered. Good Luck.
  • 13. We also load balance on 6 servers.
  • 14.
  • 15. “Just use a sample.”
  • 16. “How do I even get a sample?”
  • 18.
  • 19. Advantages of Command Line Tools. • They’re fast. • They’re not in the cloud. • The main limit is you, not a development queue.
  • 20. Disadvantages of Command Line Tools. • They’re scary at first. • You can delete your computer. • Don’t delete your computer.
  • 22. If you’re on Mac, you’re ready.
  • 23. If you’re on Linux, you’re ready.
  • 24. If you’re on Windows, you probably aren’t ready*. *Unless ‘Ubuntu on Windows’ becomes part of the non-developer release.
  • 25. 1. Windows Update > Update Settings > Advanced > Get Insider Preview Builds. 2. Install Build 14316 or greater. 3. Enable ‘Windows Subsystem for Linux (Beta)’. 4. Open cmd and type ‘bash’. 5. Type ‘y’ and hit enter at the prompt.
  • 26. 1. Windows Update > Update Settings > Advanced > Get Insider Preview Builds. 2. Install Build 14316 or greater. 3. Enable ‘Windows Subsystem for Linux (Beta)’. 4. Open cmd and type ‘bash’. 5. Type ‘y’ and hit enter at the prompt. No thanks.
  • 27. Install GNU ON WINDOWS (or Cygwin) instead.
  • 28.
  • 30. Getting Started • Navigate to the folder containing the downloaded files. • Open your chosen terminal (cmd, terminal, or bash). CTRL+SHIFT+Rclick inside a folder is an alternate method.
  • 31.
  • 34. The Title of The Talk Was a Lie and We’re Going to Try to Use Excel Anyway. Sorry.
  • 35. Sorry about the walls of text. Server Logs Until Excel Fails @ohgm
  • 37. Combine Multiple Log Files We navigate to a folder containing all our server logs, open the terminal, and type: ~$ cat *.log >> combined.log “Take every .log file in the folder and append each to combined.log”
  • 38. But they gave me files in lots of different folders.
  • 39. Combine Multiple Files in Multiple Folders ~$ find . -name ‘*.log’ -exec cat {} >> combined.log ; “Search the current folder, and all subfolders for filenames ending with ‘.log’. Append the contents of these files to a new file called combined.log.” gfind in GOW
  • 41. Combine Multiple Files in Multiple Folders Some of Which are compressed ~$ find . -name *.gz -exec gzip -dkr {} + && find . -name ‘*.log’ -exec cat {} >> combined.log ; “Find all the files with the .gz extension beneath the current folder. Recursively Decompress all files. Keep the originals. Once finished, find all the .log files, append them to a new combined.log file.”
  • 42. Preview Huge Files with less less streams the contents of a file to the terminal without loading the whole file into memory. $~ less combined.log You can use less to review access logs without crippling your machine.
  • 43.
  • 45. RTFM If at any time you get stuck: ~$ toolname --help or ~$ man toolname or Google what you are trying to do. The --help ( often -h) flag will usually give you what you need to know. ‘man’ (manual) tends to be much more in depth. Both are read from the command line.
  • 46. We now have one large file.
  • 47. UA Filtering: Googlebot Our combined access logs are in a single file: combined.log – 16.4GB Too large to open in Excel. Too large to open in Notepad. Examining it with less? It’s too full of filthy human data. We need to cut it down to Googlebot.
  • 48. grep is a tool that extracts lines from text files based on a regular expression. Using grep is pretty simple: ~$ grep [options] [pattern] [file] … ~$ grep ‘Googlebot’ combined.log “Give me all the lines containing Googlebot in combined.log” Press Enter. grep
  • 49. We forgot to store it somewhere.
  • 50. Filtering Scenario: Googlebot So we’ll store this output to a new file using ‘>>’: ~$ grep ‘Googlebot’ combined.log >> googlebot.log “Append all lines in combined.log that contain Googlebot into a new file, googlebot.log”
  • 51. Like other tools, grep has a number of optional argument flags. The count flag ‘-c’ can provide a useful summary for direct questions: ~$ grep [options] [pattern] [file] … ~$ grep -c “POST /wp-login” april.log “Show me the count of login attempts in April on ohgm.co.uk” grep
  • 52.
  • 53. Filtering Scenario – Googlebot You can’t just verify Googlebot by name. Apparently some people aren’t honest on the internet.
  • 55. Filtering Scenario – IP Ranges Start End 64.233.160.0 64.233.191.255 66.102.0.0 66.102.15.255 66.249.64.0 66.249.95.255 72.14.192.0 72.14.255.255 74.125.0.0 74.125.255.255 209.85.128.0 209.85.255.255 216.239.32.0 216.239.63.255 If we were masochistic, we could write a regular expression to capture these all…
  • 56. Filtering Scenario – IP Ranges The -E flag lets grep use Extended Regular Expressions. ~$ grep -E "((b(64).233.(1([6-8][0-9]|9[0- 1])))|(b(66).102.([0-9]|1[0-5]))|(b(66).249.(6[4- 9]|[7-8][0-9]|9[0-5]))|(b(72).14.(1(9[2-9])|2([0-4][0- 9]|5[0-5])))|(b(74).125.(25[0-5]|2[0-4][0-9]|[01]?[0- 9][0-9]?))|(209.85.(1(2[8-9]|[3-9][0-9])|2([0-4][0- 9]|5[0-5])))|(216.239.(25[0-5]|2[0-4][0-9]|[01]?[0- 9][0-9]?))).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" GbotUA.log > GbotIP.log This shouldn’t work, but it does*. *WOMM
  • 57. Filtering Scenario – Impostors The -v flag inverts the grep query to find impostors: ~$ grep -vE "((b(64).233.(1([6-8][0-9]|9[0- 1])))|(b(66).102.([0-9]|1[0-5]))|(b(66).249.(6[4-9]|[7- 8][0-9]|9[0-5]))|(b(72).14.(1(9[2-9])|2([0-4][0-9]|5[0- 5])))|(b(74).125.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0- 9]?))|(209.85.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0- 5])))|(216.239.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0- 9]?))).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" GbotUA.log > Impostors.log “Give me every request that claims to be Googlebot, but doesn’t come from this IP range. Put them in an impostors file.”
  • 58. Filtering Scenario – Verifying Googlebot • Disclaimer: don’t blindly use awful regex (check with Regexr) or IP ranges, especially if you’re analysing logs for a site using IP detection for international SEO purposes. Read more about Googlebot’s Geo-distributed Crawling here first. • Use the correct reverse DNS > forward DNS lookup when it’s important to be right. This can be automated.
  • 59. Filtering Scenario – IP Ranges Anyone cloaking today will have a good list. You might find them at the bar.
  • 60. “I Just Want A Sample.” The file is still too big.
  • 61.
  • 62. I Want A Sample The sort and split utilities do what you’d expect: ~$ sort -R combined.log | split -l 1048576 “Randomly sort the lines in the combined.log. split the output of this command into multiple files (up to) 1048576 lines long.” shuf is easier, but not default OSX/GOW. A pipe ‘|’ takes the output of one command as the input of another.
  • 63. “I Just Want it in Excel.”
  • 64.
  • 65. I Just Want it in Excel. Use wc to check it has fewer than 1,048,576 rows. ~$ wc -l sample.log “Count the number of lines in sample.log.”
  • 66.
  • 67.
  • 68.
  • 69.
  • 70. The Title of The Talk Wasn’t a Lie And We Aren’t Going to Use Excel And Are Going to Answer Questions Just Using The Command Line I Hope That’s OK. Sorry.
  • 72. Formulating Questions Work a basic hypothesis. Decides what needs to be done if it is true, false, or indeterminate before you get the data. “Google is ignoring robots.txt” may not be action guiding, whilst “Googlebot is ignoring search console parameter restrictions” just might be.
  • 73. Formulating Questions Some things just aren’t very useful to know.
  • 74. Example Questions How deep is Googlebot crawling? Where is the wasted crawl? What proportion of requests are currently wasted? Where is Googlebot POSTing? What are the most popular non-200/304 resources? How many unique resources are being crawled? Which is the more popular form of product page? Which sitemap pages aren’t being crawled? Always pivot with other data.
  • 76. AWK AWK is a programming language focused on text manipulation. We are going to use it to print some columns from our log files. That’s it.
  • 77. Logs are space separated by default. Awk uses spaces to define column numbers. ~$ awk ‘ {print $col_number1, $col_number2}’ [file] ohgm.co.uk1 173.245.55.1232-3-4 [11/Apr/2016:10:23:425+0100]6 "GET7/8HTTP/1.1”920010681211“”12 "Mozilla/5.013(compatible;14Googlebot/2.1;15+http://ww w.google.com/bot.html)“16173.245.55.123
  • 78. AWK ~$ awk ‘{print $8, $10}’ Googlebot.log >> Gbot_responses.txt “Output the requested resource and server response of Googlebot.log to Gbot_responses.txt.” / 200 /robots.txt 304 /robots.txt 500 /amazing-blog-post 200 /forgotten-blog-post 404 /forbidden-blog-post 403 / 200 Tailor the command to the access log format in use.
  • 79. uniq uniq takes text as an input and returns unique lines. uniq -c returns these lines prefixed with a count. uniq -d returns only repeated lines. uniq -u returns only non-repeated lines.
  • 80. AWK Like grep, awk also matches patterns, using /pattern/. ~$ awk ‘/bingbot/ {print $10}’ combined.log | uniq -c “Look for lines containing bingbot in the unfiltered logs and print their server response codes. Deduplicate these and return a summary.” 216663 - 302 109232 - 200 18395 - 301 2568 - 404 2147 - 304 274 - 500 261 - 403
  • 82. Ultimate Guide to Site Migrations Get a big list of old URLs. 301 redirect them once to the right places. Make sure they get crawled.
  • 83. Site Migrations “We want a list of all URLs requested by Googlebot in our pre-migration dataset, sorted by popularity (number of requests).” e.g. / 49587 /index.html 25169 /robots.txt 23334 /home 19417
  • 84. Site Migrations ~$ awk ‘/Googlebot/ {print $7}’ combined.log | uniq -c | sort -nr >> unique_requests.txt “Take all access log requests, and filter to Googlebot. Extract and output the requested resources. Deduplicate these and return a summary. Sort these by number in descending order.”
  • 85. Site Migrations – Encouraging Crawl “We want our migration to switch as quickly as possible.” Get the list of redirects (URI stems) you want Google to crawl into a file. grep can use this file as the match criteria (lines matching this OR this OR this).
  • 86. Site Migrations – Encouraging Crawl We want the URLs Google has not yet crawled. ~$ grep -f wishlist.txt postmigration.log | awk ‘/Googlebot/ {print $8}’ | uniq >> wishlist-hits.txt “Filter the post-migration log for lines that match wishlist.txt. Return the resources requested by Googlebot. Deduplicate and save.”
  • 87. Site Migrations – Encouraging Crawl ~$ cat wishlist-hits.txt wishlist.txt | uniq -u >> uncrawled.txt “Read the contents of both files. Save wishlist entries that don’t appear in the access logs.” Tip: use an indexing service like linklicious to encourage crawling the uncrawled.
  • 90. Also
  • 91. These Techniques Apply to Other SEO Activities. Enterprise Link Audits. Enterprise Keyword Research. Enterprise Spamming.
  • 92. Thanks. Oliver Mason Technical SEO Consultant Twitter: @ohgm Email: ohgm@ohgm.co.uk
  • 93. Resources GOW: https://github.com/bmatzelle/gow Cygwin: http://cygwin.com/install.html Command Line Crash Course: http://cli.learncodethehardway.org/book/ Shameless links to my own stuff: http://ohgm.co.uk/filter-server-logs-to-googlebot/ http://ohgm.co.uk/watch-googlebot-crawling/ http://ohgm.co.uk/preserve-link-equity-with-file-aliasing/ http://ohgm.co.uk/wayback-machine-seo/
  • 94. Tools Used in this Talk grep sort split shuf find uniq awk wc cowsay