Log Analysis and PRO Use Cases
for Search Marketers
Dave Sottimano - Untagged.io - Madrid 2016
Prepare yourself for bullet point hell.
It’s meant for reading:
bit.ly/untagged2016
Lo siento :(
You know what makes me sad?
Incomplete data.
Inflated & ambiguous stats
80,000
80,000
https://support.google.com/webmasters/answer/35253?hl=en
Seems broken. It is broken: this is actually an image
search result, and it has been ranking throughout the entire
time period.
???
But hey, reporting stats for the entire
internet isn’t easy.
So, thank you Google.
...but we need better data.
Why server log analysis is so important:
How do we try to increase crawl frequency?
● Increase external link count (including links from social sites)
● List valuable pages in sitemaps and ping Google
● Increase internal link count (crawl paths)
● Create new pages, and update older pages (avoid stagnation)
● Ensure pages are unique; reduce internal duplication
● Avoid internally linking to redirects or broken pages
● Testing. Lots of testing.
What actions do SEOs take from log analysis?
● Optimize Googlebot crawl
○ restructure link architecture, apply directives, block via robots.txt
● Find server errors or Googlebot induced errors
○ Try to fix any 4xx, 5xx error codes
○ Use browser user-agent and referer fields to uncover the source of errors
● Understand Googlebot crawl rate & behaviour for SEO testing
○ Helpful for testing, gathering insights, and constantly questioning best practices
● Block badly behaving bots, prevent bandwidth drain
○ Look for hotlinking bandwidth drain, e.g. images hotlinked from porn sites
● Find unreported links through referer fields
○ Link crawlers don’t find every link; server logs are necessary for comprehensive audits
● Double check Analytics data
○ Helpful for correcting analytics setup or understanding why referers aren’t passed correctly
The hard part:
Getting the right data and merging it.
Step 1: Get the right fields logged
206.248.146.167 - - [25/Aug/2015:06:50:01 +0000] "GET /shoes HTTP/1.0" 200 251
"https://www.google.ca/" "example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
Fields to log: IP address, date/time, method, page, response code, response time, referer, hostname, user agent
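For reference, a minimal parsing sketch of that line in Python, assuming the combined-style format shown above with the hostname logged between the referer and the user agent (adjust the regex to your own server's log format):

```python
import re

# Assumed layout: IP, identd, user, [date/time], "method page protocol",
# status, size (or time-taken in some IIS custom formats), "referer", "hostname", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<hostname>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('206.248.146.167 - - [25/Aug/2015:06:50:01 +0000] "GET /shoes HTTP/1.0" 200 251 '
        '"https://www.google.ca/" "example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"')

match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())  # ip, datetime, method, page, status, size, referer, hostname, user_agent
```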
Step 2: Ensure the correct originating IP is logged
Load balancers, proxies or CDNs may overwrite the original IP of the request. Use the
X-Forwarded-For header to ensure you have the original IP.
IIS: http://www.loadbalancer.org/blog/iis-and-x-forwarded-for-header
Apache: http://www.loadbalancer.org/blog/apache-and-x-forwarded-for-headers
Nginx: https://easyengine.io/tutorials/nginx/forwarding-visitors-real-ip/
CloudFlare:
https://support.cloudflare.com/hc/en-us/sections/200805497-Restoring-Visitor-IPs
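If the client IP ends up in an X-Forwarded-For chain rather than in the first log field, a minimal sketch for recovering it (the left-most entry is conventionally the original client; treat this as illustrative, since proxy setups vary):

```python
def original_client_ip(remote_addr, x_forwarded_for=None):
    """Prefer the left-most X-Forwarded-For entry (the original client),
    falling back to the directly connected address."""
    if x_forwarded_for:
        # X-Forwarded-For: client, proxy1, proxy2, ...
        return x_forwarded_for.split(",")[0].strip()
    return remote_addr

print(original_client_ip("10.0.0.5", "66.249.65.63, 10.0.0.5"))  # -> 66.249.65.63
```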
Step 3: Ensure we have all of the logs
● Triple-check the hostname! If you’re analyzing example.com (the desktop site, for
instance), ensure you’re not counting the mobile version (m.example.com) or
other subdomains (forum.example.com). Be very careful to get the right data
or you will pull your hair out. Ask the system administrators!
● If the server stores cached copies and serves them from another server, get
those logs too and combine them for the target domain analysis.
● Too much data? Ask for selective logging for Googlebot user agent only
Step 4: Parse the logs, grab Googlebot entries
https://www.splunk.com/en_us/download/splunk-light-2.html
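If you'd rather script this step than use Splunk, a minimal sketch that keeps only entries whose user agent claims to be Googlebot, re-using the LOG_PATTERN sketch from Step 1 (the claim still needs DNS verification in the next step; the file name is illustrative):

```python
def googlebot_entries(log_path):
    """Yield parsed log entries whose user agent claims to be Googlebot."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_PATTERN.match(line)  # regex from the Step 1 sketch
            if match and "Googlebot" in match.group("user_agent"):
                yield match.groupdict()

for entry in googlebot_entries("access.log"):  # illustrative file name
    print(entry["ip"], entry["page"], entry["status"])
```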
Step 5: Verify Googlebot entries by DNS
1. Segment out logs with user-agent: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)
2. Take the original IP in the logs, example: 66.249.65.63
3. Reverse DNS lookup: crawl-66-249-65-63.googlebot.com
4. DNS lookup: 66.249.65.63 (confirmed!)
https://support.google.com/webmasters/answer/80553?hl=en
Software I use: http://www.nirsoft.net/utils/ipnetinfo.html
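The same forward-confirmed reverse DNS check described on Google's help page can be scripted; a minimal sketch using Python's standard socket module:

```python
import socket

def is_verified_googlebot(ip):
    """Reverse lookup must end in googlebot.com or google.com, and the
    resulting hostname must resolve back to the same IP (forward-confirmed)."""
    try:
        host = socket.gethostbyaddr(ip)[0]             # e.g. crawl-66-249-65-63.googlebot.com
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward lookup back to IPs
    except (socket.herror, socket.gaierror):
        return False

print(is_verified_googlebot("66.249.65.63"))  # True for a genuine Googlebot IP
```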
Note to self: look out for Google mobile user agents
Mozilla/5.0+(iPhone;+CPU+iPhone+OS+6_0+like+Mac+OS+X)+AppleWebKit/536.26
+(KHTML,+like+Gecko)+Version/6.0+Mobile/10A5376e+Safari/8536.25+(compatible;
+Googlebot/2.1;++http://www.google.com/bot.html)
This is a verified Googlebot from 66.249.65.63, but it’s not listed on the official
crawlers page.
Official Google: Mobile-first Indexing
Step 6: Merge crawl data with clean logs
● Crawl as: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html) and a popular browser user agent
● Crawler config: Disobey Robots.txt, crawl all non-HTML, crawl internal
nofollow, crawl canonicals & sitemaps, ideally JS enabled
● Fields required: URL, response code, title, robots directives (blocked,
noindex, nofollow, etc.), canonical, page size, response time, crawl level,
number of internal links to the page
Try DeepCrawl for free bit.ly/freecrawl - 25,000 credits for Untagged.io
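A minimal merging sketch with pandas, assuming a crawl export with url and indexable columns and the DNS-verified log entries from the earlier sketches (all file and column names here are illustrative; watch that log paths and crawl URLs are normalised the same way before joining):

```python
import pandas as pd

# Hits from the earlier sketches: parsed entries whose IPs passed DNS verification.
verified = [e for e in googlebot_entries("access.log") if is_verified_googlebot(e["ip"])]
logs = pd.DataFrame(verified)

crawl = pd.read_csv("crawl_export.csv")  # assumed columns: url, indexable (boolean), response_code, ...

# Make log paths comparable with the full URLs in the crawl export (adjust to the real hostname).
logs["url"] = "http://www.example.com" + logs["page"]

# Count Googlebot hits per URL and join onto the crawl data.
hits = logs.groupby("url").size().rename("googlebot_hits").reset_index()
merged = crawl.merge(hits, on="url", how="left")
merged["googlebot_hits"] = merged["googlebot_hits"].fillna(0).astype(int)

# Indexable pages Googlebot never touched during the sample period.
print(merged.loc[merged["indexable"] & (merged["googlebot_hits"] == 0), "url"])
```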
Step 7: Add web analytics data
● Ensure the URLs correspond correctly (special characters, full URL)
● Ensure the date period is exactly the same as the server log period
● Use data from source/medium = google / organic only
DeepCrawl can merge both crawl data and analytics data from Google Analytics
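Continuing the same sketch, the analytics export for the identical date range can be filtered to google / organic before joining (file and column names again illustrative):

```python
# Google Analytics export for the same date range as the server logs.
ga = pd.read_csv("analytics_export.csv")  # assumed columns: landing_page, source_medium, sessions
ga = ga[ga["source_medium"] == "google / organic"]

merged = merged.merge(ga[["landing_page", "sessions"]],
                      left_on="url", right_on="landing_page", how="left")
merged["sessions"] = merged["sessions"].fillna(0).astype(int)
```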
So far...
● Have all logs from the right host with the right
fields
● Have the original IP addresses
● Confirmed real Googlebot visits
● Merged crawl data and analytics data perfectly
Just when you think all the data is correct,
something will go wrong, guaranteed ;)
Real example, small site:
http://www.campgroundsigns.com/
7 million events from the load balancer (IIS custom-format access logs) = 1.6 GB of data
13,000 Googlebot events over 28 days
1,129 pages are indexable on
campgroundsigns.com
Caveat!
The following observations are based on 1 small website. They
apply to this site only and are not representative.
Each website and its Googlebot crawl activity are different.
Special thanks to campgroundsigns.com for volunteering for
the analysis
What is Google crawling?
What we wanted crawled vs What Google
crawled
Based on a 28 day sample
What did Google crawl by Page Type?
How we determine page types
By URL (if possible):
example.com/products/product-123
By unique HTML template footprint (recommended):
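For the URL-based approach, a minimal classification sketch; the patterns below are invented for illustration (campgroundsigns.com's real URL structure will differ), and the template-footprint approach would look for a distinctive snippet in the fetched HTML instead:

```python
import re

# Illustrative rules only; replace the patterns with the site's real URL structure.
URL_RULES = [
    (re.compile(r"^/$"), "Homepage"),
    (re.compile(r"^/products?/"), "Product page"),
    (re.compile(r"\.(js|css|png|jpe?g|gif|svg)$", re.I), "Resource"),
]

def page_type(path):
    """Return a page-type label for a request path, defaulting to a catch-all."""
    for pattern, label in URL_RULES:
        if pattern.search(path):
            return label
    return "Department or other page"

print(page_type("/products/product-123"))  # -> Product page
```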
How does crawl
affect traffic?
Googlebot crawl and Organic Traffic
Based on a 28 day sample
Is Googlebot using
its crawl wisely?
Of those pages, what were the response codes?
The 4% of 410 errors are actually caused by
Google trying to render JavaScript.
That’s 127 wasted page crawls per 28-day period.
How often did Google crawl NOINDEX pages?
Did Google crawl the right pages?
Indexable is defined as: response code 200, no robots.txt block, self-referencing canonical or no canonical in the head or HTTP header, no noindex
directives in the head or HTTP header, no directives applied in GSC parameter config, no removal request, and not a JS/CSS or other resource file.
Not indexable: a non-200 response or failing any one of the above.
Generally, we see reduced crawl activity on
pages with NOINDEX.
There’s something wrong.
PLA = Product listing Ad.
We tried to block the PLA
pages to divert attention
to important pages:
Based on a 4-day, Monday-Thursday period before and after the block
Errr, go back, quick.
                           All requests        Unique pages crawled
                           Before   After      Before   After
PLA (blocked by robots)      1334       0         703       0
Department or other page      404     212         270     124
Product page                  605     247         452     177
Resource                      332     406          50      61
Homepage                       15      15           1       1
Totals                       2690     880        1476     363
Difference                        -67%                 -75%
Turns out, Google uses its regular Googlebot
crawler to crawl them, not AdsBot.
It was a mistake blocking these. We’ll try
canonicals next.
https://support.google.com/merchants/answer/160156?hl=en
Insights:
If Googlebot crawls a
page, is it always indexed?
No. Out of a sample of 691 pages, 12 were crawled and not
indexed.
Insights:
If a page gets Google
organic traffic, is it always
indexed?
No. Of the 225 pages that received at least 1 Google organic visit
during the time period, not all were indexed:
Read why: http://bit.ly/2faEoA9
How did I check
indexation?
Sorry, this section is not
available for non
attendees!
Frequent question for SEOs:
How long will it take for Google
to update index after a
migration?
Googlebot 2.1 pages crawled per day, vs Search Console
Where to find pages crawled per day in Search
Console?
Tip: Disable all CSS styles for easy copy
How many unique URLs are crawled per day?
Googlebot only crawled 766 pages out of the 1,129
we wanted crawled over 28 days.
More realistic, still estimated, but slightly less
bullshit:
● 766 unique, indexable pages were crawled over 28 days
● That gives us an average of 27 unique pages crawled
per day.
● 1129 total indexable pages / 27 = minimum 42 days for a
full recrawl.
Remember, this is a complete estimate.
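The same back-of-the-envelope calculation as a reusable sketch (it deliberately ignores redirect chains, which the next slide calls out):

```python
import math

def full_recrawl_days(unique_pages_crawled, sample_days, total_indexable_pages):
    """Rough lower bound on the days needed for a full recrawl."""
    pages_per_day = unique_pages_crawled / sample_days   # 766 / 28 ≈ 27
    return math.ceil(total_indexable_pages / pages_per_day)

print(full_recrawl_days(766, 28, 1129))  # -> 42 days, minimum
```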
That doesn’t even account for how many times
Google has to figure out a 301 redirect.
Same calculation, different site (with approx 86,000
indexable pages)
This is not representative of any other site.
Fixing the problems isn’t
always easy.
But it does pay off.
Here’s some
fancy charts
moving up and to
the right as proof.
Things to
remember:
If it seems Google isn’t respecting robots.txt,
check:
10-day lag!
Server log analysis is hard. Here’s why:
● Data size challenges, for example: 7 million events = 1.6 GB (and that’s tiny)
● Lots of different servers logging with custom formats
● Often, obtaining them means overcoming people problems & technical
challenges
● Any small mistake combining crawl, analytics, and Search Console data can make
the entire analysis useless
● Combining large datasets requires either some form of programming or
technical knowledge; it’s not for everyone.
● Many available tools aren’t comprehensive enough for SEO purposes yet.
That being said, they are the best thing since patatas bravas
con alioli.
Things that can corrupt your results
● Thinking you’re seeing Googlebot but it’s not really Googlebot
● Not accounting for robots.txt changes or other directive changes in the
crawl data during the logging period
● Incorrect field mapping, e.g. mistaking the referer for the requested page
● Incorrect merging of crawl and analytics data
Helpful links for log analysis
Guides:
● A Complete Guide to Log Analysis with Big Query - Dominic Woodman
● The Ultimate Guide to Log File Analysis - Daniel Butler
● SEO Finds in Your Server Log - Tim Resnik
● How to Use Server Log Analysis for Technical SEO - Samuel Scott
Software:
● Splunk
● SEO Log File Analyser
● Logz.io
● Botify
Muchas gracias Untagged!
@dsottimano
http://www.definemg.com
