Doing SEO for large websites.
How is it different and how can we be good at it?
…
…
…
4.5km
31 m
1.8m & 150kg
1.8m & 220kg
17x larger
4,913x heavier
220kg - 8,800 calories
~43,586,000 calories a day
257 calories
~2,307 Big Macs
88 McDonald's
52,175,112 Calories
x2
x2 x2
x2 x3
x2
SLOWER
DIFFICULT TO WORK WITH
Templates
Getting (& processing) data
Finding technical issues
Preventing technical issues
Common problems
Templates
I would like 1,000 problems, please.
“Please fix all 18,304 pages”
LIES
5
Category
Home page
Product
Contact Us
Obviously different
Main category page with a small number of products
Out of stock product
Extremes
Facet category page
Reviews page 2
Same page different URL
Country
County
City
Area/District
Street
Getting (& processing) data
Impressions week by week for new content
Pre-change / Post-change
Clicks pre and post change for site sections
Competing pages for a set of terms
SLOWER
DIFFICULT TO WORK WITH
SAMPLING
LIMITS (1,000 rows at a time)
LAG
SEGMENTATION
Search Console properties for a large brand.
Part 1: Search console
Part 2: Data Studio
Part 3: APIs
Part 4: Data warehousing
Register all the things.
5 sub-folders provided 260% more keywords.
Part 1: Search console
Part 2: Data Studio
Part 3: APIs
Part 4: Data warehousing
Data Studio for extracting data
● Add a Google Search Console data source.
● Create a table for it.
● Download the table.
You'll get everything in the table.
Part 1: Search console
Part 2: Data Studio
Part 3: Python
Part 4: Data warehousing
Getting data from APIs
Pull down your analytics data:
● Daily_google_analytics_v3
● Getting Search Console data from the API
Getting started with pandas:
● Pandas tutorial with ranking data
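For illustration, a minimal sketch of pulling Search Console data from the API into pandas. It assumes google-api-python-client is installed and a service account JSON key has access to the property; the key filename, property URL and dates are placeholders.

```python
# Minimal sketch: query the Search Console API and load the rows into pandas.
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
service = build("searchconsole", "v1", credentials=creds)

body = {
    "startDate": "2021-01-01",   # placeholder date range
    "endDate": "2021-01-31",
    "dimensions": ["page", "query"],
    "rowLimit": 25000,           # API maximum per request
}
response = service.searchanalytics().query(
    siteUrl="https://example.com/", body=body  # placeholder property
).execute()

rows = response.get("rows", [])
df = pd.DataFrame(
    {
        "page": r["keys"][0],
        "query": r["keys"][1],
        "clicks": r["clicks"],
        "impressions": r["impressions"],
    }
    for r in rows
)
print(df.head())
```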
As a workflow I'd highly recommend Jupyter notebooks for getting started.
● Why use Jupyter notebooks?
● SearchLove Video (paid)
SEO Pythonistas
A memorial to, and soon-to-be collection of, Hamlet's excellent work.
SEO Pythonistas - In loving memory of Hamlet Batista
@DataChaz
Part 1: Search console
Part 2: Data Studio
Part 3: Python
Part 4: Data warehousing
[Workflow diagram: Get data → Analyse]
[Workflow diagram with a warehouse: Get data → Store data → Analyse → Report]
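As a sketch of the "store data" step, assuming BigQuery is used as the warehouse and google-cloud-bigquery (with pyarrow) is installed; the project, dataset and table names are placeholders.

```python
# Minimal sketch: append a DataFrame of Search Console rows to a BigQuery table
# so it can be analysed and reported on later. Uses default GCP credentials.
import pandas as pd
from google.cloud import bigquery

# Example rows - in practice this would be the DataFrame pulled from the API.
df = pd.DataFrame([
    {"date": "2021-01-01", "page": "https://example.com/", "query": "example",
     "clicks": 10, "impressions": 120},
])

client = bigquery.Client()
table_id = "my-project.seo_warehouse.search_console_daily"  # placeholder

job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
print(f"Loaded {len(df)} rows into {table_id}")
```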
A developer could do it.
Rolling your own
JC Chouinard has built a series of excellent granular tutorials which walk you through setting up one on your own machine.
Link.
Off the shelf
Get in touch with me!
I run PipedOut, which is software for building SEO data warehouses.
Finding technical issues
Part 1: Templates
Part 2: Logs
Part 3: Crawling Big
Not the same fields as a crawl. No page title, for example.
● Crawling & indexing problems
● Measuring freshness (time until article crawled)
● Prioritisation
● Monitoring website changes (e.g. migrations)
[Chart: status codes (200 / 301 / 302) on product pages, Apr '19 - Oct '19]
● Debugging
Hi x
I'm {x} from {y} and we've been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).
What time period do we want?
What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they're spending their time, the status code errors they're finding, etc.
We can absolutely do analysis with a month or so (we've even done it with just a week or two), but it means we lose historical context, and obviously we're more likely to miss things on a larger site.
There are also some things that are really helpful for us to know when getting logs.
Do the logs have any personal information in them?
We're only interested in the various search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
Can we get logs from as close to the edge as possible?
It's pretty likely you've got a couple of different layers of your network that might log. Ideally we want logs from as close to the edge as possible. This prevents a couple of issues:
● If you've got caching going on, like a CDN or Varnish, then if we get logs from behind them, we won't see any of the requests they answer.
● If you've got a load balancer distributing to several servers, sometimes the external IP gets lost (perhaps X-Forwarded-For isn't working), which we need in order to verify Googlebot, or we accidentally only get logs from a couple of servers.
Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well. (Although of course if you're sending us CDN logs this won't matter.)
How do you log hostname and protocol?
It's very helpful for us to be able to see hostname & protocol. How do you distinguish those in the log files?
Do you log HTTP & HTTPS to separate files? Do you log hostname at all?
This is one of the problems that's often solved by getting logs closer to the edge: while many servers won't give you those by default, load balancers and CDNs often will.
Where would we like the logs?
In an ideal world, they would be files in an S3 bucket and we can pull them down from there. If possible, we'd also ask that multiple files aren't zipped together for upload, because that makes processing harder. (No problem with compressed logs, just with zipping multiple log files into a single archive.)
Is there anything else we should know?
Best,
{x}
ELK Stack
Pros
● Good for basic monitoring.
● Your developers might already have it.
Cons
● Quite hard to learn.
● Not great for analysis past the basics.
AWS Athena
Pros
● If your logs are being stored in AWS S3, it's very easy to set up.
● Powerful analysis.
Cons
● Interface is clunky.
● SQL debugging isn't good.
BigQuery
Pros
● Best analysis platform.
● Easy to use.
● Excellent debugging.
Cons
● Someone will have to actively load the data into it.
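If the logs do end up in BigQuery, a minimal sketch of a typical analysis follows. It assumes a hypothetical table `my-project.logs.web_requests` with timestamp, url, status_code and user_agent columns; note that filtering on the user agent alone is not full Googlebot verification.

```python
# Minimal sketch: count Googlebot hits by status code over the last 30 days.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials

sql = """
    SELECT status_code, COUNT(*) AS hits
    FROM `my-project.logs.web_requests`
    WHERE user_agent LIKE '%Googlebot%'
      AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY status_code
    ORDER BY hits DESC
"""

df = client.query(sql).to_dataframe()  # pull the result into pandas
print(df)
```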
Part 1: Templates
Part 2: Logs
Part 3: Crawling Big
Sampling your crawl
● Limit your crawl percentage per template, e.g.:
● 20% to product pages
● 30% to category pages
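A minimal sketch of that per-template sampling, assuming you already have a full URL list (e.g. from sitemaps or a previous crawl); the URL patterns, percentages and input file name are illustrative.

```python
# Minimal sketch of per-template crawl sampling: keep a fixed share of URLs
# per template and crawl only the sampled list.
import random
import re

SAMPLE_RATES = {
    r"/product/": 0.20,   # crawl 20% of product pages
    r"/category/": 0.30,  # crawl 30% of category pages
}

def sample_urls(urls, rates, default_rate=1.0, seed=42):
    """Return a sampled subset of URLs, keeping a fixed share per template."""
    random.seed(seed)
    kept = []
    for url in urls:
        rate = default_rate
        for pattern, template_rate in rates.items():
            if re.search(pattern, url):
                rate = template_rate
                break
        if random.random() < rate:
            kept.append(url)
    return kept

with open("all_urls.txt") as f:  # hypothetical input file, one URL per line
    all_urls = [line.strip() for line in f if line.strip()]

crawl_list = sample_urls(all_urls, SAMPLE_RATES)
print(f"Sampled {len(crawl_list)} of {len(all_urls)} URLs")
```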
Low memory crawler
Runs locally on your machine and allows you to crawl with a very low memory footprint. Doesn't render JS or process data, however.
Run SF in the cloud
You can purchase a super high memory computer in the cloud, install SF on it and run it at maximum speed.
Preventing technical issues
Part 1: Manually crawling
Part 2: Change detection reports
Part 3: Unit testing
Change detection with SF
Part 1: Manually crawling change detection
Part 2: Automating assertions
Part 3: Unit testing
<meta name="robots" content="noindex"> vs <meta name="robots" content="noindex,nofollow">
Is it different?
Is it the value I want?
Element | Equals
Title | Big Brown Shoe - £12.99 - Example.com
Status Code | 200
H1 | Big Brown Shoe
Canonical | <link rel="canonical" href="https://example.com/product/big-brown-shoe" />
CSS Selector: #review-counter | Any number
CSS Selector: #product-data | {
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Big Brown Shoe",
  "description": "The biggest brownest shoe you can find.",
  "sku": "0446310786",
  "mpn": "925872"
}
[Diagram: a Python script / app compares what we got against what we expect, and returns a status code for debugging]
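As a minimal sketch of such an assertion script, assuming requests and BeautifulSoup are available; the URL and expected values mirror the illustrative table above.

```python
# Minimal sketch: fetch a page and assert the SEO elements we expect.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/big-brown-shoe"          # placeholder URL
EXPECTED_TITLE = "Big Brown Shoe - £12.99 - Example.com"
EXPECTED_CANONICAL = "https://example.com/product/big-brown-shoe"

resp = requests.get(URL, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

failures = []
if resp.status_code != 200:
    failures.append(f"status code: got {resp.status_code}, expected 200")
if soup.title is None or soup.title.get_text(strip=True) != EXPECTED_TITLE:
    failures.append("title does not match")
canonical = soup.find("link", rel="canonical")
if canonical is None or canonical.get("href") != EXPECTED_CANONICAL:
    failures.append("canonical does not match")
reviews = soup.select_one("#review-counter")
if reviews is None or not reviews.get_text(strip=True).isdigit():
    failures.append("#review-counter is not a number")

# Print what failed (for debugging) and exit with a status code for automation.
for f in failures:
    print("FAIL:", f)
raise SystemExit(1 if failures else 0)
```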
Part 1: Manually crawling
Part 2: Change detection reports
Part 3: Unit testing
Create code → Test code → Deployment
All our hard work.
Is the canonical tag set to:
<link rel="canonical" href="https://example.com/mypage/" />
Production
Development
False
True
Unit tests
endtest.io
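A minimal sketch of what such a unit test could look like in pytest, assuming there is a dev/staging server to test against; the base URL, paths and expected canonicals are placeholders.

```python
# Minimal sketch: pytest checks that rendered pages carry the canonical we expect.
import pytest
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://localhost:8000"  # placeholder: the environment under test


@pytest.mark.parametrize(
    "path,expected_canonical",
    [
        ("/mypage/", "https://example.com/mypage/"),
        ("/product/big-brown-shoe", "https://example.com/product/big-brown-shoe"),
    ],
)
def test_canonical_tag(path, expected_canonical):
    html = requests.get(BASE_URL + path, timeout=10).text
    canonical = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    assert canonical is not None, f"no canonical tag on {path}"
    assert canonical["href"] == expected_canonical
```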
Common problems
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
Turns out we accidentally indexed identical content on a different sub-domain.
DNS Dumpster
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
[Timeline: 00:00 - 06:00 - 12:00 - 18:00 - 24:00]
Generate sitemaps
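The slide only shows when sitemaps get generated; as a related sketch for large sites, here is one way to generate them in Python, splitting the URL list into files of at most 50,000 URLs (the sitemap protocol limit) plus a sitemap index. The input file, output paths and base URL are placeholders.

```python
# Minimal sketch: split a large URL list into <=50,000-URL sitemap files and
# write a sitemap index pointing at them.
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS = 50000
BASE = "https://example.com"  # placeholder domain the sitemaps are hosted on

with open("all_urls.txt") as f:  # hypothetical input: one URL per line
    urls = [line.strip() for line in f if line.strip()]

sitemap_files = []
for i in range(0, len(urls), MAX_URLS):
    chunk = urls[i:i + MAX_URLS]
    filename = f"sitemap-{i // MAX_URLS + 1}.xml"
    with open(filename, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in chunk:
            out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        out.write("</urlset>\n")
    sitemap_files.append(filename)

# Write the sitemap index that references each sitemap file.
with open("sitemap-index.xml", "w") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for filename in sitemap_files:
        out.write(f"  <sitemap><loc>{BASE}/{filename}</loc>"
                  f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
    out.write("</sitemapindex>\n")
```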
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
[Table: facet pages - Enamel / Black / Enamel Charm / Enamel Charm Used - each shown with values 0.3 and 3]
Robots.txt block
nofollow
Conclusions
@dom_woodman
bit.ly/seo-for-large-websites
bit.ly/seo-data-warehouse
@dom_woodman
Editor's Notes

  • #2 Last line, let's get to it
  • #3 We are almost near the release of Godzilla vs Kong. My fiancée is a huge fan of these movies; if you've not seen them, they are giant silly monster movies, where cities are torn down and we all just live with this shit.
  • #4 I think they have a very clear storyboarding process: fire anyone who asks “how would this work” and replace them with someone who says “but what if he lived in Atlantis”, “but what if dragons” ------- “Godzilla fights his nemesis” “What if his nemesis was a giant dragon” “What if the giant dragon controlled all the other giant monsters” “What if Godzilla lived in Atlantis” “With a nuclear fountain” They are just the masters of escalation. They unfreeze a super dragon, another ancient lizard awakes to fight it, when it gets hurt it retreats to Atlantis, where it heals with nuclear warheads. It's joyful how huge and silly everything is. You leave your disbelief at the door, because in reality when things get large, everything that was previously easy suddenly becomes hard.
  • #6 I do think these movies are a good example of how things become hard
  • #7 Could we have a living version of king kong? I think King Kong is a really good example of the problems you get when you scale something up. SO RUN WITH ME for a minute. Give me 3 minutes
  • #8 We do have Skull Island
  • #71 1. They panicked and wasted half a day looking into this. 2. Then they had to filter it out of future reports.
  • #187 wh