Doing SEO for large websites.
How is it different and how can we be good at it?
…
…
…
4.5km
31 m
1.8m & 150kg
1.8m & 220kg
17x larger
4,913x heavier
220kg - 8,800 calories
~43,586,000 calories a day
257 calories
~2,307 Big Macs
88 McDonald's
52,175,112 Calories
x2
x2 x2
x2 x3
x2
SLOWER
DIFFICULT TO WORK WITH
Templates
Getting (& processing) data
Finding technical issues
Preventing technical issues
Common problems
Templates
I would like 1,000 problems, please.
“Please fix all 18,304 pages”
LIES
5
Category
Home page
Product
Contact Us
Obviously different
Main category page with a small number of products
Out of stock product
Extremes
Facet category page
Reviews page 2
Same page different URL
Country
County
City
Area/District
Street
Getting (& processing) data
Impressions week by week for new content
Pre-change / Post-change
Clicks pre and post change for site sections
Competing pages for a set of terms
SLOWER
DIFFICULT TO WORK WITH
SAMPLING
LIMITS (1,000 rows at a time)
LAG
SEGMENTATION
Search Console properties for a large brand.
Part 1: Search console
Part 2: Data Studio
Part 3: APIs
Part 4: Data warehousing
Register all the things.
5 sub-folders provided 260% more keywords.
Part 1: Search console
Part 2: Data Studio
Part 3: APIs
Part 4: Data warehousing
Data Studio for extracting data
● Add a Google Search Console data source.
● Create a table for it.
● Download the table.
You'll get everything in the table.
Part 1: Search console
Part 2: Data Studio
Part 3: Python
Part 4: Data warehousing
Getting data from APIs
Pull down your analytics data:
● Daily_google_analytics_v3
● Getting Search Console data from the API
Getting started with pandas:
● Pandas tutorial with ranking data
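For illustration, a minimal sketch of pulling Search Console data from the API into pandas. It assumes google-api-python-client is installed and a service account JSON key has access to the property; the key filename, property URL and dates are placeholders.

```python
# Minimal sketch: query the Search Console API and load the rows into pandas.
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
service = build("searchconsole", "v1", credentials=creds)

body = {
    "startDate": "2021-01-01",   # placeholder date range
    "endDate": "2021-01-31",
    "dimensions": ["page", "query"],
    "rowLimit": 25000,           # API maximum per request
}
response = service.searchanalytics().query(
    siteUrl="https://example.com/", body=body  # placeholder property
).execute()

rows = response.get("rows", [])
df = pd.DataFrame(
    {
        "page": r["keys"][0],
        "query": r["keys"][1],
        "clicks": r["clicks"],
        "impressions": r["impressions"],
    }
    for r in rows
)
print(df.head())
```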
As a workflow I'd highly recommend Jupyter notebooks for getting started.
● Why use Jupyter notebooks?
● SearchLove Video (paid)
SEO Pythonistas
A memorial to, and soon-to-be collection of, Hamlet's excellent work.
SEO Pythonistas - In loving memory of Hamlet Batista
@DataChaz
Part 1: Search console
Part 2: Data Studio
Part 3: Python
Part 4: Data warehousing
[Workflow diagram: Get data → Analyse]
[Workflow diagram with a warehouse: Get data → Store data → Analyse → Report]
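As a sketch of the "store data" step, assuming BigQuery is used as the warehouse and google-cloud-bigquery (with pyarrow) is installed; the project, dataset and table names are placeholders.

```python
# Minimal sketch: append a DataFrame of Search Console rows to a BigQuery table
# so it can be analysed and reported on later. Uses default GCP credentials.
import pandas as pd
from google.cloud import bigquery

# Example rows - in practice this would be the DataFrame pulled from the API.
df = pd.DataFrame([
    {"date": "2021-01-01", "page": "https://example.com/", "query": "example",
     "clicks": 10, "impressions": 120},
])

client = bigquery.Client()
table_id = "my-project.seo_warehouse.search_console_daily"  # placeholder

job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
print(f"Loaded {len(df)} rows into {table_id}")
```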
A developer could do it.
Rolling your own
JC Chouinard has built a series of excellent granular tutorials which walk you through setting up one on your own machine.
Link.
Off the shelf
Get in touch with me!
I run PipedOut, which is software for building SEO data warehouses.
Finding technical issues
Part 1: Templates
Part 2: Logs
Part 3: Crawling Big
Not the same fields as a crawl. No page title, for example.
● Crawling & indexing problems
● Measuring freshness (time until article crawled)
● Prioritisation
● Monitoring website changes (e.g. migrations)
[Chart: status codes (200 / 301 / 302) on product pages, Apr '19 - Oct '19]
● Debugging
Hi x
I'm {x} from {y} and we've been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).
What time period do we want?
What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they're spending their time, the status code errors they're finding, etc.
We can absolutely do analysis with a month or so (we've even done it with just a week or two), but it means we lose historical context, and obviously we're more likely to miss things on a larger site.
There are also some things that are really helpful for us to know when getting logs.
Do the logs have any personal information in them?
We're only interested in the various search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
Can we get logs from as close to the edge as possible?
It's pretty likely you've got a couple of different layers of your network that might log. Ideally we want logs from as close to the edge as possible. This prevents a couple of issues:
● If you've got caching going on, like a CDN or Varnish, then if we get logs from behind them, we won't see any of the requests they answer.
● If you've got a load balancer distributing to several servers, sometimes the external IP gets lost (perhaps X-Forwarded-For isn't working), which we need in order to verify Googlebot, or we accidentally only get logs from a couple of servers.
Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well. (Although of course if you're sending us CDN logs this won't matter.)
How do you log hostname and protocol?
It's very helpful for us to be able to see hostname & protocol. How do you distinguish those in the log files?
Do you log HTTP & HTTPS to separate files? Do you log hostname at all?
This is one of the problems that's often solved by getting logs closer to the edge: while many servers won't give you those by default, load balancers and CDNs often will.
Where would we like the logs?
In an ideal world, they would be files in an S3 bucket and we can pull them down from there. If possible, we'd also ask that multiple files aren't zipped together for upload, because that makes processing harder. (No problem with compressed logs, just with zipping multiple log files into a single archive.)
Is there anything else we should know?
Best,
{x}
ELK Stack
Pros
● Good for basic monitoring.
● Your developers might already have it.
Cons
● Quite hard to learn.
● Not great for analysis past the basics.
AWS Athena
Pros
● If your logs are being stored in AWS S3, it's very easy to set up.
● Powerful analysis.
Cons
● Interface is clunky.
● SQL debugging isn't good.
BigQuery
Pros
● Best analysis platform.
● Easy to use.
● Excellent debugging.
Cons
● Someone will have to actively load the data into it.
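If the logs do end up in BigQuery, a minimal sketch of a typical analysis follows. It assumes a hypothetical table `my-project.logs.web_requests` with timestamp, url, status_code and user_agent columns; note that filtering on the user agent alone is not full Googlebot verification.

```python
# Minimal sketch: count Googlebot hits by status code over the last 30 days.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials

sql = """
    SELECT status_code, COUNT(*) AS hits
    FROM `my-project.logs.web_requests`
    WHERE user_agent LIKE '%Googlebot%'
      AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY status_code
    ORDER BY hits DESC
"""

df = client.query(sql).to_dataframe()  # pull the result into pandas
print(df)
```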
Part 1: Templates
Part 2: Logs
Part 3: Crawling Big
Sampling your crawl
● Limit your crawl percentage per template, e.g.:
● 20% to product pages
● 30% to category pages
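A minimal sketch of that per-template sampling, assuming you already have a full URL list (e.g. from sitemaps or a previous crawl); the URL patterns, percentages and input file name are illustrative.

```python
# Minimal sketch of per-template crawl sampling: keep a fixed share of URLs
# per template and crawl only the sampled list.
import random
import re

SAMPLE_RATES = {
    r"/product/": 0.20,   # crawl 20% of product pages
    r"/category/": 0.30,  # crawl 30% of category pages
}

def sample_urls(urls, rates, default_rate=1.0, seed=42):
    """Return a sampled subset of URLs, keeping a fixed share per template."""
    random.seed(seed)
    kept = []
    for url in urls:
        rate = default_rate
        for pattern, template_rate in rates.items():
            if re.search(pattern, url):
                rate = template_rate
                break
        if random.random() < rate:
            kept.append(url)
    return kept

with open("all_urls.txt") as f:  # hypothetical input file, one URL per line
    all_urls = [line.strip() for line in f if line.strip()]

crawl_list = sample_urls(all_urls, SAMPLE_RATES)
print(f"Sampled {len(crawl_list)} of {len(all_urls)} URLs")
```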
Low memory crawler
Runs locally on your machine and allows you to crawl with a very low memory footprint. Doesn't render JS or process data, however.
Run SF in the cloud
You can purchase a super high memory computer in the cloud, install SF on it and run it at maximum speed.
Preventing technical issues
Part 1: Manually crawling
Part 2: Change detection reports
Part 3: Unit testing
Change detection with SF
Part 1: Manually crawling change detection
Part 2: Automating assertions
Part 3: Unit testing
<meta name="robots" content="noindex"> vs <meta name="robots" content="noindex,nofollow">
Is it different?
Is it the value I want?
Element | Equals
Title | Big Brown Shoe - £12.99 - Example.com
Status Code | 200
H1 | Big Brown Shoe
Canonical | <link rel="canonical" href="https://example.com/product/big-brown-shoe" />
CSS Selector: #review-counter | Any number
CSS Selector: #product-data | {
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Big Brown Shoe",
  "description": "The biggest brownest shoe you can find.",
  "sku": "0446310786",
  "mpn": "925872"
}
[Diagram: a Python script / app compares what we got against what we expect, and returns a status code for debugging]
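As a minimal sketch of such an assertion script, assuming requests and BeautifulSoup are available; the URL and expected values mirror the illustrative table above.

```python
# Minimal sketch: fetch a page and assert the SEO elements we expect.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/big-brown-shoe"          # placeholder URL
EXPECTED_TITLE = "Big Brown Shoe - £12.99 - Example.com"
EXPECTED_CANONICAL = "https://example.com/product/big-brown-shoe"

resp = requests.get(URL, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

failures = []
if resp.status_code != 200:
    failures.append(f"status code: got {resp.status_code}, expected 200")
if soup.title is None or soup.title.get_text(strip=True) != EXPECTED_TITLE:
    failures.append("title does not match")
canonical = soup.find("link", rel="canonical")
if canonical is None or canonical.get("href") != EXPECTED_CANONICAL:
    failures.append("canonical does not match")
reviews = soup.select_one("#review-counter")
if reviews is None or not reviews.get_text(strip=True).isdigit():
    failures.append("#review-counter is not a number")

# Print what failed (for debugging) and exit with a status code for automation.
for f in failures:
    print("FAIL:", f)
raise SystemExit(1 if failures else 0)
```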
Part 1: Manually crawling
Part 2: Change detection reports
Part 3: Unit testing
Create code → Test code → Deployment
All our hard work.
Is the canonical tag set to:
<link rel="canonical" href="https://example.com/mypage/" />
Production
Development
False
True
Unit tests
endtest.io
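A minimal sketch of what such a unit test could look like in pytest, assuming there is a dev/staging server to test against; the base URL, paths and expected canonicals are placeholders.

```python
# Minimal sketch: pytest checks that rendered pages carry the canonical we expect.
import pytest
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://localhost:8000"  # placeholder: the environment under test


@pytest.mark.parametrize(
    "path,expected_canonical",
    [
        ("/mypage/", "https://example.com/mypage/"),
        ("/product/big-brown-shoe", "https://example.com/product/big-brown-shoe"),
    ],
)
def test_canonical_tag(path, expected_canonical):
    html = requests.get(BASE_URL + path, timeout=10).text
    canonical = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    assert canonical is not None, f"no canonical tag on {path}"
    assert canonical["href"] == expected_canonical
```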
Common problems
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
Turns out we accidentally indexed identical content on a different sub-domain.
DNS Dumpster
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
[Timeline: 00:00 - 06:00 - 12:00 - 18:00 - 24:00]
Generate sitemaps
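The slide only shows when sitemaps get generated; as a related sketch for large sites, here is one way to generate them in Python, splitting the URL list into files of at most 50,000 URLs (the sitemap protocol limit) plus a sitemap index. The input file, output paths and base URL are placeholders.

```python
# Minimal sketch: split a large URL list into <=50,000-URL sitemap files and
# write a sitemap index pointing at them.
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS = 50000
BASE = "https://example.com"  # placeholder domain the sitemaps are hosted on

with open("all_urls.txt") as f:  # hypothetical input: one URL per line
    urls = [line.strip() for line in f if line.strip()]

sitemap_files = []
for i in range(0, len(urls), MAX_URLS):
    chunk = urls[i:i + MAX_URLS]
    filename = f"sitemap-{i // MAX_URLS + 1}.xml"
    with open(filename, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in chunk:
            out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        out.write("</urlset>\n")
    sitemap_files.append(filename)

# Write the sitemap index that references each sitemap file.
with open("sitemap-index.xml", "w") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for filename in sitemap_files:
        out.write(f"  <sitemap><loc>{BASE}/{filename}</loc>"
                  f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
    out.write("</sitemapindex>\n")
```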
Part 1a: Crawling & Indexing
Part 1b: Subdomains
Part 2: Sitemaps
Part 3: Links
[Table: facet pages - Enamel / Black / Enamel Charm / Enamel Charm Used - each shown with values 0.3 and 3]
Robots.txt block
nofollow
Conclusions
@dom_woodman
bit.ly/seo-for-large-websites
bit.ly/seo-data-warehouse
@dom_woodman
Editor's Notes

  • #2 Last line, let's get to it
  • #3 We are almost near the release of Godzilla vs Kong. My fiancée is a huge fan of these movies; if you've not seen them, they are giant silly monster movies, where cities are torn down and we all just live with this shit.
  • #4 I think they have a very clear storyboarding process: fire anyone who asks “how would this work” and replace them with someone who says “but what if he lived in Atlantis”, “but what if dragons” ------- “Godzilla fights his nemesis” “What if his nemesis was a giant dragon” “What if the giant dragon controlled all the other giant monsters” “What if Godzilla lived in Atlantis” “With a nuclear fountain” They are just the masters of escalation. They unfreeze a super dragon, another ancient lizard awakes to fight it, when it gets hurt it retreats to Atlantis, where it heals with nuclear warheads. It's joyful how huge and silly everything is. You leave your disbelief at the door, because in reality when things get large, everything that was previously easy suddenly becomes hard.
  • #6 I do think these movies are a good example of how things become hard
  • #7 Could we have a living version of king kong? I think King Kong is a really good example of the problems you get when you scale something up. SO RUN WITH ME for a minute. Give me 3 minutes
  • #8 We do have Skull Island
  • #71 1. They panicked and wasted half a day looking into this. 2. Then they had to filter it out of future reports.
  • #187 wh