How to Automatically Subcategorise Your Website With Python
The document describes a Python script that can automatically generate new subcategories for an ecommerce website based on clustering product names. It discusses:
- Using NLTK to generate n-grams from product names to cluster related products
- Filtering the n-grams to keep only those with commercial value by checking for search volume and CPC data
- Running the script on a large home improvement site to identify over 1,650 new subcategory opportunities with a total search volume of over 13 million
- Sharing the script so others can automate subcategory identification for their own sites to scale up an important SEO tactic.
Overview of the presenter, Lee Foot, his experience, and achievements in SEO.
Introduction to Python as a high-level programming language, popular for automating tasks and data manipulation.
Outline of the presentation covering benefits of subcategorisation, required resources, processes, script outputs, and limitations.
Explains how subcategorisation can increase search traffic, with a specific example showing potential search gains.
Current manual methods for identifying categorisation opportunities are inefficient and miss potential.
Presentation of a Python script designed to automate the subcategorisation process and improve efficiency.
Explanation of n-grams for keyword generation, and required tools such as Screaming Frog and API services.
Steps to crawl a website to extract product and category information needed for further processing.
Details the clustering of keywords, filtering out irrelevant ones, and final counts for actionable subcategories.
Results of the script process, showing significant keyword reduction and suggesting high-volume subcategories.
Limitations of the script regarding naming conventions and the need for some manual cleanup.
Ideas on how to automate the script execution and integrate it into regular client work.
Rationale for using Screaming Frog for website crawling over building dedicated crawlers.
Information on where to download the Python script and resources for learning Python in SEO.
What is Python?
Python is a high-level programming language which is perfect for automating repetitive tasks
Very popular in the data science community
Becoming very popular with technical SEOs
Especially for data blending and automation
Example category: Sofas
Leather Sofas: +19,000
Velvet Sofas: +1,200
Lounge Sofas: +150
Creating three new subcategories would create an additional 21,000+ searches a month*
*source ahrefs.com
This method will produce a lot of additional traffic for any eCommerce site
We wrote a Python script to automate the process and do the hard work for us!
@LeeFootSEO | #BrightonSEO
The products suggest the categories for us!
Leather Buttoned Sofa
Mid Century Leather Sofa
Tetbury Leather Sofa - Black
Hardwick Leather Sofa
Tetbury Leather Sofa - Tan
By clustering the product names together our script was able to find opportunities for new categories
Total Opportunity: Cox & Cox
New Subcategories: 185
Search Volume: 1,400,000
In testing we ran the script on Homebase and found opportunity to create 1,650 subcategories with over 13,000,000 estimated monthly searches
This would take a LONG time to do manually! (Assuming you could work as efficiently as a computer!)
At the end of this talk I’m going to share this script with instructions so you can use it on your own websites
The Method
We’ll be using Python and the NLTK library to generate hundreds of thousands of n-gram combinations from product names
Examples of n-grams the script will generate from clustering product names:
aa alkaline
aa alkaline batteries
aa alkaline batteries command
aa alkaline batteries command adjustables
aa alkaline batteries command adjustables self
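As a rough illustration of the n-gram step, the sketch below counts how many product names contain each word n-gram, using the sofa product names from the earlier slide. It is not the script shared in the talk: the function names, the 2 to 4 word range and the hyphen handling are assumptions made for this example. It uses nltk.util.ngrams when NLTK is installed and an equivalent fallback otherwise.

```python
from collections import Counter

try:
    # nltk.util.ngrams yields tuples of n consecutive tokens
    from nltk.util import ngrams
except ImportError:
    # Fallback with the same behaviour, so the sketch runs without NLTK
    def ngrams(tokens, n):
        return zip(*(tokens[i:] for i in range(n)))


def product_ngrams(name, min_n=2, max_n=4):
    """Return the set of lowercase word n-grams found in one product name."""
    # Assumption: hyphens are separators, so "Sofa - Black" gives "sofa black"
    tokens = name.lower().replace("-", " ").split()
    found = set()
    for n in range(min_n, max_n + 1):
        for gram in ngrams(tokens, n):
            found.add(" ".join(gram))
    return found


def count_ngrams(product_names, min_n=2, max_n=4):
    """Count how many product names contain each n-gram."""
    counts = Counter()
    for name in product_names:
        counts.update(product_ngrams(name, min_n, max_n))
    return counts


products = [
    "Leather Buttoned Sofa",
    "Mid Century Leather Sofa",
    "Tetbury Leather Sofa - Black",
    "Hardwick Leather Sofa",
    "Tetbury Leather Sofa - Tan",
]
counts = count_ngrams(products)
print(counts["leather sofa"])  # number of products containing "leather sofa"
```

Counting each n-gram once per product (a set per name) means the count can be read as "products available for this candidate subcategory".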
Only one of these suggestions has commercial value
Our goal is to programmatically discard the nonsensical ones and keep any with commercial value
So let’s check for search volume!
aa alkaline (20)
aa alkaline batteries (80)
aa alkaline batteries command (0)
aa alkaline batteries command adjustables (0)
aa alkaline batteries command adjustables self (0)
Everything in red (the three n-grams without a volume figure) will be discarded automatically because it has no search volume
aa alkaline (20)
aa alkaline batteries (80)
aa alkaline batteries command
aa alkaline batteries command adjustables
aa alkaline batteries command adjustables self
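The discard step can be pictured as a simple filter on search volume. The sketch below uses the volumes shown on the slide; in practice they would come from a keyword volume service, and the function name and the min_volume parameter are assumptions for this example rather than part of the shared script.

```python
def keep_with_volume(volumes, min_volume=1):
    """Keep only n-grams whose monthly search volume meets the minimum."""
    return {kw: vol for kw, vol in volumes.items() if vol >= min_volume}


# Volumes as shown on the slide
volumes = {
    "aa alkaline": 20,
    "aa alkaline batteries": 80,
    "aa alkaline batteries command": 0,
    "aa alkaline batteries command adjustables": 0,
    "aa alkaline batteries command adjustables self": 0,
}
kept = keep_with_volume(volumes)
print(kept)
```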
Checking n-grams for keyword volume does a lot of the hard work but it’s not perfect
aa alkaline (20)
aa alkaline batteries (80)
To deal with this we have included configurable pre- and post-filtering options
Keep Longest Word Fragment = True
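The talk does not spell out how the "Keep Longest Word Fragment" option works. One plausible reading, shown here purely as an assumption, is that a keyword is dropped when it appears as a whole-word fragment of a longer surviving keyword. The function name is hypothetical.

```python
def keep_longest_fragments(keywords):
    """Drop any keyword that is a whole-word fragment of a longer keyword."""
    kept = []
    for kw in keywords:
        padded = f" {kw} "  # pad with spaces so only whole words match
        is_fragment = any(
            other != kw and padded in f" {other} " for other in keywords
        )
        if not is_fragment:
            kept.append(kw)
    return kept


survivors = ["aa alkaline", "aa alkaline batteries"]
print(keep_longest_fragments(survivors))
```

Under this reading "aa alkaline" is removed because it is contained in "aa alkaline batteries", leaving the more descriptive phrase.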
You Will Need
Screaming Frog - to crawl the site
Keywords Everywhere API - to check search volume ($10 for 100,000 credits)
Python with the following libraries imported:
NLTK - used to create n-gram word combinations
PolyFuzz - to match keywords to existing categories
.csv exports are read into Python and processed with the Natural Language Toolkit (NLTK) library.
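A minimal sketch of the read step, using only the standard library csv module. The column names ("Product URL", "Product Name", "Category URL") and the sample rows are assumptions for illustration; a real crawl export will have its own headers, which would need mapping accordingly.

```python
import csv
import io

# Stand-in for a crawl export file; the column names are assumptions
SAMPLE_EXPORT = """Product URL,Product Name,Category URL
/p/1,Tetbury Leather Sofa - Black,/sofas
/p/2,Hardwick Leather Sofa,/sofas
/p/2,Hardwick Leather Sofa,/living-room
"""


def load_products(handle):
    """Read (category, product name) pairs from a CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["Category URL"], row["Product Name"].strip()) for row in reader]


rows = load_products(io.StringIO(SAMPLE_EXPORT))
print(rows)
```

With a real export the same function could be called on a file opened with open(path, newline="", encoding="utf-8").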
Cluster
Product names are clustered together using n-grams to generate new keywords
Keyword
aa alkaline
aa alkaline batteries
aa alkaline batteries command
aa alkaline batteries command adjustables
aa alkaline batteries command adjustables self
aa alkaline batteries command adjustables self adhesive
aa alkaline batteries duracell
aa alkaline batteries duracell optimum
aa alkaline batteries duracell optimum aa
aa alkaline batteries duracell optimum aa batteries
aa alkaline batteries duracell plus
aa alkaline batteries duracell plus battery
aa alkaline batteries duracell plus battery pack
aa alkaline batteries duracell plus lr
aa alkaline batteries duracell plus lr aa
aa alkaline batteries duracell specialty
aa alkaline batteries duracell specialty alkaline
aa alkaline batteries duracell specialty alkaline button
aa alkaline batteries energizer
aa alkaline batteries energizer maxplus
aa alkaline batteries energizer maxplus aa
aa alkaline batteries energizer maxplus aa batteries
Products are clustered category by category (so if a product lives in two categories, it’ll be clustered twice)
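Category-by-category clustering can be sketched by keeping a separate n-gram counter per parent category. This is an illustration under stated assumptions, not the shared script: the function names and the 2 to 3 word range are hypothetical, and the min_products default of three mirrors the threshold the talk mentions in its filtering step.

```python
from collections import Counter, defaultdict


def ngrams_in(name, min_n=2, max_n=3):
    """Return the set of word n-grams in a product name."""
    tokens = name.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def suggest_per_category(rows, min_products=3):
    """rows: (category, product name) pairs. Returns {category: {ngram: count}}."""
    counts = defaultdict(Counter)
    for category, name in rows:
        counts[category].update(ngrams_in(name))
    return {
        category: {g: c for g, c in counter.items() if c >= min_products}
        for category, counter in counts.items()
    }


rows = [
    ("/sofas", "Mid Century Leather Sofa"),
    ("/sofas", "Hardwick Leather Sofa"),
    ("/sofas", "Tetbury Leather Sofa"),
    ("/living-room", "Hardwick Leather Sofa"),  # same product, second category
]
suggestions = suggest_per_category(rows)
print(suggestions)
```

Because the Hardwick sofa lives in two categories it is counted in both, but only /sofas has enough matching products to produce a suggestion.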
Filtering
We started by generating over half a million n-grams using existing products on wilko.com
597,664
34,000 were matched to a minimum of three products and the rest discarded
34,100
Just under 9,000 keywords remained after deduplication
These were then checked for search volume
8,969
The final output contained 1,883 subcategorisation opportunities ready to QA
1,883
99.68% of all keywords were discarded before the final output!
Essentially, we brute forced the opportunity
597,664
34,100
8,969
1,883
Typical Script Output
Total Subcategories Generated: 597,6
Matched to Min of 3 Products: 34,088
Remaining after de-duplication: 8,969
Subcategories with Search Volume: 1,8
Total Volume: 8,023,629
Discarded: 99.68% of Keywords!
Completed in: 16.15 Minutes
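The discard percentage follows directly from the first and last funnel figures. The short calculation below uses the counts quoted in the filtering slides (597,664 generated, 1,883 in the final output) and the 34,088 figure from the script output; the variable names are only for this example.

```python
funnel = [
    ("n-grams generated", 597_664),
    ("matched to at least three products", 34_088),
    ("remaining after de-duplication", 8_969),
    ("with search volume", 1_883),
]

start = funnel[0][1]
final = funnel[-1][1]
discarded_pct = (1 - final / start) * 100

for label, count in funnel:
    print(f"{label}: {count:,} ({count / start:.2%} of the starting set)")
print(f"Discarded: {discarded_pct:.2f}% of keywords")
```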
Parent Category | Suggested Subcategory | Vol | CPC | # Products | Similarity | Closest Matched Category
/outdoor-toys/climbing-frames.list | rope ladders | 2,400 | 0.28 | 4 | 73% | loft ladder new ladders
/outdoor-toys/climbing-frames.list | wooden climbing frames | 90 | 0.78 | 3 | 72% | climbing plants
/outdoor-toys/garden-swings.list | double swing sets | 1,900 | 0.54 | 3 | 61% | double beds
/outdoor-toys/garden-swings.list | single swing sets | 1,000 | 0.32 | 4 | 58% | garden swings
/outdoor-toys/garden-swings.list | wooden swing sets | 8,100 | 0.97 | 8 | 70% | wooden garden swing seats
/outdoor-toys/ride-on-toys.list | pro stunt scooter | 320 | 0.61 | 5 | 20% | protect garden
/outdoor-toys/role-play-toys.list | outdoor play kitchen | 320 | 0.38 | 3 | 82% | outdoor kitchens
/outdoor-toys/sandpits.list | activity tables | 6,600 | 0.24 | 7 | 38% | cavity wall
/outdoor-toys/sandpits.list | planter tables | 5,400 | 0.28 | 3 | 76% | planters
/outdoor-toys/sandpits.list | plum discovery toys | 320 | 0.7 | 3 | 43% | ecover
/outdoor-toys/sandpits.list | water tables | 27,100 | 0.37 | 3 | 59% | 6 seater tables
/outdoor-toys/sandpits.list | water tracks | 480 | 0.3 | 3 | 47% | track set shop by room
/outdoor-toys/trampolines.list | junior trampolines | 880 | 0.31 | 4 | 66% | trampolines
It also shows the number of products available to populate the new categories!
Parent Category | Suggested Subcategory | Volume | CPC | # Products | Similarity | Closest Matched Category
/outdoor-toys/climbing-frames.list | rope ladders | 2,400 | 0.28 | 4 | 73% | loft ladder new ladders
/outdoor-toys/climbing-frames.list | wooden climbing frames | 90 | 0.78 | 3 | 72% | climbing plants
/outdoor-toys/garden-swings.list | double swing sets | 1,900 | 0.54 | 3 | 61% | double beds
/outdoor-toys/garden-swings.list | single swing sets | 1,000 | 0.32 | 4 | 58% | garden swings
/outdoor-toys/garden-swings.list | wooden swing sets | 8,100 | 0.97 | 4 | 70% | wooden garden swing seats
/outdoor-toys/ride-on-toys.list | pro stunt scooter | 320 | 0.61 | 5 | 20% | protect garden
/outdoor-toys/role-play-toys.list | outdoor play kitchen | 320 | 0.38 | 3 | 82% | outdoor kitchens
/outdoor-toys/sandpits.list | activity tables | 6,600 | 0.24 | 3 | 38% | cavity wall
/outdoor-toys/sandpits.list | planter tables | 5,400 | 0.28 | 4 | 76% | planters
/outdoor-toys/sandpits.list | plum discovery toys | 320 | 0.7 | 3 | 43% | ecover
/outdoor-toys/sandpits.list | water tables | 27,100 | 0.37 | 3 | 59% | 6 seater table
/outdoor-toys/sandpits.list | water tracks | 480 | 0.3 | 3 | 47% | track set shop by room
/outdoor-toys/trampolines.list | junior trampolines | 880 | 0.31 | 4 | 66% | trampolines
/outdoor-toys/trampolines.list | trampoline accessory kits | 70 | 0.26 | 4 | 69% | accessory d-line
Suggested categories with high search demand but low inventory can signal that it could be time to expand the range to tap into the demand…
(Low inventory, high demand)
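Spotting "low inventory, high demand" rows can be sketched as a two-condition filter over the output. The rows below use volumes and product counts from the table on this slide; the thresholds (at least 5,000 searches, at most 4 products) and the field names are assumptions chosen for illustration and would need tuning per site.

```python
def expansion_candidates(rows, min_volume=5000, max_products=4):
    """Return subcategories with high search volume but few matching products."""
    return [
        row["subcategory"]
        for row in rows
        if row["volume"] >= min_volume and row["products"] <= max_products
    ]


rows = [
    {"subcategory": "water tables", "volume": 27100, "products": 3},
    {"subcategory": "planter tables", "volume": 5400, "products": 4},
    {"subcategory": "wooden climbing frames", "volume": 90, "products": 3},
    {"subcategory": "junior trampolines", "volume": 880, "products": 4},
]
print(expansion_candidates(rows))
```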
Parent Category | Suggested Subcategory | Vol | CPC | # Products | Similarity | Closest Matched Category
/outdoor-toys/climbing-frames.list | rope ladders | 2,400 | 0.28 | 4 | 73% | loft ladder new ladders
/outdoor-toys/climbing-frames.list | wooden climbing frames | 90 | 0.78 | 3 | 72% | climbing plants
/outdoor-toys/garden-swings.list | double swing sets | 1,900 | 0.54 | 3 | 61% | double beds
/outdoor-toys/garden-swings.list | single swing sets | 1,000 | 0.32 | 4 | 58% | garden swings
/outdoor-toys/garden-swings.list | wooden swing sets | 8,100 | 0.97 | 8 | 70% | wooden garden swing seats
/outdoor-toys/ride-on-toys.list | pro stunt scooter | 320 | 0.61 | 5 | 20% | protect garden
/outdoor-toys/role-play-toys.list | outdoor play kitchen | 320 | 0.38 | 3 | 82% | outdoor kitchens
/outdoor-toys/sandpits.list | activity tables | 6,600 | 0.24 | 7 | 38% | cavity wall
/outdoor-toys/sandpits.list | planter tables | 5,400 | 0.28 | 3 | 76% | planters
/outdoor-toys/sandpits.list | plum discovery toys | 320 | 0.7 | 3 | 43% | ecover
/outdoor-toys/sandpits.list | water tables | 27,100 | 0.37 | 6 | 59% | 6 seater tables
/outdoor-toys/sandpits.list | water tracks | 480 | 0.3 | 3 | 47% | track set shop by room
/outdoor-toys/trampolines.list | junior trampolines | 880 | 0.31 | 4 | 66% | trampolines
All category suggestions are fuzzy matched against existing categories.
Categories which closely match existing categories (including plurals and words out of order) are removed automatically!
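The talk uses PolyFuzz for this matching step. The sketch below swaps in the standard library difflib.SequenceMatcher so it runs without extra installs, which means its scores will not match PolyFuzz's. The crude plural stripping, the word sorting, the 0.9 threshold and all function names are assumptions for this example.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase, crudely strip a plural 's', and sort words so order is ignored."""
    words = [
        w[:-1] if len(w) > 3 and w.endswith("s") else w
        for w in text.lower().split()
    ]
    return " ".join(sorted(words))


def closest_category(suggestion, categories):
    """Return (best matching category, similarity between 0 and 1)."""
    target = normalise(suggestion)
    scored = [
        (SequenceMatcher(None, target, normalise(cat)).ratio(), cat)
        for cat in categories
    ]
    score, best = max(scored)
    return best, score


def drop_existing(suggestions, categories, threshold=0.9):
    """Keep only suggestions that do not closely match an existing category."""
    return [
        s for s in suggestions
        if closest_category(s, categories)[1] < threshold
    ]


existing = ["garden swings", "trampolines", "planters"]
candidates = ["garden swing", "swings garden", "water tables"]
print(drop_existing(candidates, existing))
```

Normalising before scoring is what lets plurals and reordered words count as matches of an existing category, so only the genuinely new suggestion survives.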
Limitations and Considerations
The output is only as good as the naming conventions. If product names are short or non-descriptive then that’ll affect the final output.
The script will output keywords in the singular form, whereas categories will be pluralised because they contain more than a single product
Automation
This script can be automated on a VPS in conjunction with an automated crawl setup.
Perhaps client work can be road mapped every three months with the output automatically sent by email or to a Slack channel
Remixes and Mashups
I’d love to see some remixes, mashups and improvements to the script.
Just make sure you tag me in anything you make!
Don’t Wait 🐍🔥
There is an awesome community of SEOs online who are passionate about Python.
If you’re thinking about getting started, come and join us!
Python Resources
YouTube Channels
Corey Schafer
Data School
Socratica
MIT Introduction to Computer Science & Python
Apps
Solo Learn (Android / iPhone)
Books
Automate the Boring Stuff
Python SEOs to follow on Twitter
@GregBernhardt4
@DataChaz
@OritSiMu
@DanielHereMe
@SEOPythonistas
@rvtheverett
@vdrweb
@LeeFootSEO 😃
I’ll tweet this out at the end too. There is a great community of Python enthusiasts and professionals online.
If you want to get started, don’t wait! Make things and dive