What a search engine can teach you about product sitemaps - BrightonSEO April 2018
Pricesearcher is a vertical search engine and our mission is to give consumers the complete view. Our technology processes 500m+ prices per day across 10 countries.
2017 saw the launch of Pricesearcher’s web crawler – PriceBot, to complete the indexing of all UK prices.
In this talk we will analyse what PriceBot discovered and how this information can help you improve the crawlability of your own site.
1.
Vlassios Rizopoulos
Chief Technology Officer @ pricesearcher.com
What a search engine can teach you about product sitemaps
@Pricesearcher #BrightonSEO
2.
BACKGROUND
Pricesearcher is a vertical search engine focusing on products and their prices.
Our mission is to provide access to all the world's prices in one place.
3.
OUR MISSION IS TO INDEX ALL THE WORLD’S PRICES
4.
SOURCES OF DATA
• Product feeds from 5000+ retailers
• Developed plugins
• Developed PriceBot to complete the picture
5.
PROGRESS TO DATE
• Gathered data on 1.1 billion products
• Online in 11 countries: GB / US / DE / FR / IT / IE / NO / SE / FI / DK / NG
• Gathered 91 billion price points for our products
• On average we check the price of a product 3 times a day
We have gathered:
• 17,000,000 ISBNs
• 144,000,000 MPNs
• 73,000,000 SKUs
• 157,000,000 GTINs
6.
WHAT IS PRICEBOT?
• PriceBot is our proprietary crawler, built to discover products and turn unstructured data from web pages into structured data for our product database
• Pricesearcher is the only product search engine that crawls to complement its product coverage
• PriceBot is fully robots.txt compliant, leaves behind a footprint in its user agent and has a built-in feedback mechanism
http://www.pricesearcher.com/pricebot
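As a rough illustration of what robots.txt compliance means for a crawler, the sketch below checks a URL against a site's rules before fetching it, using Python's standard `urllib.robotparser`. This is not PriceBot's actual implementation: the robots.txt content, the `is_allowed` helper and the `shop.example` URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; these rules are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/

User-agent: PriceBot
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("PriceBot", "https://shop.example/products/1"),
        ("PriceBot", "https://shop.example/private/report"),
        ("OtherBot", "https://shop.example/checkout/basket"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A compliant crawler would run a check like this for every URL it is about to request and skip the ones that are disallowed.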
7.
WHAT INFORMATION IS PRICEBOT COLLECTING?
We are looking to extract the following fields:
• Product Title
• Product Image
• Product Price
and optionally:
• Product Description
• Product Identifier (GTIN/UPC/EAN/ISBN)
• Product Brand
• Product Category
• Product Stock Availability
8.
INITIAL CRAWLING TECH DEPENDED ON SITEMAPS
Vastly simplified discovering all the products from retailers
9.
DATA SAMPLE
We will focus on the 4,000 UK retailers we currently crawl using XML sitemaps, discovering 20 million+ products
10.
TOP 10 Data Insights
from our crawling tech
11.
1. SITEMAP DATA
• 91% of retailer websites have an XML sitemap
• 61% of retailer websites have an XML sitemap with product links
• 54% of retailer websites have an XML sitemap that's regularly updated
12.
2. BLOCKING OF CRAWLERS
• 2% of retailer websites have blocked us unintentionally (generic robots.txt entry or 403 automatic block)
• 0.05% of retailer websites have blocked us intentionally (robots.txt entry)
13.
3. EXTRACTION USING METADATA STANDARDS
• 41% of retailer websites have product title + price + image defined using meta / opengraph tags
• 36% of retailer websites have product title + price + image defined using meta / itemprop tags (schema)
• 12% of retailer websites have product title + price + image defined using both
14.
4. EXTRACTION USING JAVASCRIPT
• 2% of retailer websites: no info extracted due to heavy rendering being uneconomical
• 1% of retailer websites: price cannot be extracted as it is converted / calculated on the fly
15.
5. SITEMAP LINKS
• 2% of retailer websites have multiple links to the same product pages
• 3% of retailer websites have multiple links to pages that return 404 codes
16.
6. PRODUCT IDENTIFIERS
• 24% of retailer websites provide a GTIN-14, EAN-13, UPC-12/8 for their products
• 7% of retailer websites provide an SKU for their products
• 3% of retailer websites provide an ISBN for their products
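Identifiers of the GTIN family (GTIN-14, EAN-13, UPC-12, GTIN-8) end in a check digit, so a site can sanity-check the values it publishes before exposing them to crawlers. The sketch below applies the standard GS1 mod-10 check; the function name `is_valid_gtin` is an assumption for this example and nothing here is taken from the talk itself.

```python
def is_valid_gtin(code: str) -> bool:
    """Check length and GS1 mod-10 check digit of a GTIN-8/12/13/14 string."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Working from the right of the body, weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


if __name__ == "__main__":
    for candidate in ["4006381333931", "4006381333932", "036000291452", "12345"]:
        print(candidate, is_valid_gtin(candidate))
```

The check only confirms that a code is well formed; it says nothing about whether the identifier has actually been assigned to the product.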
17.
7. PRODUCT CATALOGUE SIZE
• 14% of retailer websites have fewer than 5,000 product links in their sitemap
• 79% of retailer websites have between 5,000 and 30,000 links
• 7% of retailer websites have more than 30,000 links
18.
8. DATA RICHNESS #1
• 17% of retailer websites provide a brand for their products
• 44% of retailer websites provide a category for their products
• 62% of retailer websites provide a stock indicator for their products
19.
9. DATA RICHNESS #2 – NUMBER OF DIMENSIONS
• Crawler: 6 dimensions
• Plugin: 12 dimensions
• Product Feed: 23 dimensions
20.
10. SITEMAP DISCOVERABILITY
• 33% of retailer websites list their sitemap in robots.txt
21.
TOP 5 Action Points
suggestions
22.
ACTION POINT #1 - SITEMAP
• Have an XML sitemap
• Have the path of your sitemap listed in robots.txt
• Have your product pages in your sitemap
• Regularly update your sitemap
• Don’t point to 404 pages from your sitemap
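Some of these sitemap checks can be automated. The sketch below parses a sitemap with Python's standard `xml.etree.ElementTree`, lists the `<loc>` URLs and flags any that appear more than once. The sample sitemap, the `shop.example` URLs and the helper names are assumptions for illustration; a real check would also request each URL and look for 404 responses, which is left out here to keep the example offline.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content with one deliberately duplicated product URL.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/products/1</loc><lastmod>2018-04-01</lastmod></url>
  <url><loc>https://shop.example/products/2</loc><lastmod>2018-04-02</lastmod></url>
  <url><loc>https://shop.example/products/1</loc></url>
</urlset>
"""


def sitemap_locs(xml_text: str) -> List[str]:
    """Return every <loc> URL in a urlset sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def duplicate_locs(locs: List[str]) -> List[str]:
    """Return the URLs that are listed more than once."""
    counts = Counter(locs)
    return sorted(url for url, n in counts.items() if n > 1)


if __name__ == "__main__":
    locs = sitemap_locs(SAMPLE_SITEMAP)
    print(len(locs), "links,", len(set(locs)), "unique")
    print("duplicates:", duplicate_locs(locs))
```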
23.
ACTION POINT #2 - META / OPENGRAPH / ITEMPROP
• Provide structured information on your products using meta itemprop (schema) or opengraph tags
• Provide as much structured data as possible
• Implement them as close as possible to the standards
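To show why these tags make extraction cheap, here is a minimal sketch that reads `<meta>` tags carrying either a `property` (Open Graph style) or an `itemprop` (schema style) attribute, using only Python's standard `html.parser`. The sample page, its values and the helper names are made up for the example, and the exact property names a given crawler looks for are an assumption; this is not how PriceBot is known to work.

```python
from html.parser import HTMLParser

# Hypothetical product page head/body; all values are placeholders.
SAMPLE_HTML = """\
<html><head>
<meta property="og:title" content="Blue Kettle">
<meta property="og:image" content="https://shop.example/img/kettle.jpg">
<meta property="product:price:amount" content="24.99">
<meta property="product:price:currency" content="GBP">
</head><body>
<div itemscope itemtype="http://schema.org/Product">
<meta itemprop="brand" content="ExampleBrand" />
</div>
</body></html>
"""


class ProductMetaParser(HTMLParser):
    """Collect content values of <meta> tags keyed by property or itemprop."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("itemprop")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.fields.setdefault(key, content)


def extract_product_meta(html: str) -> dict:
    parser = ProductMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    for key, value in extract_product_meta(SAMPLE_HTML).items():
        print(key, "=", value)
```

Because the values sit in static markup, no JavaScript rendering is needed to read the title, image and price.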
24.
ACTION POINT #3 – JAVASCRIPT & PRICE
• Be wary of the side effects of a JavaScript-heavy site on crawling
• If you do implement a JavaScript-heavy site, meta tags with structured information are even more important!
• Be wary when converting the price based on geolocation
• Don't perform the price conversion in JavaScript
25.
ACTION POINT #4 - ANTI-CRAWL & ROBOTS.TXT
• Ask yourselves what the benefit of an anti-crawl mechanism is
• Ask yourselves what the benefit of blocking all crawlers in robots.txt is
• Control the speed of crawlers using crawl-delay
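One way to review what a robots.txt file actually tells crawlers is to parse it the way a crawler would. The sketch below uses Python's standard `urllib.robotparser` to report the crawl delay, whether the whole site is blocked for a given user agent, and which sitemaps are listed. Both robots.txt samples, the `crawl_policy` helper and the `shop.example` URL are assumptions for illustration; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that throttles crawlers instead of blocking them.
ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 2
Disallow: /basket/
Sitemap: https://shop.example/sitemap.xml
"""

# Hypothetical generic entry that blocks every crawler from the whole site.
ROBOTS_BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def crawl_policy(robots_txt: str, user_agent: str) -> dict:
    """Summarise what robots_txt asks of user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "crawl_delay": parser.crawl_delay(user_agent),  # seconds, or None
        "blocks_everything": not parser.can_fetch(user_agent, "/"),
        "sitemaps": parser.site_maps() or [],
    }


if __name__ == "__main__":
    print(crawl_policy(ROBOTS_WITH_DELAY, "PriceBot"))
    print(crawl_policy(ROBOTS_BLOCK_ALL, "PriceBot"))
```

The first sample slows crawlers down while still listing a sitemap; the second is the kind of generic entry that blocks a crawler whether or not that was the intent.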
26.
ACTION POINT #5 - HAVE A SITEMAP MEETING
• Have a sitemap strategy; it's just as important as your SEO strategy
• Sitemaps contribute massively to discoverability, yet are often overlooked
• Make sure you are doing everything you can to provide structured information
• Review your robots.txt contents
• Address missed opportunities from your sitemap sooner rather than later
27.
THANKS FOR LISTENING!
PriceBot
http://www.pricesearcher.com/pricebot
Keen to hear from you with feedback about PriceBot or Pricesearcher in general.
Feel free to drop me a line at vlassios@pricesearcher.com or catch up with me at our stand B11 in the expo hall.
Editor's Notes
Unintentional blocks:
• Crawl-delay is so high that it would take weeks to crawl a single site
• All user-agents are blocked in robots.txt
• An automated anti-crawl system kicks in and starts serving 403s