Dynamic Sitemaps
Blacklight Virtual Summit
May 8, 2020
Charlie Morris
Lead Web Developer
Penn State University Libraries
Libraries Strategic Technologies
Discovery, Access and Web Services
Context on PSU Libraries
• Blacklight catalog (project name “BlackCat”) in Beta until Fall
• A vendor-provided search interface remains the primary catalog product for the Libraries
• 7.5+ million records
• Solr 7.4, running in cloud mode, Blacklight 7+, Traject for ETL
• 100,000+ students across the commonwealth and around the world
Letting the bots in
• Initially disallowed all bots in robots.txt with a deny-all rule (example below)
• As the phased release got closer to a stable release, we invited the bots in
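A deny-all robots.txt of the kind described above typically looks like this (a generic example, not necessarily PSU's exact file):

User-agent: *
Disallow: /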
November 5, 2019
Prior to the sitemap, the deny-all rule for robots was removed
How do people find you?
• Probably through a search engine.
• Probably Google.
• This is not a revelation.
• Search engines like sitemaps; they are especially critical for a site made up entirely of dynamic links
A critical feature that is low-hanging fruit
• Let users find content in channels they trust and use on a daily basis
(not defending these search engines, more that they are the critical
path for users)
• Why not compete with Amazon? Could save patrons some money
and increase use of library resources
• This isn’t a new revelation, of course; it’s more like “low-hanging fruit”
• Note: no sitemap option in core Blacklight
The challenge of sitemaps on a large repository
• A single sitemap file is limited to 50,000 URLs, far fewer than our 7.5+ million records
• Solely dynamic links
Prior work
• Static sitemap generators
• https://github.com/jronallo/blacklight-sitemap
• https://github.com/kjvarga/sitemap_generator
• These operate via a scheduled task that generates static files (usage sketched below)
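For contrast with the dynamic approach that follows, the static route is roughly a configuration like this sketch for the sitemap_generator gem (the host, paths, and the document_ids helper are illustrative assumptions), regenerated on a schedule:

# config/sitemap.rb for the sitemap_generator gem (illustrative host and paths)
SitemapGenerator::Sitemap.default_host = 'https://catalog.example.edu'
SitemapGenerator::Sitemap.create do
  # every record URL is enumerated up front and written out as static XML files
  document_ids.each { |id| add "/catalog/#{id}" }   # document_ids: hypothetical helper
end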
A different approach: dynamic sitemaps
• Jack Reed of Stanford University Libraries and others created a POC
• Live-query Solr for the sitemap data
• Use a Rails controller to dictate what is displayed
• Use a Rails view to control the sitemap template (both sketched below)
• Penn State University Libraries’ PR for the work:
• https://github.com/psu-libraries/psulib_blacklight/pull/511
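A minimal sketch of that controller/view split, using hypothetical names (SitemapController, the sitemap routes, the signature_query helper) and assuming Blacklight's default catalog routes for solr_document_url; it is not the exact code from the PR above:

# config/routes.rb (inside the draw block): the sitemap index plus one route per “leaf”
get 'sitemap', to: 'sitemap#index', defaults: { format: 'xml' }
get 'sitemap/:id', to: 'sitemap#show', defaults: { format: 'xml' }

# app/controllers/sitemap_controller.rb: the controller dictates what is displayed
class SitemapController < ApplicationController
  def index
    # one entry per hexadecimal prefix; see the “dynamic leaves” slide
    @prefixes = (0..15).map { |n| n.to_s(16) }
  end

  def show
    # live Solr query for every record whose signature starts with params[:id];
    # the parameters are the “query recipe” on the next slide (signature_query is hypothetical)
    @docs = signature_query(params[:id])
  end
end

# app/views/sitemap/show.xml.builder: the view controls the sitemap template
xml.instruct!
xml.urlset(xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9') do
  @docs.each do |doc|
    xml.url do
      xml.loc solr_document_url(doc['id'])
      xml.lastmod doc['timestamp']
    end
  end
end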
The Query Recipe
• Necessary piece: a unique base 16 (hexadecimal) encoded hash for
each record indexed in Solr (call it the “signature”)
• lucene as the query parser
• Query parameter for “the signature starts with…” (q)
• Return the id and timestamp fields (fl)
• Make sure Solr isn’t attempting to calculate facets (facet)
• Specify a large number (rows) to prevent paging
More on query parameters from the Solr RefGuide; a sketch of the full query follows
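A minimal sketch of that recipe against Blacklight's RSolr connection, assuming the hashed_id_si signature field configured on the next slide; the “0” prefix and the row count are illustrative:

solr = Blacklight.default_index.connection   # the RSolr client Blacklight already holds

response = solr.get('select', params: {
  defType: 'lucene',       # lucene as the query parser
  q: 'hashed_id_si:0*',    # “the signature starts with 0”
  fl: 'id,timestamp',      # return only the id and timestamp fields
  facet: false,            # make sure Solr isn’t calculating facets
  rows: 10_000_000         # large enough that we never have to page
})

docs = response['response']['docs']   # each doc becomes one <url> entry in the sitemap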
Making the signature with Solr
<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory"
                 name="add_hash_id">
  <bool name="enabled">true</bool>
  <!-- field that stores the hexadecimal signature for each record -->
  <str name="signatureField">hashed_id_si</str>
  <!-- do not delete "duplicate" documents; we only want the hash added -->
  <bool name="overwriteDupes">false</bool>
  <!-- compute the signature from the id field alone -->
  <str name="fields">id</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
More on this from the Solr RefGuide
Add this to your UpdateProcessorChain
“Signature starts with” for “Dynamic leaves”
• Depending on the size of the index, create links to queries covering every combination of hexadecimal values for X placeholder characters (see the sketch below)
• Example: 0 to F for one placeholder = 16 “leaves”
• GET /sitemap: shows a list of 16 links to leaves like /sitemap/0
• GET /sitemap/0: a sitemap with every document whose signature starts with 0
PSU Libraries Example: 4,096 leaves (three placeholders, since 16³ = 4,096)
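A sketch of how those hexadecimal prefixes could be enumerated in Ruby (sitemap_prefixes is a hypothetical helper name; a depth of 3 gives the 4,096 leaves used at PSU):

HEX = %w[0 1 2 3 4 5 6 7 8 9 a b c d e f].freeze

# Every prefix of the given length: depth 1 => 16 leaves, depth 3 => 4096 leaves
def sitemap_prefixes(depth)
  HEX.repeated_permutation(depth).map(&:join)
end

sitemap_prefixes(1)        # => ["0", "1", ..., "f"]
sitemap_prefixes(3).size   # => 4096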
Update robots.txt
Crawl-delay: 10
Sitemap: http://catalog.libraries.psu.edu/sitemap.xml
Early Returns
Slow growth…
But hey…
More on slow growth
Google has known about 7+ million documents since November, but indexing grows by only about 10,000 items per month; at this rate it will take 62 years for Google to finish up
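For the arithmetic behind that estimate: 7.5 million records ÷ 10,000 indexed per month ≈ 750 months, or roughly 62.5 years.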
Light analysis
• About 20-50 visits a day
• 4,967 visits since launching it in late November
• 4.4% of all traffic
• Screenshot below shows daily visits over time from search engines via Matomo Analytics (hey, it used to be zero!)
Lessons learned
Google is mysterious
• Slow growth despite the fact that Google knows about all of the records
• A search for site:catalog.libraries.psu.edu still shows only a few thousand records, despite Google’s dashboard reporting over 30 thousand
Bing is problematic
• Bing needed to be throttled; it hit us very hard, to the point of DoS-like behavior (thankful to have Sematext Performance Monitoring to tattle on Bing)
• Used Bing Webmaster Tools to gain finer control over when the bot is allowed to visit and how often
• Also set Crawl-delay to 10 in robots.txt (Google ignores this because it’s smart enough not to DoS you)
• Not sure which of the above two factors solved the issue
Future
• Keep watching growth in Google Search Console
• Keep monitoring Matomo Analytics
• Talk with others about their experiences in trying to get their repositories indexed by Google and other search engines
Questions?
Incomplete gem: https://rubygems.org/gems/blacklight-sitemaps
Email cdm32@psu.edu
Twitter @cdmo
GitHub @cdmo
