Dynamic Sitemaps
Blacklight Virtual Summit
May 8, 2020
Charlie Morris
Lead Web Developer
Penn State University Libraries
Libraries Strategic Technologies
Discovery, Access and Web Services
Context on PSU Libraries
• Blacklight catalog (project name “BlackCat”) in Beta until Fall
• A vendor-provided search interface remains the primary catalog product for the Libraries
• 7.5+ million records
• Solr 7.4, running in cloud mode, Blacklight 7+, Traject for ETL
• 100,000+ students across the commonwealth and around the world
Letting the bots in
• Initially disallowed all bots in robots.txt with a deny-all rule (example below)
• As the phased release got closer to a stable release, we invited the bots in
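A deny-all robots.txt of the kind described above typically looks like this (a generic example, not necessarily PSU's exact file):

User-agent: *
Disallow: /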
November 5, 2019
Prior to the sitemap, the deny-all rule for robots was removed
How do people find you?
• Probably through a search engine.
• Probably Google.
• This is not a revelation.
• Search engines like sitemaps; they are especially critical for a site made up entirely of dynamic links
A critical feature that is low-hanging fruit
• Let users find content in channels they trust and use on a daily basis
(not defending these search engines, more that they are the critical
path for users)
• Why not compete with Amazon? Could save patrons some money
and increase use of library resources
• This isn’t a new revelation, of course; it’s more like “low-hanging fruit”
• Note: no sitemap option in core Blacklight
The challenge of sitemaps on a large repository
• A single sitemap file is limited to 50,000 URLs, far fewer than our 7.5+ million records
• Solely dynamic links
Prior work
• Static sitemap generators
• https://github.com/jronallo/blacklight-sitemap
• https://github.com/kjvarga/sitemap_generator
• These operate via a scheduled task that generates static files (usage sketched below)
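For contrast with the dynamic approach that follows, the static route is roughly a configuration like this sketch for the sitemap_generator gem (the host, paths, and the document_ids helper are illustrative assumptions), regenerated on a schedule:

# config/sitemap.rb for the sitemap_generator gem (illustrative host and paths)
SitemapGenerator::Sitemap.default_host = 'https://catalog.example.edu'
SitemapGenerator::Sitemap.create do
  # every record URL is enumerated up front and written out as static XML files
  document_ids.each { |id| add "/catalog/#{id}" }   # document_ids: hypothetical helper
end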
A different approach: dynamic sitemaps
• Jack Reed of Stanford University Libraries and others created a POC
• Live-query Solr for the sitemap data
• Use a Rails controller to dictate what is displayed
• Use a Rails view to control the sitemap template (both sketched below)
• Penn State University Libraries’ PR for the work:
• https://github.com/psu-libraries/psulib_blacklight/pull/511
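A minimal sketch of that controller/view split, using hypothetical names (SitemapController, the sitemap routes, the signature_query helper) and assuming Blacklight's default catalog routes for solr_document_url; it is not the exact code from the PR above:

# config/routes.rb (inside the draw block): the sitemap index plus one route per “leaf”
get 'sitemap', to: 'sitemap#index', defaults: { format: 'xml' }
get 'sitemap/:id', to: 'sitemap#show', defaults: { format: 'xml' }

# app/controllers/sitemap_controller.rb: the controller dictates what is displayed
class SitemapController < ApplicationController
  def index
    # one entry per hexadecimal prefix; see the “dynamic leaves” slide
    @prefixes = (0..15).map { |n| n.to_s(16) }
  end

  def show
    # live Solr query for every record whose signature starts with params[:id];
    # the parameters are the “query recipe” on the next slide (signature_query is hypothetical)
    @docs = signature_query(params[:id])
  end
end

# app/views/sitemap/show.xml.builder: the view controls the sitemap template
xml.instruct!
xml.urlset(xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9') do
  @docs.each do |doc|
    xml.url do
      xml.loc solr_document_url(doc['id'])
      xml.lastmod doc['timestamp']
    end
  end
end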
The Query Recipe
• Necessary piece: a unique base 16 (hexadecimal) encoded hash for
each record indexed in Solr (call it the “signature”)
• lucene as the query parser
• Query parameter for “the signature starts with…” (q)
• Return the id and timestamp fields (fl)
• Make sure Solr isn’t attempting to calculate facets (facet)
• Specify a large number (rows) to prevent paging
More on query parameters from the Solr RefGuide; a sketch of the full query follows
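A minimal sketch of that recipe against Blacklight's RSolr connection, assuming the hashed_id_si signature field configured on the next slide; the “0” prefix and the row count are illustrative:

solr = Blacklight.default_index.connection   # the RSolr client Blacklight already holds

response = solr.get('select', params: {
  defType: 'lucene',       # lucene as the query parser
  q: 'hashed_id_si:0*',    # “the signature starts with 0”
  fl: 'id,timestamp',      # return only the id and timestamp fields
  facet: false,            # make sure Solr isn’t calculating facets
  rows: 10_000_000         # large enough that we never have to page
})

docs = response['response']['docs']   # each doc becomes one <url> entry in the sitemap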
Making the signature with Solr
<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory"
                 name="add_hash_id">
  <bool name="enabled">true</bool>
  <!-- field that stores the hexadecimal signature for each record -->
  <str name="signatureField">hashed_id_si</str>
  <!-- do not delete "duplicate" documents; we only want the hash added -->
  <bool name="overwriteDupes">false</bool>
  <!-- compute the signature from the id field alone -->
  <str name="fields">id</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
More on this from the Solr RefGuide
Add this to your UpdateProcessorChain
“Signature starts with” for “Dynamic leaves”
• Depending on the size of the index, create links to queries covering every combination of hexadecimal values for X placeholder characters (see the sketch below)
• Example: 0 to F for one placeholder = 16 “leaves”
• GET /sitemap: shows a list of 16 links to leaves like /sitemap/0
• GET /sitemap/0: a sitemap with every document whose signature starts with 0
PSU Libraries Example: 4,096 leaves (three placeholders, since 16³ = 4,096)
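A sketch of how those hexadecimal prefixes could be enumerated in Ruby (sitemap_prefixes is a hypothetical helper name; a depth of 3 gives the 4,096 leaves used at PSU):

HEX = %w[0 1 2 3 4 5 6 7 8 9 a b c d e f].freeze

# Every prefix of the given length: depth 1 => 16 leaves, depth 3 => 4096 leaves
def sitemap_prefixes(depth)
  HEX.repeated_permutation(depth).map(&:join)
end

sitemap_prefixes(1)        # => ["0", "1", ..., "f"]
sitemap_prefixes(3).size   # => 4096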
Update robots.txt
Crawl-delay: 10
Sitemap: http://catalog.libraries.psu.edu/sitemap.xml
Early Returns
Slow growth…
But hey…
More on slow growth
Google has known about 7+ million documents since November, but indexing grows by only about 10,000 items per month; at this rate it will take 62 years for Google to finish up
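For the arithmetic behind that estimate: 7.5 million records ÷ 10,000 indexed per month ≈ 750 months, or roughly 62.5 years.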
Light analysis
• About 20-50 visits a day
• 4,967 visits since launching it in late November
• 4.4% of all traffic
• Screenshot below shows daily visits over time from search engines via Matomo Analytics (hey, it used to be zero!)
Lessons learned
Google is mysterious
• Slow growth despite the fact that Google knows about all of the records
• A search for site:catalog.libraries.psu.edu still shows only a few thousand records, despite Google’s dashboard reporting over 30 thousand
Bing is problematic
• Bing needed to be throttled; it hit us very hard, to the point of DoS-like behavior (thankful to have Sematext Performance Monitoring to tattle on Bing)
• Used Bing Webmaster Tools to gain finer control over when the bot is allowed to visit and how often
• Also set Crawl-delay to 10 in robots.txt (Google ignores this because it’s smart enough not to DoS you)
• Not sure which of the above two factors solved the issue
Future
• Keep watching growth in Google Search Console
• Keep monitoring Matomo Analytics
• Talk with others about their experiences in trying to get their repositories indexed by Google and other search engines
Questions?
Incomplete gem: https://rubygems.org/gems/blacklight-sitemaps
Email cdm32@psu.edu
Twitter @cdmo
GitHub @cdmo
