Dynamic Sitemaps
Blacklight Virtual Summit
May 8, 2020
Charlie Morris
Lead Web Developer
Penn State University Libraries
Libraries Strategic Technologies
Discovery, Access and Web Services
Context on PSU Libraries
• Blacklight catalog (project name “BlackCat”) in Beta until Fall
• Vendor-provided search interface remains the primary catalog
product for the Libraries
• 7.5+ million records
• Solr 7.4, running in cloud mode, Blacklight 7+, Traject for ETL
• 100,000+ students across the commonwealth and around the world
Letting the bots in
• Initially disallowed all bots in robots.txt
• As part of a phased release, closer to the stable release, we invited
the bots in
November 5, 2019
Prior to sitemap,
removed deny all for robots
How do people find you?
• Probably through a search engine.
• Probably Google.
• This is not a revelation.
• Search engines like sitemaps, which are especially critical for a site
made up entirely of dynamic links
A critical feature that is low-hanging fruit
• Let users find content in channels they trust and use on a daily basis
(not defending these search engines, more that they are the critical
path for users)
• Why not compete with Amazon? Could save patrons some money
and increase use of library resources
• This isn’t a new revelation, of course; it’s more like “low-hanging
fruit”
• Note: no sitemap option in core Blacklight
The challenge of sitemaps on a large
repository
• < 50,000 URLs per sitemap file (the sitemap protocol's limit)
• Solely dynamic links
Prior work
• Static sitemap generators
• https://github.com/jronallo/blacklight-sitemap
• https://github.com/kjvarga/sitemap_generator
• Operate by a scheduled task generating static files
A different approach: dynamic sitemaps
• Jack Reed of Stanford University Libraries and others created a proof
of concept (POC)
• Live query Solr for sitemap data
• Use a Rails controller to dictate what is displayed
• Use a Rails view to control the sitemap template
• Penn State University Libraries’ PR for the work:
• https://github.com/psu-libraries/psulib_blacklight/pull/511
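To make the controller-and-view idea concrete, here is a minimal sketch in plain Ruby (no Rails) of what the view's job amounts to: turning documents returned by Solr into a sitemap `<urlset>`. The helper name `render_urlset`, the `BASE_URL` host, and the `/catalog/:id` path are assumptions for illustration and are not taken from the PR linked above.

```ruby
# Hypothetical host; replace with the catalog's real base URL.
BASE_URL = 'https://catalog.example.edu'

# Render a sitemap <urlset> from Solr documents that carry an id and a
# timestamp. The /catalog/:id path is an assumption about the record URL.
def render_urlset(docs, base_url: BASE_URL)
  entries = docs.map do |doc|
    # String#encode(xml: :text) escapes &, < and > for use in XML text nodes.
    loc = "#{base_url}/catalog/#{doc.fetch('id')}".encode(xml: :text)
    lastmod = doc.fetch('timestamp').encode(xml: :text)
    "  <url><loc>#{loc}</loc><lastmod>#{lastmod}</lastmod></url>"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{entries.join("\n")}
    </urlset>
  XML
end
```

In a Rails application the same structure would more likely live in an XML view template, with the controller supplying the documents from a live Solr query.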
The Query Recipe
• Necessary piece: a unique base 16 (hexadecimal) encoded hash for
each record indexed in Solr (call it the “signature”)
• lucene as the query parser
• Query parameter for “the signature starts with…” (q)
• Return the id and timestamp fields (fl)
• Make sure Solr isn’t attempting to calculate facets (facet)
• Specify a large number (rows) to prevent paging
More on query parameters from the Solr RefGuide
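The recipe above can be sketched as a small Ruby method that builds the Solr request parameters for one leaf. The field name `hashed_id_si` comes from the signature configuration in this deck; the `timestamp` field name, the method name `leaf_query_params`, and the `rows` value are assumptions for illustration.

```ruby
SIGNATURE_FIELD = 'hashed_id_si'
# Assumption: cap rows at the sitemap protocol's 50,000-URL limit per file.
MAX_ROWS = 50_000

# Build Solr select parameters for every document whose signature starts
# with the given hexadecimal prefix.
def leaf_query_params(prefix)
  raise ArgumentError, 'prefix must be hexadecimal' unless prefix.match?(/\A[0-9a-f]+\z/i)

  {
    'defType' => 'lucene',                      # lucene query parser
    'q' => "#{SIGNATURE_FIELD}:#{prefix}*",     # "signature starts with..."
    'fl' => 'id,timestamp',                     # only the fields the sitemap needs
    'facet' => 'false',                         # skip facet calculation
    'rows' => MAX_ROWS.to_s                     # large enough to avoid paging
  }
end
```

Validating the prefix before interpolating it into `q` keeps arbitrary URL input out of the query string, which matters because the prefix arrives from a public route such as /sitemap/0.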
Making the signature with Solr
<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="add_hash_id">
  <bool name="enabled">true</bool>
  <str name="signatureField">hashed_id_si</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">id</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
More on this from the Solr RefGuide
Add this to your updateRequestProcessorChain in solrconfig.xml
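Why a hashed id helps: a hex-encoded hash spreads records roughly evenly across its leading characters, so "starts with" queries split the index into similarly sized groups. The sketch below illustrates this in Ruby. It uses MD5 purely as a stand-in, since Solr's Lookup3Signature is not part of Ruby's standard library; the method names and the sample ids are hypothetical.

```ruby
require 'digest/md5'

# Stand-in for the Solr-generated signature (MD5 here, NOT Lookup3).
def demo_signature(id)
  Digest::MD5.hexdigest(id.to_s)
end

# Count how many ids fall under each hex prefix of the given width.
def bucket_counts(ids, width: 1)
  ids.group_by { |id| demo_signature(id)[0, width] }
     .transform_values(&:size)
end

# 16,000 made-up ids split across the 16 one-character prefixes.
counts = bucket_counts((1..16_000).map(&:to_s))
counts.sort.each { |prefix, n| puts "#{prefix}: #{n}" }
```

Each prefix should end up with roughly 1,000 of the sample ids; the exact counts depend on the hash function.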
“Signature starts with” for “Dynamic leaves”
• Depending on the size of the index, create links to Solr queries that
start with every combination of hexadecimal values for X placeholders
• Example: 0 to F for one placeholder = 16 “leaves”
• GET /sitemap: shows a list of 16 links to leaves like /sitemap/0
• GET /sitemap/0: a sitemap with every document that has a signature that
starts with 0
PSU Libraries Example: 4096 leaves
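The leaf arithmetic can be sketched in Ruby as follows. The method names are hypothetical, lowercase hex digits are an assumption about how the signature is stored, and the per-leaf average assumes signatures are spread perfectly evenly, which is an idealization.

```ruby
HEX_DIGITS = %w[0 1 2 3 4 5 6 7 8 9 a b c d e f].freeze

# Every hexadecimal prefix of the given length:
# 1 placeholder -> 16 leaves, 3 placeholders -> 4096 leaves.
def leaf_prefixes(placeholders)
  HEX_DIGITS.repeated_permutation(placeholders).map(&:join)
end

# Average documents per leaf under an even spread, rounded up.
def average_leaf_size(total_docs, placeholders)
  (total_docs.to_f / 16**placeholders).ceil
end

(1..3).each do |x|
  puts "#{x} placeholder(s): #{leaf_prefixes(x).size} leaves, " \
       "~#{average_leaf_size(7_500_000, x)} docs per leaf"
end
```

For 7.5 million records, one placeholder averages 468,750 documents per leaf, far above the 50,000-URL limit, while three placeholders (4096 leaves) average about 1,832.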
Update robots.txt
Crawl-delay: 10
Sitemap: http://catalog.libraries.psu.edu/sitemap.xml
Early Returns
Slow growth…
But hey…
More on slow growth
Google has known about 7+ million documents since November, but
growth is about 10,000 items per month. At this rate it will take about
62 years for Google to finish.
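The 62-year figure follows from simple division. A short Ruby sketch, assuming the 7.5 million record count from the context slide and a constant indexing rate (the method name is hypothetical):

```ruby
# Years needed to index everything at a constant monthly rate.
def years_to_index(total_docs, docs_per_month)
  months = total_docs / docs_per_month.to_f
  months / 12
end

puts years_to_index(7_500_000, 10_000) # 750 months => 62.5 years
```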
Light analysis
• About 20-50 visits a day
• 4,967 visits since launching in late November
• 4.4% of all traffic
• Screenshot below is daily visits over time from search engines via
Matomo Analytics (hey it used to be zero!)
Lessons learned
Google is mysterious
• Slow growth despite the fact that they know about all records
• Search of site:catalog.libraries.psu.edu still only shows a few
thousand records despite Google’s dashboard reporting over 30
thousand
Bing is problematic
• Bing needed to be throttled: it hit us hard enough to cause DoS-like
behavior (thankful to have Sematext Performance Monitoring to
tattle on Bing)
• Used Bing webmaster tools to gain finer control over when the bot is
allowed to visit and how often
• Also set crawl delay to 10 in robots.txt (Google ignores this because
it’s smart enough to not DoS you)
• Not sure which of the above two factors solved the issue
Future
• Keep watching growth in Google Search Console
• Keep monitoring Matomo Analytics
• Talk with others about their experiences getting their repositories
indexed by Google and other search engines
Questions?
Incomplete gem: https://rubygems.org/gems/blacklight-sitemaps
Email cdm32@psu.edu
Twitter @cdmo
GitHub @cdmo
