KohaCon12 – Edinburgh, June 5th, 2012


                    Adding browse to Koha using Solr
                                          Stefano Bargioni
                              Pontifical University Santa Croce – Rome

Slide 1
It's very exciting for me to take part in the Koha Conference for the first time. Thanks a lot to the
Community for everything I have learnt during these days.

Slide The PUSC Library
Basic data about my library are summarized in this slide. We are very young, since my university was
founded only 26 years ago. It was inspired by Saint Josemaría Escrivá, founder of Opus Dei.
Twenty years ago we participated in the foundation of a consortium, URBE, the Roman Union of
Ecclesiastical Libraries.

Slide Why we need browse at PUSC?
The idea of alphabetically sorted lists of headings (authors, titles, series, subjects and so on) is
implemented in some LMSs as another kind of search. We think it is not a “must”, thanks to the
power of simple and advanced searches. However, our users and the nature of our data led us to
add it to Koha.
Since we moved to Koha, the quality of our catalog has increased markedly: we added full authority
records (before, we had only cross-references), and we started introducing subject headings. This is
why we are interested in browsing headings that come from authority records as well as from
bibliographic records.

Slide How do you say?
Ancient authors, Popes, institutions, and other kinds of authors, partly because of the cataloguing
rules adopted by the library, create the need to help users and cataloguers choose the correct form
when searching the catalog.
In the Virtual International Authority File, Dante Alighieri, who wrote the famous Divine Comedy,
has hundreds of variant forms. Which is the chosen form in your library?

Slide Grouping
Clustering and counting headings is another reason to use browse: it is interesting for managing and
searching series, looking at your catalog using Dewey, and so on.

Slide Browse Functionalities
What can you ask of a browse tool? Basically, to navigate alphabetically sorted lists. So you
will need to extract headings from your catalog, build a sort form, and add information such as,
first of all, the usage count.
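The navigation described above can be sketched in a few lines. This is only an illustration of the idea, not the tool presented here: the function name `browse`, the tuple layout and the sample headings with their counts are invented for the example. It assumes the headings are already sorted by sort form and uses a binary search to find the "starting from" position.

```python
import bisect

def browse(headings, start_from, rows=5):
    """Return `rows` headings beginning at the first sort form >= start_from.

    `headings` is a list of (sort_form, display_form, usage_count) tuples,
    already sorted by sort_form. The layout is an assumption for this sketch.
    """
    keys = [h[0] for h in headings]
    first = bisect.bisect_left(keys, start_from.lower())
    return headings[first:first + rows]

# Invented sample data: the usage counts are placeholders, not real figures.
headings = sorted([
    ("agostino santo", "Agostino, santo", 12),
    ("alighieri dante", "Alighieri, Dante", 42),
    ("ambrogio santo", "Ambrogio, santo", 7),
    ("bible", "The Bible", 3),
])

page = browse(headings, "Ali", rows=2)
```

Here `page` starts at the Alighieri heading and continues in alphabetical order, which is the behaviour a "starting from" text field suggests.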




Slide Browse requirements
We tried to write a utility with the following requirements. The most important, maybe, is the ability
to include in the same list headings coming from different tags, from authority or bibliographic
records.
If I'm not mistaken, the implementation is independent of the MARC flavour.

Slide The engine
We tried using Zebra, but it is very difficult for me to configure.
We considered MySQL, but SQL DBMSs do not perform well when asked to extract a small subset of
sorted records from a very large set of headings.
Solr was our choice as the search engine, due to its ability to work with facets. Its future
integration in Koha could also be a win-win for the browse.

Slide The Solr document (1)
Solr works using the document as a metaphor. Every heading we want to include in a list becomes a
Solr document.
In the Solr schema we defined some fields that we are going to discuss in the next few slides.
The most important field is the ID. Since we can have identical sort forms under the same list, we
cannot use the sort form in the ID. For example, we need to distinguish the title The Bible from the
title Bible, even if their sort form is the same because the non-filing characters strip out the
initial article.
The ID is, of course, what Solr uses to delete or replace a document. It is discussed in detail
later.
Every document belongs to a list and comes from an authority or a bibliographic record, from a tag
and from an occurrence of that tag. It also has a type: it can be a main heading, a see from, a see
also, and so on.
There is no point in storing in the Solr document which subfields were used to build the heading.
Often every subfield is extracted, but in other cases we need only some of them. The
configuration file reflects this.

Slide The Solr document (2)
Here is an example of a Solr document for the main author Dante Alighieri. Please note its ID.

Slide The Solr document (3)
And this is an example of a Solr document for a title. Titles other than uniform titles do not come
from authority records. They always have type 'acc', that is 'main'. Also note the ID.

Slide The Solr document (4)
The ID has a complex structure: we built it by concatenating the list name, “a” for authority or
“b” for bibliographic, the authid or the biblionumber, the tag, and the zero-based occurrence number.
We think this is a unique identifier. If it is not, only the last heading entered in Solr with a given
ID will survive, leading to a silent error.
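A minimal sketch of such an ID, and of why a collision is silent, may help. The function name `build_id`, the `-` separator and the sample record numbers are assumptions made for the example; the talk does not say which separator is used. The dictionary stands in for the way a unique key behaves: adding a document with an existing ID replaces the earlier one without any warning.

```python
def build_id(list_name, source, record_id, tag, occurrence, sep="-"):
    """Concatenate the five components of a heading ID.

    source is "a" (authority) or "b" (bibliographic); occurrence is zero-based.
    The separator is an assumption of this sketch.
    """
    if source not in ("a", "b"):
        raise ValueError("source must be 'a' or 'b'")
    return sep.join([list_name, source, str(record_id), tag, str(occurrence)])

# Hypothetical record numbers, for illustration only.
first_id = build_id("authors", "a", 1234, "100", 0)
second_id = build_id("authors", "a", 1234, "400", 0)
third_id = build_id("authors", "a", 1234, "400", 1)

# A dict keyed by ID mimics "last document with the same ID wins".
index = {}
index[first_id] = {"heading": "Alighieri, Dante"}
index[first_id] = {"heading": "Dante Alighieri"}  # silently replaces the first
```

Because the tag and the occurrence number are part of the ID, two see-from headings of the same authority record get different IDs, while a badly built ID would simply overwrite a heading.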

Slide The Solr document (5)
This screen shows the algorithm we use to build the sort form.
Maybe there is a better way to generate sort forms, taking into account that Koha is used in many
languages and in the same catalog there can be more than one script. Is International Components
for Unicode, aka ICU, the solution? I'm not so experienced... sorry.
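The algorithm on the slide is not reproduced in this text, so the following is only one possible way to build a sort form, not the one shown on screen. It assumes a simple recipe: drop the non-filing characters, remove diacritics, fold case, and replace punctuation with spaces. The function name `sort_form` and its `skip` parameter are invented for the example, and this approach does not address catalogs with several scripts, which is where ICU collation keys could come in.

```python
import re
import unicodedata

def sort_form(heading, skip=0):
    """Build a simplified sort form (an assumed recipe, not the author's)."""
    text = heading[skip:]                       # drop non-filing characters
    text = unicodedata.normalize("NFKD", text)  # split letters from accents
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.casefold()                      # case-insensitive comparison
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation becomes a space
    return " ".join(text.split())               # collapse repeated spaces

with_article = sort_form("The Bible", skip=4)
without_article = sort_form("Bible")
accented = sort_form("Escrivá, Josemaría")
```

With this recipe The Bible (skipping four characters) and Bible produce the same sort form, which is exactly why the sort form cannot be used as the ID.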

Slide Architecture
The architecture is simple: a Solr db is updated with new or modified Koha records.
At the same time, users access the Solr db through the web and a Perl CGI.

Slide Loading & Synchronizing (1)
An important component of browse is the loader. We wrote it in Perl, so that it can run both as the
initial bulk loader and as the updater.
It reads the Koha SQL tables and adds or updates Solr documents.
Our experience with Solr taught us to issue commit and optimize commands at regular intervals,
to limit memory consumption and keep loading fast. The right intervals depend on
the server running Solr.
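The idea of committing at regular intervals can be sketched independently of any Solr client. In the sketch below, `add` and `commit` are placeholder callables standing in for whatever client library is used, and `commit_every` is a hypothetical tuning parameter; the loader described here is written in Perl and may work differently. An optimize step could be scheduled in the same way.

```python
def load(docs, add, commit, commit_every=1000):
    """Send documents one by one and commit every `commit_every` documents.

    `add` and `commit` are placeholders for the client calls actually used.
    Returns the number of documents sent.
    """
    pending = 0
    total = 0
    for doc in docs:
        add(doc)
        pending += 1
        total += 1
        if pending >= commit_every:
            commit()
            pending = 0
    if pending:
        commit()  # make the last partial batch visible
    return total

# Stand-ins that only record what happened, so the sketch runs on its own.
added = []
commits = []
sent = load(
    ({"id": str(n)} for n in range(2500)),
    add=added.append,
    commit=lambda: commits.append(len(added)),
    commit_every=1000,
)
```

With 2500 documents and a batch size of 1000, the commits happen after 1000, 2000 and 2500 documents.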

Slide Loading & Synchronizing (2)
The configuration of the loader can be a large file. I chose XML but I know that the Koha developer
Community prefers YAML. Sorry.
It contains two main sections, one for tags coming from authority records, the other
for tags coming from bibliographic records.
Here are two examples: on the left side, MARC21 authority tag 400 is sent to the list of authors,
type see. Every subfield will be copied. Suffix ensures that the heading ends with the
specified string.
The example on the right side refers to MARC21 bibliographic tag 245, i.e. a title. The
skip_indicator gives the number of the indicator that holds the count of non-filing characters.
More preferences are available for each tag, like required_subfields and omit_subfields. They allow
tags to be processed at a finer level of detail.
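Since the slide itself is not reproduced here, the following is a guess at what such a configuration could look like and how it could be read. The element and attribute names (`browse`, `authority`, `bibliographic`, `tag`, `code`, `list`, `type`, `suffix`) and the suffix value are hypothetical; only `skip_indicator`, the tags 400 and 245, the lists and the types come from the talk. In MARC21 the non-filing count of tag 245 is held in the second indicator.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout, invented for this sketch.
CONFIG = """
<browse>
  <authority>
    <tag code="400" list="authors" type="see" suffix="."/>
  </authority>
  <bibliographic>
    <tag code="245" list="titles" type="acc" skip_indicator="2"/>
  </bibliographic>
</browse>
"""

def read_rules(xml_text):
    """Return one dict per configured tag, remembering which section it is in."""
    root = ET.fromstring(xml_text)
    rules = []
    for section in ("authority", "bibliographic"):
        for tag in root.findall(f"./{section}/tag"):
            rule = dict(tag.attrib)
            rule["source"] = section
            rules.append(rule)
    return rules

rules = read_rules(CONFIG)
```

Each rule then tells the loader which list a tag feeds, which type the heading gets, and how to treat suffixes or non-filing characters.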

Slide Loading & Synchronizing (3)
The Solr db also contains some special documents, whose type is “system”. Two timestamps register the
start and the end of the update process, while each list has a counter to monitor its usage.
Four MySQL tables are involved. One of them, deleted_auth_header, is new. Whenever an authority
record is deleted, a slightly modified C4::AuthoritiesMarc.pm logs the event in this table.
The synchronizing process runs as a cron job. We chose to run it once a minute. A lock file ensures
that only one instance is running at a time.
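A lock file of this kind can be sketched as follows. This is a generic pattern, not the code of the synchronizer: the function names and the file name `browse_sync.lock` are invented. It relies on the operating system refusing to create a file that already exists, so a second run started by cron while the first is still working simply gives up.

```python
import os
import tempfile

def acquire_lock(path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record who holds the lock
    return True

def release_lock(path):
    """Remove the lock file so that the next run can start."""
    os.remove(path)

# Demonstration in a temporary directory (hypothetical file name).
lock_path = os.path.join(tempfile.mkdtemp(), "browse_sync.lock")
first = acquire_lock(lock_path)    # the first run gets the lock
second = acquire_lock(lock_path)   # an overlapping run is refused
release_lock(lock_path)
third = acquire_lock(lock_path)    # the next run can start again
release_lock(lock_path)
```

One caveat of this simple pattern is that a crashed run leaves a stale lock file behind, so a real job would also need a way to detect and clear it.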

Slide Querying (1)
To access lists, we created a new page in Koha, with a link near the “Advanced Search”. The
screenshot shows the public lists, the “starting from” text field and the number of results per page
available.
This page is generated by a Perl CGI script.

Slide Querying (2)
When listing 5 authors starting from Alighieri, we obtain this result. Each heading can be clicked to
access related documents, whose count is the number in the 3rd column. See also and Used for
headings, if any, are listed in the 4th column.
The red link, available only for authors, starts a search on the rich VIAF catalog. Due to its
completeness, very often we obtain a successful result. Of course, more links could be added, for
instance to the Wikipedia Biography Portal.
The usage count is computed on the fly. It is not stored in the Solr db. For headings coming from
authorities, this ensures that clicking the author name shows the exact number of bibliographic
records even if the synchronization is not running.

Slide Querying (3)
When listing titles, the result page contains titles from many tags, including series titles, even if we
have a list with only series titles. To set apart series titles, we added a special gray label.
The usage count for headings that come from bibliographic records is computed with Solr facets. In
fact, there may be, for instance, seventeen Solr documents (see the last line) with the same sort form
in the titles list.
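To illustrate what faceting on the sort form gives, here is a small sketch. The field names `list` and `sort_form` are assumptions about the schema, and the sample documents are invented; the query parameters themselves (`facet`, `facet.field`, `facet.sort`, `facet.mincount`, `fq`, `rows`) are standard Solr parameters. The local counter only mimics, on a handful of documents, the counts that Solr would return.

```python
from collections import Counter
from urllib.parse import urlencode

def facet_params(list_name, field="sort_form"):
    """Query string asking Solr for one count per distinct value of `field`."""
    return urlencode({
        "q": "*:*",
        "fq": f"list:{list_name}",  # restrict to one browse list
        "rows": 0,                  # only the counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.sort": "index",      # alphabetical order instead of by count
        "facet.mincount": 1,
    })

def local_facet(docs, list_name, field="sort_form"):
    """Count documents per sort form, as a stand-in for the Solr response."""
    return Counter(d[field] for d in docs if d["list"] == list_name)

# Invented sample documents.
docs = [
    {"list": "titles", "sort_form": "bible"},
    {"list": "titles", "sort_form": "bible"},
    {"list": "titles", "sort_form": "divina commedia"},
    {"list": "authors", "sort_form": "alighieri dante"},
]
query = facet_params("titles")
counts = local_facet(docs, "titles")
```

Several documents with the same sort form thus collapse into a single browse line whose count is the number of documents.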

Slide Statistics
A special button for statistics is available. It shows fresh counts for each list, as well as the search
counts (not shown here). It is a good way to monitor the Solr browse db.

Slide Security
Solr interaction is driven by HTTP requests. In a standard installation, anybody could access the
documents. This is very dangerous.
There are many ways to solve this issue. We chose to manage security by setting a Jetty username and
password. Jetty is the application server included in the Solr standard distribution.
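Once a username and password are required, every client that talks to Solr has to send them. The sketch below shows only the client side, assuming HTTP Basic authentication; the URL, user name and password are placeholders, and the Jetty configuration itself is not shown. No request is actually sent.

```python
import base64
import urllib.request

def solr_request(url, username, password):
    """Prepare a request carrying HTTP Basic credentials (nothing is sent)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request

# Placeholder values, invented for this sketch.
request = solr_request("http://localhost:8983/solr/select?q=*:*", "browse", "secret")
auth_header = request.get_header("Authorization")
```

Basic credentials are only encoded, not encrypted, so this protects the Solr db from casual access but should be combined with network restrictions or HTTPS when the server is reachable from outside.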

Slide License and portability
This implementation of browse is open source, under the same license as Koha.
However, it is not published yet. It requires more work to become a standard Koha tool, since its
author is not a Koha developer but an abecedarian. I know that Claire Hernandez of BibLibre
has a lot of experience in Solr. I would be happy to share the source code with her.

Slide Grazie
Thank you very much to the Koha Community, now in Scotland!



