Adding browse to Koha using Solr
KohaCon12 – Edinburgh, June 5th, 2012
Stefano Bargioni
Pontifical University Santa Croce – Rome
Slide 1
It's very exciting for me to take part in the Koha Conference for the first time. Thanks a lot to the
Community for everything I have learnt during these days.
Slide The PUSC Library
Basic data about my library are summarized in this slide. We are very young, since my university was
founded only 26 years ago. It was inspired by Saint Josemaría Escrivá, founder of Opus Dei.
Twenty years ago we participated in the foundation of a consortium, URBE, the Roman Union of
Ecclesiastical Libraries.
Slide Why we need browse at PUSC?
The idea of alphabetically sorted lists of headings (authors, titles, series, subjects and so on) is
implemented in some LMS as another kind of search. We think it is not a “must”, thanks to the
power of simple and advanced searches. However, our users and the nature of our data led
us to add it to Koha.
Since we moved to Koha, the quality of our catalog has increased strongly: we added full authority
records (we had only cross-references), and we started introducing subject headings. This is why we
are interested in browsing headings, coming from authority records as well as bibliographic records.
Slide How do you say?
Ancient authors, Popes, institutions, and other kinds of authors can make it hard, partly because of
the cataloguing rules adopted by the library, for users and cataloguers to choose the
correct form for searching the catalog. They need help.
In the Virtual International Authority File, Dante Alighieri, who wrote the famous Divine Comedy,
has hundreds of variant forms. Which is the chosen form in your library?
Slide Grouping
Clustering and counting headings is another reason to use browse: it is interesting for managing and
searching series, looking at your catalog using Dewey, and so on.
Slide Browse Functionalities
What might you ask of a browse tool? Basically, to navigate alphabetically sorted lists. So you
will need to extract headings from your catalog, build a sort form, and add information such as, first of
all, the usage count.
Slide Browse requirements
We tried to write a utility with the following requirements. The most important, maybe, is the ability
to include in the same list headings coming from different tags, from authority or bibliographic
records.
If I'm not wrong, its implementation is independent of the MARC flavour.
Slide The engine
We tried using Zebra, but it is very difficult for me to configure.
We considered MySQL, but SQL DBMSs do not perform well when asked to extract a
small subset of sorted records from a very large set of headings.
Solr was our choice as the search engine, due to its ability to work with facets. And its future
integration in Koha could be a win-win for the browse.
Slide The Solr document (1)
Solr works using the document as a metaphor. Every heading we want to include in a list will
be a Solr document.
In the Solr schema we defined some fields that we are going to discuss in the next few slides.
The most important field is the ID. Since we can have identical sort forms under the same list, we
cannot use the sort form in the ID. For example, we need to distinguish title The Bible from title
Bible, even if their sort form is the same, due to the non-filing characters that strip out the initial
article.
The ID is of course the way used by Solr to delete or replace a document. It will be discussed in
detail afterwards.
Every document belongs to a list, it comes from authority or bibliographic records, from a tag and
from an occurrence of the tag. It also has a type: it can be a main heading, a see from, see also, and
so on.
There is no need to store in the Solr document which subfields were used to build the heading.
Often every subfield is extracted, but in other cases we need only some of them. The
configuration file reflects this.
Slide The Solr document (2)
Here is an example of a Solr document for the main author Dante Alighieri. Please note its ID.
Slide The Solr document (3)
And this is an example of a Solr document for a title. Titles other than uniform titles do not come from
authority records. They will always have type 'acc', that is 'main'. Also note the ID.
Slide The Solr document (4)
The ID has a complex structure: we built it by concatenating the list name, “a” for authority or
“b” for bibliographic, the authid or the biblionumber, the tag, and the zero-based occurrence number.
We think this is a unique identifier. If it is not, only the last heading entered in Solr with a given ID will
survive, leading to a silent error.
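As an illustration only, here is a minimal Python sketch of such an ID (the real loader is written in Perl). The function name, the "-" separator, and the sample values are assumptions; the slide only says that the parts are concatenated.

```python
def build_heading_id(list_name, source, record_id, tag, occurrence, sep="-"):
    """Concatenate the five parts of the ID described above.

    source is "a" (authority) or "b" (bibliographic); occurrence is zero-based.
    The separator is a hypothetical choice, not taken from the slides.
    """
    if source not in ("a", "b"):
        raise ValueError("source must be 'a' or 'b'")
    return sep.join([list_name, source, str(record_id), str(tag), str(occurrence)])


def index_documents(docs):
    """Mimic a unique-key index: a later document with the same ID
    silently replaces the earlier one."""
    index = {}
    for doc in docs:
        index[doc["id"]] = doc
    return index
```

For example, build_heading_id("authors", "a", 1234, "100", 0) gives "authors-a-1234-100-0". The second function shows why a non-unique ID would be a silent error: nothing fails, one heading simply disappears.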
Slide The Solr document (5)
This screen shows the algorithm we use to build the sort form.
Maybe there is a better way to generate sort forms, taking into account that Koha is used in many
languages and in the same catalog there can be more than one script. Is International Components
for Unicode, aka ICU, the solution? I'm not so experienced... sorry.
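The algorithm on the slide is not reproduced in these notes. The Python sketch below is therefore only one possible way to build a sort form, not the one we use: it assumes we drop the non-filing characters, remove diacritics, lowercase, and collapse punctuation to single spaces. It handles Latin script reasonably; other scripts would need something like ICU collation.

```python
import unicodedata


def sort_form(heading, nonfiling=0):
    """Return a normalized key for alphabetical sorting (illustrative only).

    nonfiling is the number of leading characters to skip, e.g. 4 for "The ".
    """
    text = heading[nonfiling:]
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unmarked = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, turn punctuation into spaces, collapse runs of spaces.
    cleaned = "".join(c if c.isalnum() else " " for c in unmarked.lower())
    return " ".join(cleaned.split())
```

With this sketch, sort_form("The Bible", 4) and sort_form("Bible") both give "bible", which is exactly the case where the ID, not the sort form, must tell the two headings apart.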
Slide Architecture
The architecture is simple: a Solr db is updated with new or modified Koha records.
At the same time, users access the Solr db through the web and a Perl CGI.
Slide Loading & Synchronizing (1)
An important component of browse is the loader. We wrote it in Perl, so that it can run both as the
initial bulk loader and as the updater.
It reads from the Koha SQL tables and adds or updates Solr documents.
Our experience with Solr taught us to issue commit and optimize commands on a regular basis,
to limit memory consumption and keep the load fast. How often to issue them depends on
the server running Solr.
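A minimal Python sketch of this pacing follows (the loader itself is in Perl). It only produces the JSON update commands in order; sending them to Solr is left out. The function name and the two thresholds are assumptions and would need tuning for each server.

```python
import json


def update_messages(docs, commit_every=1000, optimize_every=10000):
    """Yield Solr JSON update commands: one "add" per document, with a
    "commit" or "optimize" interleaved at regular intervals.

    The interval values are hypothetical defaults, not measured ones.
    """
    for count, doc in enumerate(docs, start=1):
        yield json.dumps({"add": {"doc": doc}})
        if count % optimize_every == 0:
            # In Solr an optimize also commits, so no separate commit here.
            yield json.dumps({"optimize": {}})
        elif count % commit_every == 0:
            yield json.dumps({"commit": {}})
    # Final commit so the last partial batch becomes visible.
    yield json.dumps({"commit": {}})
```

Each yielded string would then be posted to the Solr update handler by whatever HTTP client the loader uses.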
Slide Loading & Synchronizing (2)
The configuration of the loader can be a large file. I chose XML but I know that the Koha developer
Community prefers YAML. Sorry.
It contains two main sections: one gathers the tags coming from authority records, the other
the tags coming from bibliographic records.
Here are two examples: on the left side, MARC21 authority tag 400 is sent to the list of authors,
with type see. Every subfield will be copied. Suffix ensures that the heading ends with the
specified string.
The example on the right side refers to MARC21 bibliographic tag 245, i.e. a title. The
skip_indicator gives the number of the indicator that holds the count of non-filing characters.
More preferences are available for each tag, like required_subfields and omit_subfields. They allow
tags to be processed in finer detail.
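To make these preferences concrete, here is a hypothetical Python sketch of how one tag rule could be applied to the subfields of a field. The exact semantics are assumptions: required_subfields is read as "skip the field unless all these codes are present", omit_subfields as "drop these codes", and the rule is a plain dict rather than our XML.

```python
def build_heading(subfields, rule):
    """Build a heading from (code, value) pairs according to a rule dict.

    Returns None when a required subfield is missing (assumed behaviour).
    """
    present = {code for code, _ in subfields}
    if not set(rule.get("required_subfields", "")) <= present:
        return None
    omit = set(rule.get("omit_subfields", ""))
    heading = " ".join(value for code, value in subfields if code not in omit)
    # Make sure the heading ends with the configured suffix.
    suffix = rule.get("suffix", "")
    if suffix and not heading.endswith(suffix):
        heading += suffix
    return heading
```

For instance, with rule = {"suffix": ".", "omit_subfields": "d"}, the field [("a", "Durante Alighieri"), ("d", "1265-1321")] becomes "Durante Alighieri.".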
Slide Loading & Synchronizing (3)
The Solr db also contains some special documents, whose type is “system”. Two timestamps record the
start and the end of the update process, while each list has a counter to monitor its usage.
Four MySQL tables are involved. One of them, deleted_auth_header, is new. Whenever an authority
record is deleted, a slightly modified C4::AuthoritiesMarc.pm logs the event in this table.
The synchronizing process runs as a cron job. We chose to run it once a minute. A lock file ensures
that only one instance is running at the same time.
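The lock-file idea can be sketched in Python as follows (our job is in Perl; the file name and function names here are hypothetical). Creating the file with an exclusive flag makes the check and the creation a single atomic step, so two cron runs cannot both succeed.

```python
import os
import tempfile

# Hypothetical location; any path writable by the cron user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "browse_sync.lock")


def acquire_lock(path=LOCK_PATH):
    """Return True if we created the lock file, False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(path=LOCK_PATH):
    """Remove the lock file so the next run can start."""
    os.remove(path)
```

A run would call acquire_lock() first, exit immediately if it returns False, and call release_lock() in a finally block once synchronization is done. A crashed run leaves a stale lock behind, so the stored process ID is useful for cleaning up.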
Slide Querying (1)
To access lists, we created a new page in Koha, with a link near “Advanced Search”. The
screenshot shows the public lists, the “starting from” text field and the number of results per page
available.
This page is generated by a Perl CGI script.
Slide Querying (2)
When listing 5 authors starting from Alighieri, we obtain this result. Each heading can be clicked to
access the related documents, whose count is the number in the 3rd column. See also and Used for
headings, if any, are listed in the 4th column.
The red link, available only for authors, starts a search on the rich VIAF catalog. Due to its
completeness, very often we obtain a successful result. Of course, more links could be added, for
instance to the Wikipedia Biography Portal.
The usage count is computed on the fly. It is not stored in the Solr db. For headings coming from
authorities, this ensures that clicking the author name shows the exact number of bibliographic
records even if the synchronization is not running.
Slide Querying (3)
When listing titles, the result page contains titles from many tags, including series titles, even though we
also have a list with only series titles. To set series titles apart, we added a special gray label.
The usage count for headings that come from bibliographic records is computed by Solr facets. In
fact, there will be for instance seventeen Solr documents (see the last line) with the same sort form
in the titles list.
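To show what the facet gives us, the Python sketch below imitates it on an in-memory list of documents: it groups the documents of one list by sort form, counts them, and returns a page of results starting from a given string. The field names "list" and "sort_form" and the function name are assumptions; in the real tool Solr does this work.

```python
from bisect import bisect_left
from collections import Counter


def browse(docs, list_name, start, rows):
    """Return up to `rows` (sort_form, count) pairs of one list,
    in alphabetical order, beginning at the first sort form >= start."""
    counts = Counter(d["sort_form"] for d in docs if d["list"] == list_name)
    ordered = sorted(counts)
    first = bisect_left(ordered, start)
    return [(form, counts[form]) for form in ordered[first:first + rows]]
```

Seventeen title documents sharing one sort form would thus appear as a single line with a count of 17.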
Slide Statistics
A special button for statistics is available. It shows fresh counts for each list, as well as the search
counts (not shown here). A good way to monitor the Solr browse db.
Slide Security
Solr interaction is driven by HTTP requests. In a standard installation, anybody could access the
documents. This is very dangerous.
There are many ways to solve this issue. We chose to manage security by setting a Jetty username and
password. Jetty is the application server included in the standard Solr distribution.
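Once Jetty asks for a username and password, every client of Solr has to send them. Assuming Jetty is configured for HTTP Basic authentication, a client request could be prepared as in this Python sketch (our CGI is in Perl; the function name, URL and credentials are placeholders).

```python
import base64
import urllib.request


def authenticated_request(url, username, password):
    """Build a request carrying an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request
```

The request would then be passed to urllib.request.urlopen. Basic authentication sends the password only base64-encoded, so it is best kept on the local machine or behind HTTPS.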
Slide License and portability
This implementation of browse is open source, under the same license as Koha.
However, it is not published yet. It requires more work to become a standard Koha tool, since its
author is not a Koha developer, only a beginner. I know that Claire Hernandez of BibLibre
has a lot of experience in Solr. I would be happy to share the source code with her.
Slide Grazie
Thank you very much to the Koha Community, now in Scotland!