Adding browse to Koha using Solr



KohaCon12 – Edinburgh, June 5th, 2012
Adding browse to Koha using Solr
Stefano Bargioni, Pontifical University Santa Croce – Rome

Slide 1
It's very exciting for me to take part in the Koha Conference for the first time. Thanks a lot to the Community for everything I learnt during these days.

Slide: The PUSC Library
Basic data about my library are summarized in this slide. We are very young, since my university was founded only 26 years ago. It was inspired by Saint Josemaría Escrivá, founder of Opus Dei. Twenty years ago we participated in the foundation of a consortium, URBE, the Roman Union of Ecclesiastical Libraries.

Slide: Why do we need browse at PUSC?
The idea of alphabetically sorted lists of headings (authors, titles, series, subjects and so on) is implemented in some LMS as another kind of search. We think it is not a “must”, thanks to the power of simple and advanced searches. However, our users and the typology of our data suggested that we add it to Koha.
Since we adopted Koha, our catalog has experienced a strong increase in quality: we added full authority records (we had only cross-references), and we started introducing subject headings. This is why we are interested in browsing headings coming from authority records as well as bibliographic records.

Slide: How do you say?
Ancient authors, Popes, institutions, and other kinds of authors, also because of the cataloguing rules adopted by the library, can make it necessary to help users and cataloguers choose the correct form for searching the catalog.
In the Virtual International Authority File, Dante Alighieri, who wrote the famous Divine Comedy, has hundreds of variant forms. Which is the chosen form in your library?

Slide: Grouping
Clustering and counting headings is another reason to use browse: it is interesting for managing and searching series, looking at your catalog using Dewey, and so on.

Slide: Browse Functionalities
What may you ask of a browse tool? Basically, to navigate alphabetically sorted lists. So you will need to extract headings from your catalog, build a sort form, and add information such as, first of all, the usage count.
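The three steps named above (extract headings, build a sort form, add a usage count) can be sketched in a few lines. This is only an illustration in Python, not the Perl code used at PUSC: the function names `sort_form`, `build_browse_list` and `browse` are hypothetical, the sort form is a simplified stand-in (accents stripped, lowercased, punctuation dropped), and the headings are held in memory rather than in Solr.

```python
import bisect
import unicodedata
from collections import Counter


def sort_form(heading: str) -> str:
    """Simplified, hypothetical sort key: no accents, lowercase, no punctuation."""
    decomposed = unicodedata.normalize("NFKD", heading)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks.lower())
    return " ".join(cleaned.split())


def build_browse_list(headings):
    """Return (sort form, display heading, usage count) tuples, sorted by sort form."""
    counts = Counter(headings)
    return sorted((sort_form(h), h, n) for h, n in counts.items())


def browse(entries, start, rows=5):
    """Return up to `rows` entries whose sort form is >= the sort form of `start`."""
    keys = [entry[0] for entry in entries]
    first = bisect.bisect_left(keys, sort_form(start))
    return entries[first:first + rows]


# Headings as they might be extracted from records (made-up sample data).
extracted = [
    "Alighieri, Dante",
    "Alighieri, Dante",
    "Escrivá, Josemaría",
    "Augustine, Saint",
    "Zola, Émile",
]
entries = build_browse_list(extracted)
for key, heading, count in browse(entries, "Au", rows=2):
    print(f"{heading} ({count})")
```

Keeping the display heading separate from the sort form is what lets the list be ordered consistently while still showing the heading as it was catalogued.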
Slide: Browse requirements
We tried to write a utility with the following requirements. The most important, maybe, is the ability to include in the same list headings coming from different tags, from authority or bibliographic records.
If I'm not wrong, its implementation is independent of the MARC flavour.

Slide: The engine
We tried using Zebra, but it is very difficult for me to configure.
We considered MySQL, but SQL DBMSs do not perform well when required to extract a small subset of sorted records from a very large set of headings.
Solr was our choice as the search engine, due to its ability to work with facets. And its future integration in Koha could be a win-win for the browse.

Slide: The Solr document (1)
Solr works using the document as a metaphor. Every heading we are interested in including in a list will be a Solr document.
In the Solr schema we defined some fields that we are going to discuss in the next few slides.
The most important field is the ID. Since we can have identical sort forms in the same list, we cannot use the sort form in the ID. For example, we need to distinguish the title The Bible from the title Bible, even if their sort form is the same, because the non-filing characters strip out the initial article.
The ID is of course what Solr uses to delete or replace a document. It will be discussed in detail later.
Every document belongs to a list; it comes from an authority or a bibliographic record, from a tag and from an occurrence of the tag. It also has a type: it can be a main heading, a see from, a see also, and so on.
It is not useful to store in the Solr document information about the subfields used to extract the heading. Often every subfield will be extracted, but in other cases we only need some of them. The configuration file will reflect this.

Slide: The Solr document (2)
Here is an example of a Solr document for the main author Dante Alighieri. Please note its ID.

Slide: The Solr document (3)
And this is an example of a Solr document for a title. Titles other than uniform titles do not come from authority records. They will always have type acc, that is, main. Also note the ID.

Slide: The Solr document (4)
The ID has a complex structure: we built it by concatenating the list name, “a” for authority or “b” for bibliographic, the authid or the biblionumber, the tag, and the zero-based occurrence number.
We think this is a unique identifier. If not, only the last heading entered in Solr with the same ID will survive, leading to a silent error.
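The ID scheme and the silent overwrite can be illustrated with a short Python sketch. It is not the author's Perl code, and several details are assumptions: the underscore separator, the field names `heading` and `sort_form`, the record numbers 1001 and 1002, and the helper names `make_id` and `add_document`. A plain dictionary stands in for the Solr index, which likewise replaces a document when a new one arrives with the same unique key.

```python
def make_id(list_name, source, record_number, tag, occurrence, sep="_"):
    """Concatenate the five ID parts; the separator is an assumption."""
    if source not in ("a", "b"):
        raise ValueError("source must be 'a' (authority) or 'b' (bibliographic)")
    return sep.join([list_name, source, str(record_number), tag, str(occurrence)])


def add_document(index, doc):
    """Mimic the replace-on-same-ID behaviour with a dict keyed by ID."""
    index[doc["id"]] = doc


index = {}

# Same sort form, different records: the IDs differ, so both titles survive.
add_document(index, {"id": make_id("titles", "b", 1001, "245", 0),
                     "heading": "The Bible", "sort_form": "bible"})
add_document(index, {"id": make_id("titles", "b", 1002, "245", 0),
                     "heading": "Bible", "sort_form": "bible"})

# A colliding ID silently replaces the earlier document.
add_document(index, {"id": make_id("titles", "b", 1002, "245", 0),
                     "heading": "Bible (reloaded)", "sort_form": "bible"})

print(sorted(index))
```

Because the occurrence number is part of the ID, a record with two occurrences of the same tag yields two distinct documents, for example `titles_b_1001_245_0` and `titles_b_1001_245_1` under the assumed separator.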
Slide: The Solr document (5)
This screen shows the algorithm we use to build the sort form.
Maybe there is a better way to generate sort forms, taking into account that Koha is used in many languages and that the same catalog can contain more than one script. Is International Components for Unicode, aka ICU, the solution? I'm not so experienced... sorry.

Slide: Architecture
The architecture is simple: a Solr db is updated with new or modified Koha records. At the same time, users access the Solr db through the web and a Perl CGI.

Slide: Loading & Synchronizing (1)
An important component of browse is the loader. We wrote it in Perl, and it can run both as the initial bulk loader and as the updater.
It connects to the Koha SQL tables in read mode and adds or updates Solr documents.
Our experience with Solr suggested issuing commit and optimize commands on a regular basis, to limit memory consumption and ensure the fastest load. These parameters can vary depending on the server running Solr.

Slide: Loading & Synchronizing (2)
The configuration of the loader can be a large file. I chose XML, but I know that the Koha developer Community prefers YAML. Sorry.
It contains two main sections: one gathers the tags coming from authority records, the other the tags coming from bibliographic records.
Here are two examples: on the left side, MARC21 authority tag 400 is sent to the list of authors, type see. Every subfield will be copied. Suffix ensures that the heading ends with the specified string.
The example on the right side refers to MARC21 bibliographic tag 245, i.e. a title. The skip_indicator contains the number of the indicator that holds the skip-in-filing value.
More preferences are available for each tag, like required_subfields and omit_subfields. They allow tags to be processed at a higher level of detail.

Slide: Loading & Synchronizing (3)
The Solr db also contains some special documents, whose type is “system”.
Two timestamps register the start and the end of the update process, while each list has a counter to monitor its usage.
Four MySQL tables are involved. One of them, deleted_auth_header, is new. Whenever an authority record is deleted, a slightly modified [...] logs the event in this table.
The synchronizing process runs as a cron job. We chose to run it once a minute. A lock file ensures that only one instance is running at a time.

Slide: Querying (1)
To access lists, we created a new page in Koha, with a link near the “Advanced Search”. The screenshot shows the public lists, the “starting from” text field and the number of results per page available.
This page is generated by a Perl CGI script.

Slide: Querying (2)
When listing 5 authors starting from Alighieri, we obtain this result. Each heading can be clicked to access related documents, whose count is the number in the 3rd column. See also and Used for headings, if any, are listed in the 4th column.
The red link, available only for authors, starts a search on the rich VIAF catalog. Due to its completeness, very often we obtain a successful result. Of course, more links could be added, for instance to the Wikipedia Biography Portal.
The usage count is computed on the fly. It is not stored in the Solr db. For headings coming from authorities, this ensures that clicking the author name will show the exact number of bibliographic records even if the synchronization is not running.

Slide: Querying (3)
When listing titles, the result page contains titles from many tags, including series titles, even though we have a list with only series titles. To set apart series titles, we added a special gray label.
The usage count for headings that come from bibliographic records is computed by Solr facets. In fact, there will be for instance seventeen Solr documents (see the last line) with the same sort form in the titles list.

Slide: Statistics
A special button for statistics is available. It shows fresh counts for each list, as well as the search counts (not shown here). It is a good way to monitor the Solr browse db.

Slide: Security
Solr interaction is driven by HTTP requests. In a standard installation, anybody could access documents. This is very dangerous.
There are many ways to solve this issue. We chose to manage security by setting a Jetty username and password. Jetty is the application server included in the standard Solr distribution.

Slide: License and portability
This implementation of browse is open source, under the same license as Koha. However, it is not published yet.
It requires more work to become a standard Koha tool, since its author is not a Koha developer but an abecedarian. I know that Claire Hernandez of BibLibre has a lot of experience in Solr. I would be happy to share the source code with her.

Slide: Grazie
Thank you very much to the Koha Community, now in Scotland!