This document discusses adding browse functionality to the Koha integrated library system using Apache Solr. Key points include:
- The PUSC Library wants to add browse functionality to Koha to help users navigate subjects, authors, and related headings, a feature its earlier Aleph and Amicus systems provided.
- Solr is proposed as the engine to power browse due to its flexibility, performance, and potential future integration into Koha to replace the current search tool.
- A process is outlined for loading authority and bibliographic records into a Solr database, synchronizing it with Koha, and querying it to power browse lists within the Koha OPAC.
- Statistics, security, licensing, and portability are also covered.
Solr 3.1 includes many new features and improvements such as range faceting on numeric fields, geospatial search enhancements, JSON document indexing, autosuggest and spellcheck components, analysis filter improvements, and distributed support for additional components. Major components include Apache Lucene 3.1.0, Apache Tika 0.8, Carrot2 3.4.2, Velocity 1.6.1 and Velocity Tools 2.0-beta3, and Apache UIMA 2.3.1-SNAPSHOT.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It utilizes HDFS for storage, which distributes data across nodes and replicates files for fault tolerance. HDFS uses a master/slave architecture, with a NameNode managing the file system namespace and DataNodes storing file data in blocks. The Hadoop API provides access to HDFS through interfaces like FileSystem and FSDataInputStream, allowing applications to read, write, and manipulate data in a distributed manner.
Oslo Solr MeetUp March 2012 - Solr4 alpha (Cominvent AS)
Jan Høydahl presented what is new in Solr 4.0 including near real-time search capabilities, SolrCloud for distributed search across multiple cores, an improved spellchecker, smaller indexes using Flex, pluggable ranking, new sorting functions, and an updated admin GUI. Some key features being added in Solr 4.0 are support for Apache ZooKeeper, auto load balancing of queries across collections, and fault tolerant indexing.
Linux is an open source operating system initially created by Linus Torvalds in 1991. It has since grown significantly with hundreds of companies and individuals developing their own versions based on the Linux kernel. The kernel is developed under the GNU GPL license and its source code is freely available. Basic Linux commands allow users to navigate directories, manage files and permissions, transfer files, and get system information. More advanced commands provide additional control and functionality.
The document provides information about Apache Solr, an open source search platform written in Java. It discusses how Solr functions, how to install and configure it, options for indexing and querying data, and examples of common Solr operations like search, filtering, faceting and highlighting results.
Apache Solr is an enterprise search platform built on Apache Lucene. It provides fast, scalable search functionality and allows for spell checking, highlighting, faceting and more. Solr configurations are defined in schema.xml and solrconfig.xml files which specify fields, analyzers, caching and other settings. Documents are indexed and queried via HTTP requests to Solr servers. Liferay can integrate with Solr to offload search indexing and querying for improved performance in clustered environments.
This document provides information about scheduling jobs in UNIX/Linux systems. It discusses using the cron daemon to schedule jobs to run periodically based on time and date settings. It also covers using the at command to schedule single jobs to run once at a specific time. The crontab file format and common cron directories are described. It outlines how to list, delete, and manage scheduled jobs, and how user access to job scheduling is configured through cron access control files.
The document provides 40 tips for using basic Linux command line commands and tricks. Some key points include: everything in Linux is a file; # and $ denote superuser and normal users respectively; Ctrl+Alt+F1-F6 switch between terminals while Ctrl+Alt+F7 switches to the GUI; tilde ~ denotes the user's home directory; hidden files start with a dot; ls -a views hidden files; file permissions use rwx notation; and variables can be assigned text for repeated use.
The document describes a presentation given at KohaCon12 about adding browse functionality to Koha using Solr. It details the motivation for adding browse, the design of documents in the Solr index, how the index is loaded and synchronized with Koha, and how browse lists and results are queried. The goal is to provide a way to browse alphabetical lists of headings extracted from authority and bibliographic records in Koha.
This document provides an introduction to Apache Lucene and Solr. It begins with an overview of information retrieval and some basic concepts like term frequency-inverse document frequency. It then describes Lucene as a fast, scalable search library and discusses its inverted index and indexing pipeline. Solr is introduced as an enterprise search platform built on Lucene that provides features like faceting, scalability and real-time indexing. The document concludes with examples of how Lucene and Solr are used in applications and websites for search, analytics, auto-suggestion and more.
Introduction to Lucene & Solr and Use Cases (Rahul Jain)
Rahul Jain gave a presentation on Lucene and Solr. He began with an overview of information retrieval and the inverted index. He then discussed Lucene, describing it as an open source information retrieval library for indexing and searching. He discussed Solr, describing it as an enterprise search platform built on Lucene that provides distributed indexing, replication, and load balancing. He provided examples of how Solr is used for search, analytics, auto-suggest, and more by companies like eBay, Netflix, and Twitter.
The document discusses Apache Solr, an open source search platform. It provides an overview of Solr, including its history and architecture. It also discusses how to set up a basic two shard Solr cluster with replicas and how Solr's schema works in a distributed environment. Lastly, it covers how to integrate Solr with other projects like Lucene, Zookeeper, Nutch, Mahout, Hadoop and ManifoldCF.
This document discusses building distributed search applications using Apache Solr. It provides an overview of Solr architecture and components like schema, indexing, querying etc. It also describes hands-on activities to index sample data from disk, database using Data Import Handler and SolrJ client. Query syntax for different types of queries and configuration of search handlers is also covered.
Apache Solr for TYPO3 Components & Review 2016 (timohund)
The document discusses the Apache Solr extensions for TYPO3: EXT:solr for indexing pages and records, EXT:solrfal for indexing files, and EXT:solrfluid for fluid templates. It summarizes the developments and releases in 2016, including updates for TYPO3 7.6 and PHP 7.0, improved documentation, and new features like field collapsing and variants. The author invites involvement through GitHub, Slack, or becoming an extension partner.
This document provides an introduction to Apache Solr, an open-source enterprise search platform built on Apache Lucene. It discusses how Solr indexes content, processes search queries, and returns results with features like faceting, spellchecking, and scaling. The document also outlines how Solr works, how to configure and use it, and examples of large companies that employ Solr for search.
Solr 4.7 and 4.8 include new features such as asynchronous execution of long-running actions, cursors for deep paging, document expiration, dynamic synonyms and stopwords, SSL support in SolrCloud, and improved collections API. Future versions will focus on ZooKeeper as the single source of truth, incremental field updates, multi-valued DocValues sorting, and removing legacy field types. The speaker also discussed related open source projects from LucidWorks for deploying Solr on AWS, log processing, and data quality.
Video that accompanies this presentation at: http://www.youtube.com/watch?v=1t3Z2pJyulA
Join us for a guided tour of the Alfresco SOLR integration and new search sub-systems. We’ll discuss how it works, the limitations of eventual consistency, guidance for configuration and set-up. We’ll also cover the steps required to migrate, improved PATH performance, in-query ACL evaluation, cross-language support and monitoring as well as performance.
This document discusses building distributed search applications using Apache Solr. It provides an agenda that covers topics such as Solr architecture, schema configuration, indexing data, querying, SolrCloud, and performance factors. It also references a demo app that will be used for hands-on examples during the presentation.
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
Solr search engine with multiple table relation (Jay Bharat)
This presentation shows how to use the Solr search engine and integrate it into an application such as a PHP/MySQL stack.
It introduces how to handle data from multiple related tables in Solr.
Apache Solr is a search platform built on Apache Lucene. It provides powerful indexing and search capabilities along with features like real-time indexing, faceted search, caching, and replication. Solr configuration is done through XML files that define aspects like tokenization, stemming, synonyms, and stop words. Solr uses REST services and exposes a HTTP interface to provide search functionality in a stateless manner.
Introduction to the basics of Information Retrieval (IR) with an emphasis on Apache Solr/Lucene. A lecture I gave during the JOSA Data Science Bootcamp.
You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.
This document provides an overview of a data science conference where the keynote speaker will discuss using Apache Solr and Apache Spark together for data science applications. The speaker is the CTO of Lucidworks and will cover getting started with Solr and Spark, demoing how to index data, run analytics like clustering and classification, and more. Resources for learning more about Solr, Spark, and Lucidworks Fusion are also provided.
Solr & R to Deploy Custom Search Interface: Presented by Patrick Beaucamp, Bp... (Lucidworks)
This document summarizes a presentation about integrating Solr and R to enable custom search interfaces within AklaBox. The presentation discusses using R for text analysis and classification of documents before indexing them in Solr. This allows adding metadata to documents to improve search. It also presents using GoJS to visualize mind maps of word associations and OSM for geographic visualization. The presentation demonstrates these capabilities and discusses scaling to large document volumes using Spark, SolrRDD, and Vanilla Air.
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ... (NETWAYS)
Immutable infrastructure is a path to success, but what about the lifecycle of individual resources? This talk covers the evolution of resources, code structure, Terraform coding tricks, composition, and refactoring.
Solr Recipes provides quick and easy steps for common use cases with Apache Solr. Bite-sized recipes will be presented for data ingestion, textual analysis, client integration, and each of Solr’s features including faceting, more-like-this, spell checking/suggest, and others.
Catalog Enrichment for RDA - Adding relationship designators (in Koha) [text] (Stefano Bargioni)
Relationship designators are used to specify the relationship between a resource and a person, family, or corporate body associated with that resource. This presentation shows how they were added to the catalog of the library of the Pontificia Università della Santa Croce, in new records and, mostly automatically, in legacy records. The Name Cloud, a way to navigate the catalog through related authors, is also shown.
Catalog Enrichment for RDA - Adding relationship designators (in Koha) (Stefano Bargioni)
Relationship designators are used to specify the relationship between a resource and a person, family, or corporate body associated with that resource. This presentation shows how they were added to the catalog of the library of the Pontificia Università della Santa Croce, in new records and, mostly automatically, in legacy records. The Name Cloud, a way to navigate the catalog through related authors, is also shown.
Talk given at the conference "METODI SCELTE STRUMENTI: IL NUOVO CATALOGO DELLA RETE URBS" (Methods, Choices, Tools: the New Catalog of the URBS Network), 11 June 2015. Video at https://www.youtube.com/watch?v=gK3_6NKJMzM
Publication cover management in a library system (text) (Stefano Bargioni)
Book covers can be stored in a Library Management System. This work, presented at the 33rd ADLUG meeting in Piazza Armerina in October 2014, discusses the pros and cons, and how to collect book covers during cataloguing or circulation operations.
Publication cover management in a library system (slides) (Stefano Bargioni)
Book covers can be stored in a Library Management System. This work, presented at the 33rd ADLUG meeting in Piazza Armerina in October 2014, discusses the pros and cons, and how to collect book covers during cataloguing or circulation operations.
Catalog enrichment: importing Dewey Decimal Classification from external sour... (Stefano Bargioni)
Important catalogs are usually accessed to copy-catalogue whole records, but "atomic" pieces of information can be retrieved too, using unique keys such as ISBN.
The library of the Pontificia Università della S. Croce developed a tool that retrieves Dewey classification and inserts it into bibliographic records, in bulk mode as well as in single-record mode, i.e. during cataloguing.
During the bulk process, Dewey classification was added to about 20,000 records, retrieving it from up to 7 external sources, including OCLC, the Library of Congress, and some national libraries.
The single-record mode was integrated into the Koha ILS to make it easier to assign Dewey classification during cataloguing.
1. KohaCon12
Adding browse to Koha
using Solr
<http://tinyurl.com/solr-browse>
Stefano Bargioni
Pontifical University Santa Croce – Rome
bargioni@pusc.it
2. The PUSC Library
● 160,000 volumes
– 147,000 bibs
– 111,300 auth
● Aleph 300; Amicus 3.5
● Koha 3.2.7 from May 1st, 2011
● PUSC belongs to the URBE Network
– 17 academic libraries
– 2 of them using Koha
Adding browse to Koha using Solr 2
3. Why do we need browse at PUSC?
● Aleph 300 and Amicus offered it
● Our users and cataloguers frequently used it
● We have a lot of ancient authors, Popes, …,
requiring “seen from”, “see also”
● We started to add subjects to our bibliographic
records
4. How do you say?
● Alighieri, Dante or
● Dante Alighieri or
● Allighieri, Dante ?
● Ratzinger, Joseph, 1927- or
● Benedictus PP. XVI, 1927- or
● Papi (2005- : Benedictus XVI) ?
We have to help users and cataloguers to use the
correct form.
5. Grouping
● Uniform Titles
● Dewey
● Series, ...
6. Browse Functionalities
● Headings from authority as well as
bibliographic records
● Starting from
● Previous Headings, Next Headings
● Number of documents
● Related headings (see, see also, seen from)
● Go to authority record, if any
● Additional Links
7. Browse Requirements
● Indexes fed by headings coming from
– more than one auth tag
– more than one bib tag
● Sort form for Latin-1 (non-latin scripts?)
● Consider non-filing characters
● Synchronize frequently
● Integrated into the Koha OPAC
● MARC flavour independence
8. The engine
● Why Solr?
– Schema flexibility
– Facets
– High performance in update and query
– Better than MySQL
– Will be part of Koha, maybe replacing
Zebra
9. The architecture
● Web browser → Perl CGI → Solr db
● Koha SQL tables → loader.pl (cron job) → Solr db
10. The Solr Document (1)
Field name Value
id unique identifier
authid | sysno int
au | tl | se ... string (display form)
sortform_au | sortform_tl | sortform_se... string
timestamp ISO 8601
type acc | see | also ...
11. The Solr Document (2)
Field name Example auth
id au_a_1234_100_0
authid 1234
au Alighieri, Dante
sortform_au alighieri.dante
timestamp 2012-05-23T19:10:54Z
type acc
12. The Solr Document (3)
Field name Example bib
id tl_b_5678_245_0
sysno 5678
tl Gesù Cristo secondo la dottrina di S. Tommaso d'Aquino
sortform_tl gesu.cristo.secondo.la.dottrina.di.s.tommaso.d.aquino
timestamp 2012-05-23T18:15:44Z
type acc
13. The Solr Document (4)
Id structure:
– List name: au | tl | se ...
– Source: a | b
– Source authid or sysno: nnn
– Tag: ttt
– Occurrence #: n (0-based)
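The id layout above can be sketched in code. The project itself is Perl; this Python rendition is purely illustrative, and the helper names are hypothetical:

```python
def make_id(list_name, source, record_id, tag, occurrence):
    """Build a Solr document id such as 'au_a_1234_100_0'.

    list_name  -- browse list (au, tl, se, ...)
    source     -- 'a' for authority, 'b' for bibliographic
    record_id  -- authid or sysno
    tag        -- MARC tag the heading came from
    occurrence -- 0-based occurrence of that tag in the record
    """
    return f"{list_name}_{source}_{record_id}_{tag}_{occurrence}"

def parse_id(doc_id):
    """Split a document id back into its five components."""
    list_name, source, record_id, tag, occurrence = doc_id.split("_")
    return {
        "list": list_name,
        "source": source,
        "record_id": int(record_id),
        "tag": tag,
        "occurrence": int(occurrence),
    }
```

The ids shown on slides 11 and 12 (au_a_1234_100_0, tl_b_5678_245_0) follow this five-part layout.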
14. The Solr Document (5)
The sort form:
– Diacritics to simple letter (àÀ to aA, ...)
● use Text::Unidecode;
– Lowercase
– Strip out non-filing characters (titles)
– Replace non a-z0-9 with dot
Used for facets
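The four normalization steps above can be sketched as follows. This is a hedged Python stand-in: the real loader uses Perl's Text::Unidecode, and NFKD decomposition is only a rough equivalent that covers Latin-1 diacritics:

```python
import re
import unicodedata

def sort_form(heading, nonfiling=0):
    """Build the sort form of a heading following the slide's rules."""
    s = heading[nonfiling:]                   # strip non-filing characters
    s = unicodedata.normalize("NFKD", s)      # à -> a + combining accent
    s = s.encode("ascii", "ignore").decode()  # drop the combining accents
    s = s.lower()
    s = re.sub(r"[^a-z0-9]+", ".", s)         # runs outside a-z0-9 become dots
    return s.strip(".")
```

These rules reproduce the two sort forms shown on slides 11 and 12: "Alighieri, Dante" becomes alighieri.dante and the Gesù Cristo title becomes gesu.cristo.secondo.la.dottrina.di.s.tommaso.d.aquino.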
15. Loading & Synchronizing (1)
● The same cron based Perl script loads the Solr db for
the first time and updates it
– use C4::Context;
– use C4::AuthoritiesMarc;
– use WebService::Solr;
● 2000: number of docs modified before issuing a commit
● 5: number of commits before issuing an optimize
● 38 minutes to load 662,400 headings
● Configured through an xml file
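The commit/optimize cadence (commit every 2000 modified docs, optimize every 5 commits) reduces to a small counting wrapper. A sketch, assuming any client object with add/commit/optimize methods; the real loader uses Perl's WebService::Solr:

```python
class BatchLoader:
    """Commit every `commit_every` added docs; optimize every
    `optimize_every` commits."""

    def __init__(self, solr, commit_every=2000, optimize_every=5):
        self.solr = solr
        self.commit_every = commit_every
        self.optimize_every = optimize_every
        self.pending = 0   # docs added since the last commit
        self.commits = 0   # commits issued so far

    def add(self, doc):
        self.solr.add(doc)
        self.pending += 1
        if self.pending >= self.commit_every:
            self._commit()

    def _commit(self):
        self.solr.commit()
        self.pending = 0
        self.commits += 1
        if self.commits % self.optimize_every == 0:
            self.solr.optimize()

    def finish(self):
        # Flush whatever is left at the end of the run.
        if self.pending:
            self._commit()
```

Batching commits this way is what keeps the initial load (662,400 headings) down to tens of minutes; committing per document would be far slower.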
16. Loading & Synchronizing (2)
● The XML config file (XML::Simple → YAML?):
– Two main sections: auth and bib
– each section lists tags that feed indexes
<tag>
  <code>400</code>
  <list>au</list>
  <type>see</type>
  <subfields>*</subfields>
  <suffix>.</suffix>
</tag>
<tag>
  <code>130</code>
  <list>tl</list>
  <type>acc</type>
  <subfields>*</subfields>
  <skip_indicator>2</skip_indicator>
</tag>
<tag>
  ...
</tag>
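Reading such a config file amounts to walking the two sections and collecting each <tag> element. A hypothetical Python sketch (the project uses Perl's XML::Simple, and the exact top-level layout of the real file is an assumption):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape shown on the slide; the auth/bib
# wrapper layout is assumed, not taken from the real config file.
CONFIG = """
<config>
  <auth>
    <tag><code>400</code><list>au</list><type>see</type>
         <subfields>*</subfields><suffix>.</suffix></tag>
  </auth>
  <bib>
    <tag><code>130</code><list>tl</list><type>acc</type>
         <subfields>*</subfields><skip_indicator>2</skip_indicator></tag>
  </bib>
</config>
"""

def load_config(xml_text):
    """Return {'auth': [...], 'bib': [...]}, one dict per <tag>."""
    root = ET.fromstring(xml_text)
    return {
        section: [
            {child.tag: child.text for child in tag}
            for tag in root.find(section).findall("tag")
        ]
        for section in ("auth", "bib")
    }
```

Each resulting dict tells the loader which MARC tag feeds which list, the heading type, the subfields to take, and options such as suffix or skip_indicator.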
17. Loading & Synchronizing (3)
● Special records in Solr, type:system
– Created if they do not exist, otherwise incremented
– A usage counter for each index
– Last update timestamps
● Search for new, modified or deleted records
– MySQL tables auth_header, biblioitems,
deletedbiblioitems, deleted_auth_header
– Modified AuthoritiesMarc.pm to fill
deleted_auth_header for auth deletion
● Cron once a minute, using a lock file
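The once-a-minute cron run guarded by a lock file can be sketched like this (a Python stand-in for the Perl script; the lock path is made up):

```python
import fcntl

def try_lock(path):
    """Take an exclusive, non-blocking lock on `path`.

    Returns the open handle on success (keep it open for the whole
    run), or None if a previous cron invocation still holds the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh
    except BlockingIOError:
        fh.close()
        return None
```

Cron starts the loader every minute; the script calls try_lock at startup and exits immediately when it gets None, so a slow synchronization run is never overlapped by the next one.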
18. Querying (1)
A new page in Koha: Browse list of indexes
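The slides do not show the actual Solr query behind the browse page, but since the sort forms are "used for facets" (slide 14), a browse list can plausibly be fetched with standard Solr facet parameters. Every parameter choice below is an assumption, not taken from the deck:

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed core URL

def browse_url(list_name, start="", rows=20):
    """Build a facet query returning up to `rows` headings from the
    given browse list; `facet.prefix` narrows the alphabetical list."""
    params = {
        "q": "type:acc",          # accepted headings only (assumed)
        "rows": 0,                # no documents, just the facet terms
        "facet": "true",
        "facet.field": f"sortform_{list_name}",
        "facet.sort": "index",    # alphabetical order, not by count
        "facet.mincount": 1,
        "facet.limit": rows,
        "facet.prefix": start,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)
```

The facet counts double as the "number of documents" column of the browse list, and paging through facet terms gives the Previous/Next Headings behaviour.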
19. Querying (2)
● # of documents: C4::AuthoritiesMarc::CountUsage
● Related headings
● Search VIAF
● Show Koha auth record
20. Querying (3)
● Titles list contains standard titles and series titles
● Multivolume work
21. Statistics
● Only for PUSC
● Some are public, some staff only, some will be public
● We started some weeks ago
22. Security
● Solr db can be erased with a single http
request
● Many ways to add admin security
● For instance, modify
– jetty.xml
– webdefault.xml
– realm.properties
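The single destructive request the slide warns about is Solr's standard delete-by-query update message. The sketch below only builds the request (host and port are assumed defaults) without sending it, to show why the update handler needs the jetty.xml / realm.properties hardening:

```python
from urllib.request import Request

# The standard Solr update message that deletes every document;
# any client that can reach the update handler can send it.
DELETE_ALL = "<delete><query>*:*</query></delete>"

req = Request(
    "http://localhost:8983/solr/update?commit=true",  # assumed host/port
    data=DELETE_ALL.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
# Deliberately not executed here: the point is that nothing in a
# default Jetty setup stops a stranger from executing it.
```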
23. License and portability
● The same as Koha
● Tested on Koha 3.2 and Koha 3.6
● Needs work to be included in Koha
– I18N
– .tt instead of AJAX
– Branches?
– Integration with Koha system preferences
● … Solr experts... (BibLibre?)
24. Thank you – Grazie!