Catalog enrichment: importing Dewey Decimal Classification from external sources (text)

Large catalogs are usually accessed to copy-catalogue whole records. It is also possible to retrieve "atomic" pieces of information from them, using unique keys such as the ISBN.
The library of the Pontificia Università della Santa Croce developed a tool that retrieves Dewey numbers and inserts them into bibliographic records, both in bulk mode and in single record mode, i.e. during cataloguing.
In the bulk process, Dewey classification was added to about 20,000 records, retrieved from OCLC, the Library of Congress and some national libraries, for a total of up to 7 external sources.
The single record mode was integrated into the Koha ILS, to make it easier to assign Dewey classification during cataloguing.

Published in: Technology, Education

32nd ADLUG Meeting - Vitoria-Gasteiz, October 2013
Stefano Bargioni - Pontificia Università della Santa Croce
Catalog enrichment: importing Dewey Decimal Classification from external sources

Slide 2 - The project
The project behind this presentation grew out of a real need in the library of the Santa Croce Pontifical University. After migrating to the Koha open source ILS <http://koha-community.org> and adding authority records for names, name-titles and uniform titles to the catalog, we studied how to add subject headings. Although we had been using uncontrolled subject headings for many years, we realized that this job is neither simple nor fast. In the meantime, our catalog needed to be accessible through other semantic search paths. So we decided to improve the coverage of the Dewey Decimal Classification (DDC). Not manually, of course, but using an automatic procedure.

Slide 3 - Version 1: The batch mode
Batch mode is the process that scans the catalog and automatically adds DDC numbers to some records. This process can take hours or days, depending on the catalog size and other factors, and is meant to upgrade old bibliographic records. A similar process can be applied in interactive mode, for new bibliographic records. We will discuss it in the last part of the presentation.
The aim of this process is to enrich the catalog by adding DDC numbers to records, minimizing or even excluding human interaction. The process detects records without a DDC number (field 082) but with an ISBN (field 020). The ISBN, a unique identifier, is used to find the corresponding record in other databases where the DDC is in use. Big catalogs like those of national libraries or national bibliographies can be good choices. However, the DDC is not used everywhere at the same level of quality or uniformity. This is why we defined some criteria to check each DDC number before adding it to the catalog.

Slide 4 - An atomic copy cataloguing
Adding a single field to a record is very similar to copy cataloguing. We named it "atomic copy cataloguing". It is based on a unique identifier (often the ISBN), and requires that you are allowed to programmatically modify single records in your Integrated Library System (ILS). This can be done if Application Programming Interfaces (APIs) are available. For commercial ILSs there can be restrictions or reasons to avoid this kind of operation. Since this "atomic copy" is a subset of copy cataloguing, we can say that its use is allowed thanks to the same basic ideas and agreements that govern full copy cataloguing in the international library community.

Slide 5 - Records to be modified
This slide shows a way to detect the records involved in the process and to extract their system number and ISBN. We also chose to filter by the language of the work, to avoid sending many useless queries to remote
servers that mostly contain records in their own national language.

Slide 6 - Dewey Sources (I)
OCLC Classify and six national libraries or bibliographies were selected as Dewey sources. They must be accessible through a protocol like REST, Z39.50 or HTTP, in order to receive responses as XML, binary MARC or HTML data. The best is XML, i.e. structured data; the worst is HTML, because in this case the data carry no metadata, are mixed with or merged into formatting code, and are difficult to extract.

Slide 7 - Dewey Sources (II): OCLC Classify
Interesting as it is, we can skip a detailed description of the OCLC Classify service. It is important to notice what is highlighted in red:
  • it has a machine interface, and its response is in XML format;
  • the ISBN is a key to access it;
  • it contains 36 million records having the DDC.

Slide 8 - Dewey Sources (III): National Libraries
The table shows the six important national libraries we queried, the language we chose to limit queries to, and the format of the responses. The Dewey sources were queried in the order used in this table, with OCLC Classify as the first one. Note that there are no Spanish sources, and we would be very interested in finding one, if any exists.
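As an illustration only, the record selection described for Slide 5 (an ISBN in field 020, no DDC in field 082, and a language filter) could look like the following Python sketch. The in-memory record layout, the function name `records_to_enrich` and the sample data are assumptions made for this example; they are not taken from the project's software or from Koha.

```python
# Hypothetical record layout used only for this sketch:
# {"id": <system number>, "lang": <language code>,
#  "fields": {<tag>: [{"subfields": {<code>: <value>}}, ...]}}


def records_to_enrich(records, languages):
    """Yield (system number, ISBN) for records that have an ISBN (020)
    but no DDC number (082), limited to the given language codes."""
    for rec in records:
        fields = rec.get("fields", {})
        isbn_fields = fields.get("020", [])
        if not isbn_fields or fields.get("082"):
            continue  # no ISBN to search with, or DDC already present
        if rec.get("lang") not in languages:
            continue  # skip languages the remote source rarely holds
        yield rec["id"], isbn_fields[0]["subfields"]["a"]


# Invented sample catalog: only record 1 qualifies for an English source.
catalog = [
    {"id": 1, "lang": "eng",
     "fields": {"020": [{"subfields": {"a": "9780306406157"}}]}},
    {"id": 2, "lang": "eng",
     "fields": {"020": [{"subfields": {"a": "9780000000002"}}],
                "082": [{"subfields": {"a": "025.431"}}]}},
    {"id": 3, "lang": "ita",
     "fields": {"020": [{"subfields": {"a": "9780000000003"}}]}},
    {"id": 4, "lang": "eng", "fields": {}},
]

selected = list(records_to_enrich(catalog, {"eng"}))
print(selected)
```

Filtering before any network access keeps the number of remote queries as low as possible, which is the purpose of the language restriction mentioned above.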
Slide 9 - The logic used in the programs
As before, we can skip the details of the programs we wrote. Red refers to quality control, as mentioned before, and to the policy we adopted to avoid overloading servers. We are going to discuss both.

Slide 10 - Quality check
At the start of the project, our catalog contained DDC numbers from editions as old as the 19th. In the OCLC Classify or Library of Congress catalogs there are many DDC numbers that belong to older editions, or whose edition is not specified, or whose indicators are missing. We decided to discard DDC numbers inconsistent with our quality standard. Even though many of them were discarded, we added 50% more DDC numbers to our catalog. Less strict criteria would add more DDC numbers, but of lower quality.

Slide 11 - Delay while searching sources
When running, programs can query Dewey targets at a very high rate. This can overwhelm the remote server, especially if it is being accessed by other users. Furthermore, if records in your catalog are continuously modified, their indexing can overload your ILS. To avoid that, and to ensure that you respect the access policies sometimes set by remote catalogs, we defined a delay of 5-6 seconds between queries. As a consequence, the harvesting process became slower and in some cases lasted more than one day.

Slide 12 - Statistics
The programs were written to log their operations. The log files allowed us to build this table and some graphics. Useful information about the harvested data and the use of the DDC in each catalog was analyzed.
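The ordered querying of sources, the quality check and the delay between queries can be combined in a loop like the one sketched below. This is an illustrative Python sketch, not the programs used in the project: the candidate layout, the names `passes_quality_check` and `harvest`, the threshold `MIN_EDITION = 19` and the exact acceptance rules are all assumptions chosen to mirror the criteria described above (edition present and not too old, indicators present).

```python
import re
import time

# Hypothetical candidate returned by a source lookup, e.g.:
# {"ddc": "025.431", "edition": "22", "ind1": "0", "ind2": "4"}

MIN_EDITION = 19  # assumed threshold; the real criteria may differ


def passes_quality_check(cand):
    """Accept only candidates with a DDC number, a known edition that is
    not older than MIN_EDITION, and a non-blank first indicator."""
    match = re.match(r"\d+", str(cand.get("edition") or ""))
    if match is None or int(match.group()) < MIN_EDITION:
        return False  # edition missing or too old
    if cand.get("ind1") in (None, "", " "):
        return False  # indicators not present
    return bool(cand.get("ddc"))


def harvest(isbn, sources, delay_seconds=5.0, sleep=time.sleep):
    """Query (name, lookup) pairs in order and return (name, candidate)
    for the first candidate that passes the quality check, else None.
    A pause follows every remote query to avoid overloading servers."""
    for name, lookup in sources:
        cand = lookup(isbn)
        sleep(delay_seconds)
        if cand is not None and passes_quality_check(cand):
            return name, cand  # later sources are not queried
    return None
```

Passing `sleep` as a parameter is only a convenience for testing the loop without waiting. Because the loop stops at the first accepted number, sources placed earlier in the list naturally contribute more results than later ones.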
Top results are from OCLC Classify, but note that records, once modified, were not processed again against the other Dewey sources.

Slide 13 - Browsing Dewey Index
The enriched DDC values were added to the browse search of our catalog. This search path is now the most used, more than the author index. While waiting for controlled subject headings, of course.

Slide 14 - Software
The software involved in the project is listed here. Only developers will appreciate this information, so we can move on to the next slide, but not before emphasizing that software libraries, especially open source ones, act like bricks in a building, making it possible to write useful tools.

Slide 15 - A scientific article
The batch mode of this project is described in a scientific article in the current issue of JLIS.it, a peer reviewed academic journal whose editor in chief is professor Mauro Guerrini. It was written by me and my cataloguers (we are very proud of this collaboration), and it does not cover the next part of this presentation, since we developed it after the publication.

Slide 16 - Version 2: The single record mode
The interactive mode was integrated into the ILS, and helps cataloguers to add the DDC number to new records. The basic ideas of the interactive mode are the same as those of the batch mode. However, the seven Dewey sources are accessed asynchronously, i.e. in parallel, which greatly reduces the waiting time.
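To illustrate why parallel access shortens the wait, the following Python sketch queries all sources at the same time and collects one row per source for a result table. It is an analogue written for this text, not the code integrated into Koha; the function name `query_all` and the convention that each source is a `(name, lookup)` pair are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(isbn, sources):
    """Query every (name, lookup) source concurrently and return a list of
    (name, result) rows in the same order as `sources`.
    A lookup that raises an exception produces a None result."""

    def safe_lookup(lookup):
        try:
            return lookup(isbn)
        except Exception:
            return None  # one failing source must not block the others

    names = [name for name, _ in sources]
    lookups = [lookup for _, lookup in sources]
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # map() runs the lookups in parallel and keeps the input order
        results = list(pool.map(safe_lookup, lookups))
    return list(zip(names, results))
```

With this approach the total waiting time is roughly that of the slowest source, instead of the sum of all response times as in a sequential loop.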
Slide 17 - Schema of the single record mode
When the ISBN is entered or, if it is already present, when the "Go" button is pressed, the search starts and the responses are used to build the result table, from which a cataloguer can choose a DDC number by clicking on it. Subfields "a" and "2" of field 082, as well as its indicators, are filled in. In addition, a new occurrence of field 035 is created to store the system number of the record from which the DDC number was copied. This field logically connects two records, one in your catalog and one in the remote catalog, stating that they describe the same resource, while tracing the contribution coming from another catalog.

Slide 18 - Conclusions
More and more bibliographic data are available worldwide through machine interfaces. Big data and linked data are going to heavily change cataloguing modules, OPACs and so on. Catalog enrichment through unique identifiers can help to improve bibliographic and authority records in order to expose them on the net. The most important information to store in a catalog, especially in a linked data environment, consists of standard identifiers and coded information, which make it possible to retrieve more related information.
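The field updates described for Slide 17 can be sketched in Python as follows. The record layout, the function name `apply_choice`, the keys of `choice` and the sample values are assumptions for illustration only; the prefix in parentheses in field 035 follows the common MARC convention of putting the code of the originating organization before the control number.

```python
def apply_choice(record, choice, source_code):
    """Add the chosen DDC number as a new 082 field and trace its origin
    in a new 035 field. `record` is modified in place and returned."""
    fields = record.setdefault("fields", {})
    # 082: indicators copied from the source, $a = number, $2 = edition
    fields.setdefault("082", []).append({
        "ind1": choice["ind1"],
        "ind2": choice["ind2"],
        "subfields": {"a": choice["ddc"], "2": choice["edition"]},
    })
    # 035: system number of the remote record the DDC number came from
    fields.setdefault("035", []).append({
        "ind1": " ",
        "ind2": " ",
        "subfields": {"a": f"({source_code}){choice['control_number']}"},
    })
    return record


# Invented example values, not taken from any real catalog.
record = {"id": 1,
          "fields": {"020": [{"subfields": {"a": "9780306406157"}}]}}
choice = {"ddc": "025.431", "edition": "22", "ind1": "0", "ind2": "4",
          "control_number": "12345"}
updated = apply_choice(record, choice, "OCoLC")
print(updated["fields"]["082"], updated["fields"]["035"])
```

Appending a new occurrence of field 035, rather than overwriting an existing one, keeps any identifiers the record already carries.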