Catalog enrichment: importing Dewey Decimal Classification from external sources (text)

32 ADLUG Meeting - Vitoria-Gasteiz, October 2013
Stefano Bargioni - Pontificia Università della Santa
Croce
Catalog enrichment: importing Dewey Decimal
Classification from external sources
Slide 2 - The project

The project related to this presentation started as a
real need in the library of the Santa Croce Pontifical
University. After migrating to Koha open source ILS
<http://koha-community.org> and adding authority
records for names, name-titles and uniform titles to
the catalog, we studied how to add subject headings.
Although we had been using uncontrolled subject
headings for many years, we realized that this job is
neither simple nor fast.
In the meantime, our catalog needed to be accessible
through other semantic search paths. So we decided to
improve its Dewey Decimal Classification (DDC)
coverage: not manually, of course, but with an
automatic procedure.
Slide 3 - Version 1: The batch mode

Batch mode is the process that scans the catalog and
automatically adds DDC numbers to some records.
This process can span hours or days, depending on
the catalog size and other factors, and is meant to
upgrade old bibliographic records. A similar process
can be applied in interactive mode, for new
bibliographic records. We will discuss it in the last part
of the presentation.

The aim of this process is to enrich the catalog by
adding DDC numbers to records, minimizing or even
excluding human interaction. The process detects
records without a DDC number (field 082) but with an
ISBN (field 020).
The ISBN, a unique identifier, will be used to discover
the corresponding record on other databases where
the DDC is in use. Big catalogs like national libraries
or bibliographies can be good choices.
However, the DDC is not used everywhere at the same
level of quality or uniformity. This is why we defined
some criteria to check each DDC number before
adding it to the catalog.
Slide 4 - An atomic copy cataloguing

Adding a single field to a record is very similar to copy
cataloguing. We named it "atomic copy cataloguing". It
is based on a unique identifier (often the ISBN), and
requires that you are allowed to programmatically
modify single records in your Integrated Library
System (ILS). This can be done if Application
Programming Interfaces (APIs) are available. For
commercial ILSs there can be restrictions or reasons
to avoid this kind of operation.
Since this "atomic copy" is a subset of copy
cataloguing, we can say that its use is allowed thanks
to the same basic ideas and agreements that govern
full copy cataloguing in the international library
community.
Slide 5 - Records to be modified

This slide shows a way to detect the records involved
in the process and to extract their system number and
ISBN. We also chose to filter by the language of the
work, to avoid sending many useless queries to remote
servers that mostly contain records in the language of
their own country.
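
The selection described above can be sketched in a few
lines of Python. This is only an illustration, not the
program we ran: it assumes the catalog has been exported
as MARCXML, that the system number is in control field
001 (your ILS may store it elsewhere), and that the
language is read from positions 35-37 of field 008. The
function name find_candidates is hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"m": MARC_NS}


def find_candidates(marcxml, languages):
    """Yield (system_number, isbn) for records that have an ISBN
    (field 020) but no DDC number (field 082), limited to the
    given set of MARC language codes."""
    root = ET.fromstring(marcxml)
    for rec in root.iter("{%s}record" % MARC_NS):
        # Skip records that already carry a DDC number.
        if rec.find("m:datafield[@tag='082']", NS) is not None:
            continue
        isbn = rec.find("m:datafield[@tag='020']/m:subfield[@code='a']", NS)
        if isbn is None or not (isbn.text or "").strip():
            continue
        # Assumption: the language of the work is taken from 008/35-37.
        f008 = rec.findtext("m:controlfield[@tag='008']", default="", namespaces=NS)
        if f008[35:38] not in languages:
            continue
        # Assumption: the system number is in control field 001.
        sysno = rec.findtext("m:controlfield[@tag='001']", default="", namespaces=NS)
        # Keep only the number, dropping qualifiers such as "(pbk.)".
        yield sysno, isbn.text.split()[0]
```

Calling find_candidates(export, {"ita"}) would list only
the Italian records that are still without field 082.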
Slide 6 - Dewey Sources (I)

OCLC Classify and six national libraries or
bibliographies were selected as Dewey sources. They
must be accessible through a protocol like REST,
Z39.50 or HTTP, and return responses as XML, binary
MARC or HTML. XML is best, because it is structured
data; HTML is worst, because the data carry no
metadata, are mixed with or merged into formatting
code, and are difficult to extract.
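
To show why structured XML is the most convenient
format, here is a minimal sketch that pulls field 082
out of a MARCXML response. It assumes the remote source
answers with a standard MARCXML record; the function
name extract_ddc and the shape of the returned
dictionaries are choices made for this example only.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def extract_ddc(marcxml):
    """Return one dict per 082 field found in a MARCXML response."""
    root = ET.fromstring(marcxml)
    found = []
    for field in root.iter("{%s}datafield" % MARC_NS):
        if field.get("tag") != "082":
            continue
        # Keep the first occurrence of each subfield code.
        subs = {}
        for sf in field:
            subs.setdefault(sf.get("code"), (sf.text or "").strip())
        if "a" in subs:
            found.append({
                "number": subs["a"],         # the DDC number
                "edition": subs.get("2"),    # the DDC edition, if stated
                "ind1": field.get("ind1", " "),
                "ind2": field.get("ind2", " "),
            })
    return found
```

With HTML responses, the same information would have to
be located by matching page layout, which breaks as soon
as the remote site changes its design.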
Slide 7 - Dewey Sources (II): OCLC Classify

Although it is interesting, we can skip a detailed
description of the OCLC Classify service. What matters
is what is highlighted in red:
• it has a machine interface, and its response is in
XML format;
• the ISBN can be used as a search key;
• it contains 36 million records having the DDC.
Slide 8 - Dewey Sources (III): National Libraries

The table shows the six important national libraries
we queried, the language to which we chose to limit
the queries, and the format of the responses.
The Dewey sources were queried in the order used in
this table, with OCLC Classify as the first one.
Note that there are no Spanish sources, and we would
be very interested in finding one, if any exists.

Slide 9 - The logic used in the programs

As usual, we can avoid going into detail about the
programs we wrote. Red color refers to quality control,
as mentioned before, and to the policy we adopted to
avoid overloading. We are going to discuss both.
Slide 10 - Quality check

At the start of the project, our catalog contained DDC
numbers from editions going back as far as the 19th. In
the OCLC Classify or Library of Congress catalogs there
are many DDC numbers that belong to older editions,
that do not specify the edition, or that lack
indicators. We decided to discard DDC numbers
inconsistent with our quality standard. Even though
many of them were discarded, we added 50% more DDC
numbers to our catalog. Less strict criteria would add
more DDC numbers, but of lower quality.
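
A quality check of this kind can be sketched as a small
filter. The exact rules below are assumptions made for
illustration: the threshold MIN_EDITION, the requirement
that both indicators be non-blank, and the pattern used
to recognize a DDC number are not necessarily the
criteria applied in our programs.

```python
import re

MIN_EDITION = 19  # assumption: oldest DDC edition we accept


def accept_ddc(number, edition, ind1, ind2):
    """Return True if a harvested 082 field passes the checks."""
    # Reject fields whose indicators are missing or blank.
    if not (ind1 or "").strip() or not (ind2 or "").strip():
        return False
    # Reject fields with no edition, or an edition that is too old.
    # Subfield $2 may carry extra text, so only leading digits count.
    match = re.match(r"\d+", edition or "")
    if match is None or int(match.group()) < MIN_EDITION:
        return False
    # Ignore segmentation marks ("/") and require a well-formed number.
    return re.fullmatch(r"\d{3}(\.\d+)?", number.replace("/", "")) is not None
```

Lowering MIN_EDITION or dropping the indicator test
would accept more numbers, at the cost of quality.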
Slide 11 - Delay while searching sources

When running, programs can query Dewey targets at a
very high rate. This can overwhelm the remote server,
especially if it is also accessed by other users.
Furthermore, if records of your catalog are
continuously modified, their indexing can overload
your ILS.
To avoid that, and to make sure you respect the access
policies sometimes set by remote catalogs, we defined
a delay of 5-6 seconds between queries. As a
consequence, the harvesting process became slower
and in some cases lasted more than one day.
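
The delay can be sketched as a simple throttled loop.
This is an illustration under stated assumptions: the
function harvest and its lookup parameter are
hypothetical names, and choosing a random pause inside
the 5-6 second range is our reading of "5-6 seconds".

```python
import random
import time


def harvest(isbns, lookup, min_delay=5.0, max_delay=6.0, sleep=time.sleep):
    """Call lookup(isbn) for each ISBN, pausing between queries.

    `lookup` is any function that queries a remote source.
    `sleep` can be replaced when testing, to avoid real waits.
    """
    results = {}
    for position, isbn in enumerate(isbns):
        if position > 0:
            # Pause before every query except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results[isbn] = lookup(isbn)
    return results
```

At an average of 5.5 seconds per query, about 15,700
queries fit in 24 hours, which is why a large batch can
take more than one day.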
Slide 12 - Statistics

Programs were written to log their operations. Log
files allowed us to build this table and some graphics.
We analyzed useful information about the harvested
data and about DDC use in each catalog.

Top results are from OCLC Classify, but we have to
remark that records, once modified, were not processed
again against the other Dewey sources.
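
The effect of this "first hit wins" rule on the
statistics can be seen in a short sketch. The names
classify_all and sources are hypothetical, and each
source is represented by a plain function that returns
a DDC number or None.

```python
from collections import Counter


def classify_all(isbns, sources):
    """Query sources in order and stop at the first one that answers.

    `sources` is an ordered list of (name, lookup) pairs.
    Returns the assigned numbers and a per-source tally.
    """
    assigned = {}
    stats = Counter()
    for isbn in isbns:
        for name, lookup in sources:
            ddc = lookup(isbn)
            if ddc:
                assigned[isbn] = (name, ddc)
                stats[name] += 1
                break  # later sources are never asked about this ISBN
        else:
            stats["not found"] += 1
    return assigned, stats
```

Because the loop stops at the first answer, a source
placed early in the list collects hits that later
sources might also have supplied, so the tally reflects
query order as well as coverage.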
Slide 13 - Browsing Dewey Index

The enriched DDC values were added to our catalog
browse search. This search path is now the most used,
more than the author index. That holds while we wait
for controlled subject headings, of course.
Slide 14 - Software

The software involved in the project is listed here.
Only developers can appreciate this information, so we
can switch to the next slide, but not before
emphasizing that software libraries, especially open
source ones, act like bricks in a building, allowing us
to write useful tools.
Slide 15 - A scientific article

The batch mode of this project is described in a
scientific article in the current issue of JLIS.it, a
peer-reviewed academic journal whose editor in chief is
professor Mauro Guerrini.
It was written by me and my cataloguers (we are very
proud of this collaboration), and does not deal with
the next part of this presentation, since we developed
it after the publication.
Slide 16 - Version 2: The single record mode

The interactive mode was integrated in the ILS, and
helps cataloguers to add the DDC number to new
records.
The basic ideas of the interactive mode are the same
as those of the batch mode. However, the seven Dewey
sources are accessed asynchronously, in parallel,
greatly reducing the delay.
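
Parallel access can be sketched with a thread pool. This
is an illustration rather than the code integrated in
our ILS: query_in_parallel and the sources dictionary
are hypothetical names, and each source is represented
by a function that returns a DDC number or None.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_in_parallel(isbn, sources):
    """Ask every source about one ISBN at the same time.

    `sources` maps a source name to a lookup function.
    Returns a sorted list of (source_name, ddc) rows.
    """
    rows = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {
            pool.submit(lookup, isbn): name
            for name, lookup in sources.items()
        }
        for future in as_completed(futures):
            try:
                ddc = future.result()
            except Exception:
                ddc = None  # a failing source must not block the others
            if ddc:
                rows.append((futures[future], ddc))
    return sorted(rows)
```

The total wait is then close to that of the slowest
source, instead of the sum of all the response times.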

Slide 17 - Schema of the single record mode

When the ISBN is entered or, if it is already present,
when the "Go" button is pressed, the search starts and
the responses are used to compose the result table,
from which a cataloguer can choose a DDC number by
clicking on it.
Subfields "a" and "2", as well as the indicators of
field 082, are filled in. Also, a new occurrence of
field 035 is created to store the system number of the
record from which the DDC number was copied. This tag
logically connects two records, one in your catalog
and one in the remote catalog, saying that they
describe the same resource, while tracing the
contribution coming from another catalog.
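
The two fields written after the cataloguer's choice can
be sketched as MARCXML elements. The function names, the
keys of the choice dictionary, and the
"(source)number" form of 035 subfield "a" are
assumptions made for this example; they follow common
MARC practice, not necessarily our exact implementation.

```python
import xml.etree.ElementTree as ET


def datafield(tag, ind1, ind2, subfields):
    """Build a MARCXML datafield from (code, value) pairs."""
    field = ET.Element("datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        ET.SubElement(field, "subfield", {"code": code}).text = value
    return field


def fields_for_choice(choice):
    """Return the 082 and 035 fields for the chosen DDC number."""
    ddc = datafield(
        "082", choice["ind1"], choice["ind2"],
        [("a", choice["number"]), ("2", choice["edition"])],
    )
    # Assumption: the remote system number is prefixed by a code
    # identifying the source catalog, in parentheses.
    link = datafield(
        "035", " ", " ",
        [("a", "(%s)%s" % (choice["source"], choice["remote_id"]))],
    )
    return ddc, link
```

Keeping the remote system number in field 035 makes it
possible to return later to the record that supplied
the DDC number.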
Slide 18 - Conclusions

More and more bibliographic data are available
worldwide through machine interfaces. Big data and
linked data are going to heavily change cataloguing
modules, OPACs and so on. Catalog enrichment, through
unique identifiers, can help to improve bibliographic
and authority records and to expose them on the net.
The most important pieces of information to store in a
catalog, especially in a linked data environment, are
standard identifiers and coded information, since they
make it possible to retrieve more related information.