CatConf2001

I name thee Bay of Pe(a)rls : some practical virtues of Perl for cataloguers
Jenny Quilliam

Abstract

With the increasing numbers of aggregated electronic resources, libraries now tend to ‘collect in batches’.
These aggregated collections may not be permanent and are subject to frequent and significant content
changes. One survival strategy for cataloguers is to ‘catalogue in batches’. While some publishers and vendors
are now supplying files of MARC records for their aggregated resources, these often need to be adapted by
libraries to include local authentication and access restriction information.

Perl (Practical Extraction and Report Language) is an easy-to-learn programming language which was
designed to work with chunks of text: extracting, pattern matching and replacing, and reporting. MARC records
are just long strings of highly formatted text, and scripting with Perl is a practical way to edit fields, add local
information, change subfields, delete unwanted fields and so on: any find-and-replace or insert operation for
which the algorithm can be defined.

As cataloguers are already familiar with MARC coding and can define the algorithms, learning a bit of Perl
means that they can easily add a few strings of Perls to their repertoire of skills.

Introduction

In reviewing the literature on current and future roles for cataloguers, two major themes emerge: cataloguers
need to be outcomes focussed, and new competencies are required to address the challenges of providing
bibliographic access control for remote-access online resources.

Electronic resources – primarily fulltext electronic journals and fulltext aggregated databases – have
significantly improved libraries’ ability to deliver content to users regardless of time and distance. Integrated
access means that the library catalogue must reflect all the resources that can be accessed, especially those that
are just a few clicks away. Macro cataloguing approaches are needed to deal with the proliferation of electronic
resources and the high maintenance load caused by the content volatility, both long-term and temporary,
of these resources.

In the United States, the Federal Library and Information Center Committee’s Personnel Working Group
(2001) is developing Knowledge, Skills and Abilities statements for its various professional groups. For
Catalogers, it has identified abilities including:

• Ability to apply cataloging rules and adapt to changing rules and guidelines
• Ability to accept and deal with ambiguity and make justifiable cataloguing decisions in the absence of
clear-cut guidelines
• Ability to create effective cataloging records where little or no precedent cataloguing exists

Anderson (1999) argues that, without decrying the importance of individual title cataloguing, macro-cataloguing
approaches to managing large sets of records are essential. Responsibility for managing quality
control, editing, loading, maintaining and unloading requires the “Geek Factor”. In a column which outlined
skills required for librarians to manage digital collections, Tennant (1999) observed that while digital librarians
do not need to be programmers, it is useful to know one’s way around a programming language and while the
specific languages will vary a “general purpose language such as Perl can serve as a digital librarian’s Swiss
Army knife – something that can perform a variety of tasks quickly and easily”.

What is Perl and why is it useful?
Perl is the acronym for Practical Extraction and Report Language. It is a high-level interpreted language
optimized for scanning arbitrary text files and extracting, manipulating and reporting information from those
text files. Unpacking this statement:
• high-level = humans can read it
• interpreted = doesn’t need to be compiled and is thus easier to debug and correct
• text capabilities = Perl handles text in much the same way as people do

Perl is a low-cost (in fact free) scripting language with very generous licensing provisions. To write a Perl script
all you need is a text editor, e.g. Notepad or Arachnophilia, as Perl scripts are just plain text files.

Perl is an outcomes focussed programming language: the ‘P’ in Perl means practical, and it is designed to get
things done. This means that it is complete, easy to use and efficient. Perl uses sophisticated pattern-matching
techniques to scan large amounts of data very quickly and it can do tasks that in other programming languages
would be more complex, take longer to write, debug and test. There are often many ways to accomplish a task
in Perl.

Perl is optimized for text processing, and this is precisely what is required in creating, editing and otherwise
manipulating MARC records. A word of caution: while Perl is more forgiving than many other programming
languages, there is a structure and syntax to be observed – in many ways familiar territory to cataloguers who
deal with AACR2R and MARC rules, coding and syntax.

Resources for learning Perl

There are many how-to books on Perl. If you have no previous programming knowledge, two introductory texts
are Paul Hoffman’s Perl 5 for dummies or Schwartz & Christiansen’s Learning Perl. Both are written in a
gentle tutorial style, with comprehensive indexes and detailed tables of contents. Another useful resource is the
Perl Cookbook, which contains around 1000 how-to-recipes for Perl – giving firstly the quick answer followed
by a detailed discussion of the answer to the problem.

For online resources, an Internet search on the phrase ‘Perl tutorial’ yields pages of results. Two examples of
beginner level tutorials are Take 10mins to learn Perl and Nik Silver’s Perl tutorial.

How much Perl is needed to manipulate MARC records?

The good news is “not a lot” – there are a number of tools available to deal with the more challenging
intricacies of the MARC format – the directory structure and offsets, field and subfield terminators etc. These
MARC editing tools (discussed below) allow you to deal with MARC records in a tagged text format rather
than the single string. Not only is a tagged text format much easier to read (for humans) but it can be easily
updated and manipulated using simple Perl scripts.
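
To give a sense of what these tools take care of, the following is a minimal sketch of how a raw MARC record is laid out and how it can be unpacked into tagged text. It is not the code used by any of the tools discussed in this paper. The subroutine names (build_record, parse_record, to_tagged) are invented for the illustration, the ‘|’ character stands in for the subfield delimiter, and the sketch assumes plain ASCII data so that character counts equal byte counts.

```perl
use strict;
use warnings;

# Structural characters of the MARC record format
my $FT = "\x1E";    # field terminator
my $RT = "\x1D";    # record terminator
my $SF = "\x1F";    # subfield delimiter

# Build a small raw record by hand so the example needs no input file.
# Each argument is a [tag, data] pair.
sub build_record {
    my @fields = @_;
    my ($directory, $data) = ('', '');
    for my $field (@fields) {
        my ($tag, $body) = @$field;
        $body .= $FT;
        # directory entry: 3-character tag, 4-digit length, 5-digit start offset
        $directory .= $tag . sprintf('%04d%05d', length($body), length($data));
        $data .= $body;
    }
    $directory .= $FT;
    my $base   = 24 + length($directory);      # the leader is always 24 characters
    my $reclen = $base + length($data) + 1;    # + 1 for the record terminator
    my $leader = sprintf('%05d', $reclen) . 'nas' . (' ' x 2) . '22'
               . sprintf('%05d', $base) . ' a ' . '4500';
    return $leader . $directory . $data . $RT;
}

# Unpack a raw record using the leader and the directory offsets.
sub parse_record {
    my ($raw) = @_;
    my $leader = substr($raw, 0, 24);
    my $base   = substr($leader, 12, 5) + 0;   # base address of data
    my $dir    = substr($raw, 24, $base - 25); # directory without its terminator
    my @fields;
    while ($dir =~ /(.{3})(\d{4})(\d{5})/g) {
        my ($tag, $len, $start) = ($1, $2, $3);
        my $body = substr($raw, $base + $start, $len);
        $body =~ s/\x1E\z//;                   # strip the field terminator
        push @fields, [$tag, $body];
    }
    return ($leader, \@fields);
}

# Render one field as a line of tagged text.
sub to_tagged {
    my ($tag, $body) = @_;
    return "$tag $body" if $tag lt '010';      # control fields: no indicators or subfields
    my $indicators = substr($body, 0, 2);
    (my $subfields = substr($body, 2)) =~ s/\x1F/|/g;
    return "$tag $indicators$subfields";
}

my $raw = build_record(
    ['001', 'jaq00-05205'],
    ['022', (' ' x 2) . "${SF}a0945-7517"],
    ['245', "00${SF}aAsia Pacific Journal of Marketing & Logistics${SF}h[electronic journal]"],
);
my ($leader, $fields) = parse_record($raw);
print "LDR $leader\n";
print to_tagged(@$_), "\n" for @$fields;
```

The point of the sketch is that every field edit changes the lengths and offsets recorded in the directory and the leader, which is why it is far more comfortable to edit tagged text and let a conversion utility rebuild the record.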

Certainly, to create a useful Perl script you need to learn how to open files for reading and writing, and
something about control structures, conditionals, and pattern matching and substitution.
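
As a rough illustration of how little that is, the sketch below uses each of these ingredients once: opening filehandles, a loop, a conditional, a pattern match and a substitution. The subroutine name edit_lines is invented for the example, and in-memory filehandles are used only so that the sketch runs without any input file. In a working script the two opens would name real files.

```perl
use strict;
use warnings;

# Read tagged-text lines from $in, edit them, and write them to $out.
sub edit_lines {
    my ($in, $out) = @_;
    my $changed = 0;
    while (my $line = <$in>) {               # control structure: loop over every line
        if ($line =~ /^245/) {               # conditional and pattern match: title field only
            # substitution: swap the GMD, counting the lines that were changed
            $changed++ if $line =~ s/\[computer file\]/[electronic journal]/;
        }
        print {$out} $line;                  # every line is written, edited or not
    }
    return $changed;
}

# In-memory filehandles keep the demonstration self-contained.
my $input = "245 00|aHabitat Australia|h[computer file].\n"
          . "650 0|aNatural resources|zAustralia.\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Can't read input: $!";
open(my $out, '>', \$output) or die "Can't write output: $!";
my $count = edit_lines($in, $out);
close($in);
close($out);

print $output;
print "Fields changed: $count\n";
```

Everything in the appendices to this paper is a variation on this loop with a few more conditions and substitutions.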

MARC record tools

There are a range of MARC editing tools available for use and the Library of Congress maintains a listing of
MARC Specialized Tools at: http://lcweb.loc.gov/marc/marctools.html

MARCBreaker is a Library of Congress utility for converting MARC records into an ASCII text file format. It
has a complementary utility, MARCMaker, which can then be used to reformat from this file format into
MARC records. The current version only runs under DOS and Windows 95/98. There is also a companion
utility to MARCBreaker/MARCMaker, MarcEdit, developed by Terry Reese (2001). MarcEdit is currently in
version 3.0 and has a number of useful editing features including global field addition and deletion.

Simon Huggard and David Groenewegen, in their paper ‘E-data management: data access and cataloguing
strategies to support the Monash University virtual library’, outline the use of MARCBreaker and MARCMaker
to edit record sets for various database aggregates. The Virtual Library of Virginia (VIVA) has also used
MARCMaker together with the MARC.pm module to convert and manipulate MARC records for electronic
texts.

MARC.pm is a Perl 5 module for preprocessing, converting and manipulating MARC records. SourceForge
maintains an informative website for MARC.pm that includes documentation with a few examples. It is a
comprehensive module that can convert from MARC to ASCII, HTML, and XML and includes a number of
‘methods’ with options to create, delete and update MARC fields. Using MARC.pm requires a reasonable
knowledge of Perl and general programming constructs. MARC.pm is used by the JAKE project to create
MARC records. Michael Doran, University of Texas at Arlington, uses MARC.pm together with Perl scripts to
preprocess MARC records for netLibrary. A description of this project can be found at:
http://rocky.uta.edu/doran/preprocess/process.html

marc.pl is a Perl module written by Steve Thomas from Adelaide University. It is a utility for extracting records
from a file of MARC records, and for converting records in either direction between standard MARC format and
a tagged text representation. One of the best features of this utility is the ability to
add tags globally to each record by the use of a globals tagged text file. The marc.pl utility with documentation
is available for download at: www.library.adelaide.edu.au/~sthomas/scripts/

It uses command line switches to specify the output format and options to include a global file or skip records.
By default, marc.pl creates serial format MARC records, Leader ‘as’ so it is particularly suited to creating
records for electronic journals in aggregated databases and publisher collections. The tagged text format
required by marc.pl is simple – each field is on a separate line, the tag and indicator information is separated by
a space and subfields are terminated with a single dagger delimiter. Records are separated by a blank line.
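
Because the format is this regular, a few lines of Perl are enough to read a tagged text file back into records and fields. The sketch below is based only on the description above and is not taken from marc.pl itself. The name read_tagged is invented, the ‘|’ character is assumed as the subfield delimiter (as printed in the appendices), and lines that do not begin with a three-digit tag are simply skipped.

```perl
use strict;
use warnings;

# Read tagged text from a filehandle and return a list of records.
# Each record is a reference to a list of { tag => ..., content => ... } fields.
sub read_tagged {
    my ($fh) = @_;
    local $/ = '';                       # paragraph mode: a blank line ends a record
    my @records;
    while (my $chunk = <$fh>) {
        my @fields;
        for my $line (split /\n/, $chunk) {
            # three-digit tag, one space, then indicators and subfields
            next unless $line =~ /^(\d{3}) (.*)$/;
            push @fields, { tag => $1, content => $2 };
        }
        push @records, \@fields if @fields;
    }
    return @records;
}

# Two short records, separated by a blank line.
my $text = join "\n",
    '022 |a0945-7517',
    '245 00|aAsia Pacific Journal of Marketing & Logistics|h[electronic journal]',
    '',
    '022 |a0310-2939',
    '245 00|aHabitat Australia|h[electronic journal]',
    '';

open(my $fh, '<', \$text) or die "Can't read tagged text: $!";
my @records = read_tagged($fh);
close($fh);

for my $record (@records) {
    my ($title) = grep { $_->{tag} eq '245' } @$record;
    print scalar(@$record), " fields; title field: $title->{content}\n";
}
```

Setting the input record separator to the empty string puts Perl into paragraph mode, so each read returns one whole record rather than one line.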

To use marc.pl it is helpful to know what Perl is and this is why I first dived [paddled is probably a more
accurate verb] into the world of Perl. Once in though, it is easy to learn enough to write simple Perl scripts.

Scenarios for Perl scripting with MARC records

Three scenarios where Perl scripting is used for cataloguing purposes:
• Creating brief MARC records from delimited titles lists
• Editing vendor-supplied MARC record files to adapt for local requirements
• Deriving MARC records for ejournals based on the print version.

The Final report of the Program for Cooperative Cataloging’s Task Group on Journals in Aggregator Databases
(2000) provides a useful checklist of appropriate tags and content when scripting to either create or derive
MARC records. It lists proposed data elements for both machine-generated and machine-derived (i.e. from
existing print records) aggregator analytics records.

Depending on whether there is an existing file of MARC records, the record creation/manipulation process
steps are:

1. Convert from MARC to tagged text using marc.pl, or capture the vendor’s delimited titles, ISSN and coverage file
2. Edit tagged text using a locally written Perl script
3. Create a globals tagged text file for fields, including a default holdings tag, to be added to each record
4. Convert from tagged text to MARC using marc.pl
5. Load resulting file of MARC records to the library system

Creating brief MARC records from delimited titles lists

When no MARC record set exists for an aggregated database, Perl scripts are used to parse delimited titles,
ISSN, coverage and URL information into MARC tagged text. The resulting tagged text file is then formatted
to MARC incorporating a global tagged text file using marc.pl to create a set of records.

In brief, all the Perl script has to do is to open the input file for reading, parse the information into the
appropriate fields, format it as tagged text and write the tags to an output file. This approach has been used to
create records for several databases including IDEAL, Emerald, Dow Jones Interactive and BlackwellScience.
For some publisher databases, fuller records with subject access have been created by adding one or more
subject heading terms for each title in the delimited titles file.
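
The subject-heading variation might look something like the sketch below. The file layout is an assumption made for the illustration: a fifth tab-delimited column holding subject terms separated by semicolons. The subroutine name subject_fields, the choice of 650 with second indicator 0, and the two sample terms are likewise invented rather than taken from the UniSA scripts or data.

```perl
use strict;
use warnings;

# Turn a semicolon-separated list of subject terms into 650 tagged-text lines.
sub subject_fields {
    my ($subjects) = @_;
    return () unless defined $subjects && length $subjects;
    my @lines;
    for my $term (split /\s*;\s*/, $subjects) {
        next unless length $term;
        $term .= '.' unless $term =~ /[.)]$/;   # add closing punctuation if missing
        push @lines, "650  0|a$term";
    }
    return @lines;
}

# One line of a titles file, with a hypothetical fifth column of subject terms.
my $line = join "\t",
    '0945-7517',
    'Asia Pacific Journal of Marketing & Logistics',
    '1998',
    'http://www.emeraldinsight.com/094-57517.htm',
    'Marketing; Business logistics';

my ($issn, $title, $coverage, $url, $subjects) = split /\t/, $line;
print "$_\n" for subject_fields($subjects);
```

A title with no terms in the extra column simply produces no 650 lines, so the same script can serve files with and without subject data.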

Appendix 1 shows the simple Perl script written to process Emerald records. Appendix 2 shows an example of
the resulting tagged text together with the global file used for Emerald.

Editing Vendor-supplied MARC records

Database vendors now make available files of records for their various aggregated databases. EBSCO
Publishing had undertaken a pilot project for the PCC Task Group on Aggregator Databases to derive records
for aggregated databases and their records are freely available to subscribers. When the University of South
Australia subscribed to the Ebsco MegaFile offer in late 1999, the availability of full MARC records was
regarded as a definite advantage. However these records required preprocessing to include UniSA-specific
information, change the supplied title level URLs to incorporate Digital Island access, and add a second URL
for off campus clients. Additional edits include changing GMD from [computer file] to [electronic journal] and
altering subject headings form subdivision coding from ‘x’ to ‘v’. Again to enable bulk deletion for
maintenance purposes, a tag to create a default holding was required.

The Perl scripts for these files do string pattern matching and substitution, or [semi-global] find-and-replace
operations. In many cases, these changes could be done with a decent text editor with find/replace capabilities,
and if dealing with the records on a one-off basis this is a practical process. However aggregator databases are
notoriously volatile – changing content frequently – and hence the record sets need to be deleted and new files
downloaded from the vendor site, edited and loaded to the library system. So it’s worth spending a little time to
write a custom Perl editing script. Appendix 3 shows a script to edit Ebsco-sourced records.
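
Since the same edits are repeated on every fresh download, one option is to hold them in a small table of rules so that an edit can be added or changed without touching the loop. The sketch below is an alternative arrangement of the substitutions used for the Ebsco records, not the script in Appendix 3. The names @rules and edit_line are invented, the sample input lines (including the URL) are made up for the demonstration, and the pattern for French subject headings simply mirrors the one in Appendix 3.

```perl
use strict;
use warnings;

# Each rule: which lines it applies to, what to find, and the replacement text.
my @rules = (
    [ qr/^245/, qr/computer file/, 'electronic journal'  ],   # GMD
    [ qr/^65/,  qr/xPeriodicals/,  'vPeriodicals'        ],   # form subdivision coding
    [ qr/^856/, qr/search\.epnet/, 'search.global.epnet' ],   # Digital Island address
);

# Return the edited line, or undef if the line is to be dropped.
sub edit_line {
    my ($line) = @_;
    return undef if $line =~ /^650 6/;        # French subject headings are deleted
    for my $rule (@rules) {
        my ($field, $find, $replace) = @$rule;
        $line =~ s/$find/$replace/g if $line =~ $field;
    }
    return $line;
}

# Sample lines, invented for the demonstration.
my @sample = (
    '245 00|aHabitat Australia|h[computer file].',
    '650 0|aConservation of natural resources|xPeriodicals.',
    '650 6|aRessources naturelles.',
    '856 40|uhttp://search.epnet.com/example',
);

for my $line (@sample) {
    my $edited = edit_line($line);
    print "$edited\n" if defined $edited;
}
```

When the vendor changes its record format, only the rule table needs to be revisited.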

Until mid-2000, Ebsco did not include publisher’s embargo periods in their MARC records but maintained a
separate embargoes page – hence further scripting to incorporate this information was needed. Vendor MARC
records are also available for the Gale and Proquest databases.

A variation of this process is also used to preprocess netLibrary MARC records: adding a default holding and a
second remote authentication URL, and editing the GMD.

Deriving MARC records for ejournals from print records
The third scenario where Perl scripts are used with MARC records is deriving records for the electronic version
from existing records. At UniSA we have reworked existing MARC records for print titles to create ejournal
records for APAIS FullText. No records were available as ejournals and as we already had print records for a
majority of titles, it was decided to rework these records into ejournal records. Title, ISSN and coverage
information was captured from the Informit site and edited into a spreadsheet. During the pre-subscription
evaluation process, APAIS FullText titles had been searched against the UniSA catalogue and bibkeys of existing
records noted. MARC records for these titles were exported from the catalogue as tagged text. For the titles not
held at UniSA, bibliographic records were captured to file from Kinetica and then converted to tagged text. The
ISSN and coverage data was also exported in tab-delimited format from the spreadsheet. By matching on ISSN,
the fulltext coverage information could be linked to each title and incorporated into the MARC record.
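
The ISSN match is a natural fit for a Perl hash. The sketch below shows one way the linking step could be done and is not the script actually used. The names load_coverage and record_issn are invented, the tab-delimited layout (ISSN, then coverage) is assumed, and the sample values are those of the Habitat Australia record in Appendix 4.

```perl
use strict;
use warnings;

# Build a lookup table, ISSN => coverage statement, from tab-delimited lines.
sub load_coverage {
    my (@lines) = @_;
    my %coverage;
    for my $line (@lines) {
        chomp $line;
        my ($issn, $cover) = split /\t/, $line;
        next unless defined $cover;          # skip lines without a coverage column
        $coverage{$issn} = $cover;
    }
    return %coverage;
}

# Pull the ISSN out of a record's 022 field, if it has one.
sub record_issn {
    my ($record) = @_;
    return $record =~ /^022[^|\n]*\|a(\d{4}-\d{3}[\dXx])/m ? $1 : undef;
}

my %coverage = load_coverage("0310-2939\tVol. 24- (June 1996-)\n");

my $record = join "\n",
    '022 0 |a0310-2939',
    '245 00|aHabitat Australia|h[electronic journal].';

# Add an 856 field only when the record's ISSN appears in the coverage table.
my $issn = record_issn($record);
if (defined $issn && exists $coverage{$issn}) {
    $record .= "\n856 41|zSelected fulltext available: $coverage{$issn}."
             . "|uhttp://www.informit.com.au";
}
print "$record\n";
```

Records whose ISSN is missing from the table are left unchanged, which makes them easy to pick out for manual checking.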

The records were edited following the PCC’s (2000) proposed data elements for machine-derived records –
deleting unwanted fields, adding and editing fields as needed. A globals file was used to add tag 006 and 007
data, tag 530 additional physical form note, a 590 local access information note, a 773 Host-item entry for the
database, a 710 for the vendor Informit and a default local holdings tag.

The Perl script to process records is longer than the earlier examples but no more complex – it just does more
deleting, updating and reworking. Appendix 4 shows an example of a print record for APAIS Fulltext – the
original print form, the edited form, the globals file and the final record as an ejournal.
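
A cut-down sketch of this kind of reworking is given below. It is an illustration rather than the script used at UniSA. The list of fields to drop is inferred by comparing the print and edited records in Appendix 4, the name derive_ejournal is invented, and the sketch leaves aside the new 001, the 856 and the fields added from the globals file.

```perl
use strict;
use warnings;

# Fields present in the print record of Appendix 4 but absent from the edited record.
my %drop = map { $_ => 1 } qw(035 040 043 259 300 984);

# Rework the tagged text of one print record into the basis of an ejournal record.
sub derive_ejournal {
    my ($print_record) = @_;
    my @out;
    my $skipping = 0;
    for my $line (split /\n/, $print_record) {
        next if $line eq 'EndRecord';
        if ($line =~ /^(LDR|\d{3}) /) {
            $skipping = $drop{$1} ? 1 : 0;      # a new field starts here
        }
        next if $skipping;                      # also skips wrapped lines of a dropped field
        $line =~ s/\$(?=[a-z0-9])/|/g;          # change the subfield delimiter from $ to |
        if ($line =~ /^245/) {
            $line =~ s/\.?$/|h[electronic journal]./;   # add the GMD to the title
        }
        push @out, $line;
    }
    return join("\n", @out) . "\n";
}

# A few fields from the print record in Appendix 4.
my $print = join "\n",
    '022 0 $a0310-2939',
    '035 $a(atABN)2551638',
    '245 00 $aHabitat Australia.',
    '300 $av. :$bill. (some col.), maps ;$c28 cm.',
    '650 0 $aConservation of natural resources$zAustralia.',
    'EndRecord';

print derive_ejournal($print);
```

The spacing between indicators and the first subfield is left as exported. Any tidying of that spacing would depend on what the conversion utility expects.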

Conclusion

While Perl is currently mostly used to deal with the challenges of providing and maintaining MARC records
for electronic resources, scripts are also used to post-process original cataloguing for all formats for batch
uploading to Kinetica. The uses of Perl in the cataloguer’s toolkit can be many and varied – it is a not-so-little
language that can and does! And it’s fun!

Appendix 1 – Perl script to edit Emerald titles file
#!/usr/local/bin/perl
# Script to edit Emerald tab-delimited title file into tagged text
# Entries contain ISSN, Title, Coverage and specific URL
# Written: Jenny Quilliam Revised: August 2001
# Command line >perl Emerald_RPA.pl [INPUT FILE] [OUTPUT FILE]
#
#################################################################################

$TheFile = shift;
$OutFile = shift;

open(INFILE, $TheFile) or die "Can't open Input\n";
open(OUTFILE, ">$OutFile") or die "Can't open Output\n";

# control structure to read and process each line from the input file
while (<INFILE>)
{
    s/"//g ; #deleting any quote marks from the string
    $TheLine = $_ ;

    chomp($TheLine);

    #parsing the contents at the tab delimiters to populate the variables
    ($ISSN, $Title, $Coverage, $URL) = split(/\t/, $TheLine);

    #printing out blank line between records
    print OUTFILE "\n";

    # processing ISSN
    print OUTFILE "022 |a$ISSN\n" ;

    # processing Title - fixing filing indicators
    # checking for leading The in Title
    if($Title =~ /^The /)
        {print OUTFILE "245 04|a$Title|h[electronic journal]\n"; }
    else
        {print OUTFILE "245 00|a$Title|h[electronic journal]\n";}

    # processing to generate URL tag with Coverage info
    print OUTFILE "856 40|zFulltext from: $Coverage.";
    print OUTFILE " This electronic journal is part of the Emerald database.";
    print OUTFILE " Access within University network.|u$URL\n";

    # adding generic RPA URL link to all records
    print OUTFILE "856 41|zAccess outside University network.";
    print OUTFILE
        "|uhttp://librpa.levels.unisa.edu.au/rpa/webauth.exe?rs=emerald\n";
}

close(INFILE);
close(OUTFILE);

Appendix 2 – Global and example of tagged text for Emerald titles
006 m d
007 cr cn-
008 001123c19uu9999enkuu p 0 a0eng d
040 |aSUSA|beng|cSUSA
260 |a[Bradford, England :|bMCB University Press.]
530 |aOnline version of the print publication.
590 |aAvailable to University of South Australia staff and students. Access is
by direct login from computers within the University network or by authenticated
remote access. Articles available for downloading in PDF and HTML formats.
773 0 |tEmerald
991 |cEJ|nCAE|tNFL
___________________________________________________________________________
001 jaq00-05205
245 00|aAsia Pacific Journal of Marketing & Logistics|h[electronic journal]
022 |a0945-7517
856 40|zFulltext from: 1998. This electronic journal is part of the Emerald
library database. Access within University network.
|uhttp://www.emeraldinsight.com/094-57517.htm
856 41|zAccess outside University network.
|uhttp://librpa.levels.unisa.edu.au/rpa/webauth.exe?rs=emerald

Appendix 3 – Perl script to edit Ebsco sourced records
#!/usr/local/bin/perl
#
# Author: Jenny Quilliam November 2000
#
# Program to edit EbscoHost records [as converted to text using marc.pl]
# GMD to be altered to: electronic journal
# Form subfield coding to be altered to v
# French subject headings to be deleted
# Fix URL to incorporate Digital Island access
# Command line string, takes 2 arguments:
# Command line: mlx> perl EHedit.pl [input filename] [output filename]
#############################################################################

$TheFile = shift;
$OutFile = shift;
open(INFILE, $TheFile) or die "Can't open input\n";
open(OUTFILE, ">$OutFile") or die "Can't open output\n";
while (<INFILE>)
{
    $TheLine = $_ ;
    # processing selected lines only

    # editing the GMD in the 245 field from [computer file] to [electronic journal]
    if($TheLine =~ /^245/) { $TheLine =~ s/computer file/electronic journal/g;}

    # editing subject headings to fix form subdivision subfield character
    if($TheLine =~ /^65/) { $TheLine =~ s/xPeriodicals/vPeriodicals/g;}

    # editing out French subject headings
    if($TheLine =~ /^650 6/) {next}

    # editing URL to add .global to string for Digital Island address
    if($TheLine =~ /^856/) {$TheLine =~ s/search\.epnet/search.global.epnet/g ;}

    # echoing each line to the screen as well as writing it to the output file
    print $TheLine;
    print OUTFILE $TheLine;
}
close(INFILE);
close(OUTFILE);

Appendix 4 – APAIS FullText examples
Print record
LDR 00824nas 2200104 a 4500
001 dup91000065
008 820514c19739999vrabr p 0 0 0eng d
022 0 $a0310-2939
035 $a(atABN)2551638
035 $u145182
040 $dSIT$dSCAE
043 $au-at---
082 0 $a639.9$219
245 00 $aHabitat Australia.
259 00 $aLC$bP639.9 H116$cv.2, no.1 (Mar. 1974)-
260 01 $aHawthorn, Vic. :$bAustralian Conservation Foundation,$c1973-
300 $av. :$bill. (some col.), maps ;$c28 cm.
362 0 $aVol. 1, no. 1 (June 1973)-
580 $aAbsorbed Peace magazine Australia. Vol 15, no. 4 (Aug. 1987)
650 0 $aNatural resources$xResearch$zAustralia.
650 0 $aConservation of natural resources$zAustralia.
710 20 $aAustralian Conservation Foundation.
780 05 $tPeace magazine Australia$x0817-895X
984 $a2036$cCIT PER 304.2 HAB v.1 (1973)-$cUND PER 304.2 HAB v.1 (1973)-$cMAG
PER 333.9506 H116 v.1 (1973)-$cSAL PER 333.705 H11 v.1 (1973)-
EndRecord

Edited record

LDR 00824nas 2200104 a 4500
001 jaq01-0607
008 820514c19739999vrabr p 0 0 0eng d
022 0 |a0310-2939
082 0 |a639.9|219
245 00|aHabitat Australia|h[electronic journal].
260 |aHawthorn, Vic. :|bAustralian Conservation Foundation,|c1973-
362 0 |aVol. 1, no. 1 (June 1973)-
580 |aAbsorbed Peace magazine Australia. Vol 15, no. 4 (Aug. 1987)
650 0|aNatural resources|xResearch|zAustralia.
650 0|aConservation of natural resources|zAustralia.
710 2 |aAustralian Conservation Foundation.
780 05|tPeace magazine Australia|x0817-895X
856 41|zSelected fulltext available: Vol. 24- (June 1996-) .Access via Australian
public affairs full text.|uhttp://www.informit.com.au
991 |cEJ|nCAE|tNFL

Globals file
006 m d
007 cr anu
040 |aSUSA
530 |aOnline version of the print title.
590 |aAvailable to University of South Australia staff and students. Access is
by direct login from computers within the University network or by login and
password for remote users. File format and amount of fulltext content of journals
varies.
710 2 |aInformit.
773 0 |tAustralian public affairs full text|dMelbourne, Vic. : RMIT Publishing,
2000-.
991 |cEJ|nCAE|tNFL

References

Anderson, B., 1999, ‘Cataloging issues’ paper presented to Technical Services Librarians: the training we
need, the issues we face, PTPL Conference 1999. http://www.lib.virginia.edu/ptpl/anderson.html

Christiansen, T. & Torkington, N., 1998, Perl cookbook, O’Reilly, Sebastopol CA.

FLICC Personnel Working Group (2001) Sample KSAs for Librarian Positions: Catalogers
http://www.loc.gov/flicc/wg/ksa-cat.html

Hoffman, P. 1997, Perl 5 for dummies, IDG Books, Foster City CA.

Huggard, S. & Groenewegen, D., 2001, ‘E-data management: data access and cataloguing strategies to support
the Monash University virtual library’, LASIE, April 2001, p.25-42.

Library of Congress’s MARCBreaker and MARCMaker programs available at:
http://lcweb.loc.gov/marc/marctools.html

Program for Cooperative Cataloging Task Group on Aggregator Databases, 2000, Final report.
http://lcweb.loc.gov/catdir/pcc/aggfinal.html

Reese, T. MarcEdit 3.0 program available at: http://ucs.orst.edu/~reeset/marcedit/index.html

Schwartz, R. & Christiansen, T. 1997, Learning Perl, 2nd ed., O’Reilly, Sebastopol CA.

Silver, Nik, Perl tutorial. http://fpg.uwaterloo.ca:80/perl/

Take 10 min to learn Perl http://www.geocities.com/SiliconValley/7331/ten_perl.html

Tennant, R. 1999 ‘Skills for the new millennium’, LJ Digital, January 1, 1999.
http://www.libraryjournal.com/articles/infotech/digitallibraries/19990101_412.htm

Thomas, S. marc.pl utility available at: http://www.library.adelaide.edu.au/~sthomas/scripts/

Using MARC.pm with batches of MARC records : the VIVA experience, 2000. [Online]
http://marcpm.sourceforge.net/examples/viva.html

Author
Jenny Quilliam
Coordinator (Records)
Technical Services
University of South Australia Library
Email: jenny.quilliam@unisa.edu.au
