Clusters from Outer Space
Primo Deduping and FRBRizing in Context and Reality
Laura Akerman, Nathalie Schulz, Amelia Rowe
With help from Lukas Koster
IGELU Annual Meeting, September 12, 2017, St. Petersburg, Russia
1. Why do librarians bring things together?
It’s called “collocation”...
Cutter’s rules for a dictionary catalog, 1875
Functional Requirements for Bibliographic Records, 1991
The study uses an entity analysis technique that begins by isolating the entities that are the key objects of interest to users of bibliographic records. The study then identifies the characteristics or attributes associated with each entity and the relationships between entities that are most important to users in formulating bibliographic searches, interpreting responses to those searches, and “navigating” the universe of entities described in bibliographic records.
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
It’s all about what users do
● using the data to find materials that correspond to the user’s stated search criteria (e.g., in the context of a search
for all documents on a given subject, or a search for a recording issued under a particular title);
● using the data retrieved to identify an entity (e.g., to confirm that the document described in a record corresponds to
the document sought by the user, or to distinguish between two texts or recordings that have the same title);
● using the data to select an entity that is appropriate to the user’s needs (e.g., to select a text in a language the user
understands, or to choose a version of a computer program that is compatible with the hardware and operating
system available to the user);
● using the data in order to acquire or obtain access to the entity described (e.g., to place a purchase order for a
publication, to submit a request for the loan of a copy of a book in a library’s collection, or to access online an
electronic document stored on a remote computer).
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
FRBR Work
work: a distinct intellectual or artistic creation.*
● An abstract entity - no one material item to point to
● Recognized in realizations or expressions;
● Work is the commonality of content between and among various expressions (example: Homer’s Iliad)
● Sometimes difficult to define boundaries; differences may be cultural.
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
FRBR Expression
expression: the intellectual or artistic realization of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms
● Any change in intellectual or artistic content constitutes a new expression
● Change in form (e.g. from alphanumeric to spoken word) - new expression
● Changes in physical form (e.g. typeface) are not a new expression
● Example of a new expression: a translation
● My own “layman’s” term would be “version”
2. How do librarians bring things together?
Technology made this change...
Card Catalog - linear arrangement
● Various ways of organizing cards, but the principle of bringing together the various versions of a work.
● “Deduping” could be adding call numbers for print and microform to the same card.
A.L.A. Rules for Filing Catalog Cards, 1942. “Second printing, with corrections, April, 1943.” https://catalog.hathitrust.org/Record/002433836
Here you see in (b) an alternative rule, something like the origin of the “uniform title” concept - organizing all translations via a heading for the original title and language.
California Digital Library “dedup” algorithm
“DLA merges book format records through a complex algorithm that assigns numeric "weights" for
matches on different parts of the bibliographic record. When the total of these weights reaches a certain
level, the records are considered to be sufficiently alike to warrant bringing them together as a single
database record. If the total weight does not reach this level, the records are not merged.
Not all data elements have to match exactly for the records to be merged. The use of weighting means
that some variation between the records can be tolerated, as long as the overall score is high enough to
be considered a match.”
Coyle, Karen. Technical Report No. 6: Rules for Merging MELVYL® Records. Revised June 1992 (copy provided privately).
See also Coyle, Karen, and Linda Gallaher-Brown. “Record matching: an expert algorithm.” ASIS ’85: Proceedings of the American Society for Information Science (ASIS) 48th Annual Meeting, vol. 22, 1985.
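The weighted-match idea quoted above can be sketched in a few lines. The field names, weights, and threshold below are invented for illustration; the real MELVYL rules score many more parts of the record and give partial-match credit.

```python
# Minimal sketch of a CDL-style weighted record-matching algorithm.
# Field names, weights, and the threshold are illustrative only.

MATCH_WEIGHTS = {"title": 40, "date": 20, "publisher": 15, "pagination": 15}
MISMATCH_PENALTIES = {"title": -100, "date": -30}
THRESHOLD = 60  # records merge only if the total weight reaches this level


def match_score(rec_a: dict, rec_b: dict) -> int:
    """Sum positive weights for matching fields and penalties for conflicts.

    A field missing from either record contributes nothing, which is how
    weighting lets some variation between records be tolerated.
    """
    score = 0
    for field, weight in MATCH_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # absent data neither helps nor hurts
        if a == b:
            score += weight
        else:
            score += MISMATCH_PENALTIES.get(field, 0)
    return score


def should_merge(rec_a: dict, rec_b: dict) -> bool:
    """Merge only when the overall score reaches the threshold."""
    return match_score(rec_a, rec_b) >= THRESHOLD
```

With these toy weights, two records agreeing on title and date just reach the threshold, while a date conflict drags the score well below it.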
Other approaches
● VTLS Cataloging system based on FRBR entities: https://www.slideshare.net/VisionaryTechnology/vtls-8-years-experience-with-frbr-rda-4755109
● WorldCat Work Descriptions: http://www.oclc.org/developer/develop/linked-data/worldcat-entities/worldcat-work-entity.en.html
3. Ex Libris’s dedup and FRBR algorithms in Primo
Primo Dedup ...
● Derived from California Digital Library algorithm.
● Roughly equivalent to FRBR “Expression” level - edition of a book, director’s
cut of a movie, recording of a symphony by a particular orchestra on a certain
date
● Should bring together issuances of same content in different formats - print,
electronic, microform, etc. (manifestations)
Primo Dedup merged record
● Provides a merged PNX record - selecting one description out of the “dups”, then adding from all the records:
○ local fields,
○ holdings/items.
● Primo’s selection of “preferred record” is based on the “delivery category” assigned by the Primo
norm rules. Current hierarchy is:
○ SFX resource
○ Electronic resource
○ Metalib resource
○ Physical item
Dedup - matching up “dups”
● Assign a “score” based on full or partial matching of selected fields, as indicated in the
“dedup” section of the PNX (created by normalization rules)
● Same field, different rules for serials, for articles, and for everything else
● If score meets target number, it’s a match.
● The Primo ingest pipe calculates match scores for every incoming record and assigns a
match ID associated with matching records. It also removes deleted records from a
match ID cluster, and adds or removes records to a match ID if their score changes.
● If changes are made to the dedup normalization rules, the records would need to be
updated (renormalization pipe or reload from source) to change.
● “Force dedup” setting on a renormalization pipe might be needed if you tinkered with the dedup normalization rules.
File CDLMatchingProfile can be edited
<handler id="CDLID">
<fieldID>f1,f2,f3,f4</fieldID>
<name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name>
<arguments>
<argument name="recID_match">+200</argument>
<argument name="recID_recIDInvalid_match">+100</argument>
<argument name="recIDInvalid_match">+50</argument>
<argument name="recID_mismatch">-470</argument>
<argument name="recID_recIDInvalid_mismatch">-50</argument>
<argument name="ISBN_match">+85</argument>
<argument name="ISBN_ISSN_match">+30</argument>
<argument name="ISSN_ISSN_match">+10</argument>
<argument name="ISSN_ISBN_mismatch">-225</argument>
https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/050Matching_Records_in_the_Serials_and_Non-Serials_Dedup_Algorithm/020Customizing_the_Dedup_Algorithms
Path: primo/p4_1/ng/primo/home/profile/publish/publish/production/conf/
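One design point worth noting in the handler above: the mismatch penalties (e.g. recID_mismatch at -470) are much larger than the match bonuses, so a single conflicting identifier can veto an otherwise plausible dedup. A toy sketch of that behavior, modeling only record ID and ISBN with the weights quoted above (the real CDLIDComparator also scores ISSNs and “invalid” identifier variants):

```python
# Toy illustration of the asymmetric identifier weights quoted above.
# Only recID and ISBN are modeled; this is not the real comparator.

RECID_MATCH, RECID_MISMATCH = +200, -470
ISBN_MATCH = +85


def id_score(a: dict, b: dict) -> int:
    """Score the identifier portion of a record pair.

    A recID conflict (-470) outweighs a recID match (+200) and an ISBN
    match (+85) combined, so one conflicting ID vetoes the dedup.
    """
    score = 0
    if a.get("recID") and b.get("recID"):
        score += RECID_MATCH if a["recID"] == b["recID"] else RECID_MISMATCH
    if a.get("isbn") and b.get("isbn") and a["isbn"] == b["isbn"]:
        score += ISBN_MATCH
    return score
```

Two records sharing an ISBN but carrying different record IDs score 85 - 470 = -385: the penalty dominates, and the pair will not dedup.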
Normalization Rules can be modified
Dedup Test analyzes why 2 records do or don’t dedup
5. Dedup at Emory Libraries (Laura)
● When we first implemented Primo in 2008-9, we experimented with FRBR but decided it was too
confusing for users. But we wanted dedup.
● Intent of dedup was to bring print, microform, electronic etc. versions of the same content together.
● Our big concern at implementation: we were creating very brief records for electronic serials from SFX, and they became the “merge record,” so our lovely full CONSER print serial records disappeared. Our solution at that time was to add 856 URLs to the serial records, making the print record “electronic” to Primo’s norm rules, which put it on equal footing in the choice of merge record. This was too much manual work.
● With Alma, things are better for e-serials; Community Zone e-journal records are fuller, so we can
choose fuller records for e-serials in Alma.
● From time to time when we have had dedup problems, Ex Libris support staff have suggested we
just use FRBR instead, but we have re-evaluated it and decided “no”.
The algorithm isn’t friendly to rare book cataloging.
The first edition and some of the rare editions of this book were deduping together. Why? Dates...
Solution? Exclude the entire library or location where the rare stuff lives.
(Screenshot of norm rules)
Rare books in microform collection still dedup
245 10 |a Libellus |h [microform] / |c F.
Barholomei de Vsingn Agustiniani de falsis prophetis tam
in persona quã doctrina vitandis a fidelibus. De recta et
mũda predicatiõe euãgelij & quibus conformiter illud
debeat predicari. ...
264 _1 |a Erphurdie [i.e. Erfurt] : |b
[Matthes Maler], |c 1525.
300 __ |a 79 pages (4to) ; |c cm.
336 __ |a text |b txt |2 rdacontent
337 __ |a microform |b h |2
rdamedia
338 __ |a microfiche |b he |2
rdacarrier
500 __ |a Signatures: A-K4.
500 __ |a Title within ornamental
border.
510 4_ |a Panzer (Annales
typographici) |c VI: 503, 63
510 4_ |a Kuczyński |c 2681
245 10 |a Libellus |h [microform] / |c F.
Bartholomei de Vsingen Augustiniani de Merito
bonorum operum. In quo veris argumentis respondet
ad instructionem fratris Mechlerij Franciscani de
bonis operibus. quam inscribit christianã. ...
264 _1 |a Erphurdie [i.e.
Erfurt] : |b [Mathes Maler], |c 1525.
300 __ |a 70 pages (4to) ; |c
cm.
336 __ |a text |b txt |2
rdacontent
337 __ |a microform |b h |2
rdamedia
338 __ |a microfiche |b he |2
rdacarrier
500 __ |a Signatures: A-I4.
500 __ |a Title within
ornamental border.
510 4_ |a Panzer (annales
typographici) |c VI: 503, 62
Other side effects
Our digitized books from the Rose Library Special Collections (not in Alma) no longer dedup with the source physical book records from Alma - even though we retained the record ID in the digital metadata.
Media problems
PNX
Why?
No identifiers in the separate records that could break the dedup.
245 (Title) subfield $p (part) or $n (number) for the volume number doesn’t have enough weight to lower the score sufficiently.
Solution - not nice
Add the MMSID for each Alma record for the 12 volumes to the dedup rule so it will get a t99 “do not dedup” value.
Same title same year (work in progress?...)
Both published 1999. Same composer and work (Chopin, Piano Concertos nos. 1 & 2).
Artists (Arthur Rubinstein, Martha Argerich) are not part of the dedup algorithm!
● Ideas: add a mapping of 024, 028, or 037 (publisher numbers - repeatable, not consistently formatted, not “universal”) as Universal ID (F1)
● Support suggested: add the record ID to F1 (Universal ID) as the last “or” choice, to subtract points and prevent dedups
The American movie directed by Steven Seagal and the Chinese-language movie directed by Corey Yuen with the same title were issued in 2009. I couldn’t find a thumbnail of our copy of the Yuen movie, which is a videodisc.
6. More Dedup problems at RMIT University
(Amelia)
Genki
● Two records whose titles differ by a single number
● The number is displayed in Roman numerals (I and II)
● Primo was deduping the records and only displaying title metadata related to Genki II
● Users couldn’t find Genki I
Screenshot of the DeDup test in Primo BO. This is how we identified that the title field was matching.
Solution: changed the Roman numerals in the title (245 $a) to a numerical representation.
For example: 246 $a Genki 2
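A fix like the Genki one can also be scripted when many titles are affected. The helper below is hypothetical (not a Primo norm rule): it simply rewrites a trailing Roman numeral (I through X) in a title string to an Arabic digit, so “Genki I” and “Genki II” yield distinct, unambiguous titles.

```python
import re

# Hypothetical cleanup helper: rewrite a trailing Roman numeral (I-X)
# in a title to its Arabic equivalent. Not part of Primo itself.

ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5,
         "vi": 6, "vii": 7, "viii": 8, "ix": 9, "x": 10}


def arabize_title(title: str) -> str:
    """Replace a trailing Roman-numeral token with an Arabic digit.

    Unknown tokens are left unchanged; numerals embedded inside a word
    (e.g. the final "i" of "Genki") are not touched.
    """
    def repl(match: "re.Match") -> str:
        token = match.group(1).lower()
        return str(ROMAN.get(token, match.group(1)))

    return re.sub(r"\b([ivx]+)\s*$", repl, title, flags=re.IGNORECASE)
```

The word-boundary anchor matters: without it, the final “i” of “Genki” itself would be rewritten.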
More Dedup problems (Amelia)
Dedup only happens within a single pipe
7. Primo “FRBR clustering” (Nathalie)
● Simpler algorithm
● Uses author-title (or title only) keys to create clusters of records for a work.
● In the FRBRization part of a pipe, if a match is found based on the keys the
record is added to the same FRBR group.
FRBR matching
FRBR vector (simplified explanation)
K1 - Author part key (Fields 100 or 110 or 111 OR 700, 710, 711)
K2 - Title only key (Field 130)
K3 - Title part key (Non-serials: 240 and 245; serials: 240, or 245 if no 240 exists)
● Not all subfields are used.
● Normalization to remove punctuation, change to lowercase, etc.
● K1 and K3 are combined for matching, K2 is not.
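The key construction above can be approximated as follows. The normalization and field selection are simplified (real Primo rules pick specific MARC subfields and apply more transformations), and the “title~author” key shape mirrors the PNX examples on the following slides.

```python
import re

# Simplified sketch of FRBR match-key construction: K1 from author
# headings, K3 from the title, normalized and combined as "title~author".
# Real Primo rules use specific MARC subfields and more normalization.


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def frbr_match_keys(authors: list, title: str) -> set:
    """Combine each author key (K1) with the title key (K3)."""
    k3 = normalize(title)
    return {f"{k3}~{normalize(author)}" for author in authors}
```

Run against the Riley example above, the 9th/10th editions produce only “riley on business interruption insurance~roberts harry”, while the 7th/8th editions produce keys ending in “~cloughton david” and “~riley denis” - no key in common, hence no cluster.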
FRBR problems (Nathalie, Bodleian Libraries)
● Records that you want to cluster, that don’t
● Records that cluster, that you don’t want to
● Sort order within clusters
(Examples are from http://solo.bodleian.ox.ac.uk - which has FRBR turned on, but
not dedup)
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 1
FRBR problems (Nathalie)
FRBR section of the PNX records
Print record
<k3>$$Kjournal of women politics and policy$$AT</k3>
Key used for matching: none
Electronic records
<k2>$$Kjournal of women politics & policy online$$ATO</k2>
<k3>$$Kjournal of women politics and policy$$AT</k3>
Key used for matching: journal of women politics & policy online
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 2
FRBR problems (Nathalie)
FRBR section of the PNX records
9th and 10th editions:
<k1>$$Kroberts harry$$AA</k1>
<k3>$$Kriley on business interruption insurance$$AT</k3>
Key used for matching: riley on business interruption insurance~roberts harry
7th and 8th editions:
<k1>$$Kcloughton david$$AA</k1>
<k1>$$Kriley denis$$AA</k1>
<k3>$$Kriley on business interruption insurance$$AT</k3>
Keys used for matching: riley on business interruption insurance~cloughton david
riley on business interruption insurance~riley denis
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 3
FRBR problems (Nathalie)
FRBR section of the PNX records
Print record - incorrect metadata! (245 14 $a Three sisters: the nonfiling indicator of 4 strips the first four characters, producing “e sisters”)
<k1>$$Kcaldwell lucy 1981$$AA</k1>
<k3>$$Ke sisters$$AT</k3>
Electronic Record
<k1>$$Kcaldwell lucy 1981$$AA</k1>
<k3>$$Kthree sisters$$AT</k3>
FRBR problems (Nathalie)
● Records that cluster, that you don’t want to
○ This is subjective!
○ The normalization rules can be used to exclude records from clustering by assigning
“<t>99</t>”
● Oxford case-study
○ Excluded from clustering: printed maps, printed music, sound recordings, video recordings,
computer software, and printed books prior to 1830.
○ Individual records can also be excluded by adding a local field to the Aleph record (which is
used by the normalization rules).
FRBR problems (Nathalie)
● Sort order within clusters
○ Set in the Back office.
● Oxford case-study
○ At Oxford we have chosen relevance, as that works best for people doing known-item searches: the result they want will usually be the first record in the cluster.
○ However, Date-newest would be preferable in some situations (e.g. multiple editions of a textbook)
○ Sometimes the most “relevant” record is not what you would expect ….
FRBR problems (Nathalie)
FRBR problems (Nathalie)
8. FRBR problems (Amelia)
FRBR clustering unexpectedly not occurring - for example, because of minor differences in cataloging
Solution (to be implemented)
Add transformations to Normalization rules - FRBR Section
(thank you, Nathalie, for the solution to this problem)
More FRBR problems (Amelia)
Tecnica dei modelli
● Fashion series split into 3 volumes
● Each volume has its own Alma record
● Primo was clustering the records and only displaying the $n information for
volume 3 in the search results
● Users couldn’t find volumes 1 and 2
Solution: add t=99 for records with the series title 240 $a Tecnica dei modelli
Preventing FRBR (Amelia)
Other FRBR problems (Amelia)
● User understanding
○ How much do users understand about clustering?
○ How much do they need to know?
● Staff training requirements
○ How much do staff understand about clustering?
○ How much do they need to know?
■ Enough to help the users
Above: Screenshot of deduped item in Classic UI
Below: Screenshot of deduped item in New UI
DeDup : Classic Primo and New Primo
FRBR : Classic Primo and New Primo
Above: Screenshot of clustered item in Classic UI
Below: Screenshot of clustered item from New UI
Summary of issues with Primo
● 245 $n and $p not given enough weight
● Inability to DeDup or Cluster across all collections (example: Alma and PCI)
● Matching depends on textual strings in the metadata - this can have errors or
legitimate variations
● Deduping should not happen for rare book cataloging
● Lack of control on choice of the “merged record” for Deduping
● Lack of reliable identifiers in records especially for media….
● Lack of control...
The Future...
● New field approved to be added to MARC for work identifiers (URIs): 758
● Linked Data! If you define an Entity… it must have an Identifier (URI: URL or URN).
● RDA/FRBR “Work” vs BIBFRAME “Work” (RDA Expression?)
● Not clear where the overlaps or agreements are in version 2.0
● BIBFRAME still being refined
Questions:
How might we address problems with deduping and FRBR clustering?
Should the algorithms be modified?
Should Work and Expression identifiers be generated on the fly in Alma and Primo, or be generated once, stored, and editable?
Is Primo Dedup merged display best for users? What other approaches might
work better?
Contacts:
Laura Akerman, Discovery Systems and Metadata Librarian, Emory University
liblna@emory.edu
Nathalie Schulz, Systems Analyst, Bodleian Libraries, University of Oxford
Nathalie.Schulz@bodleian.ox.ac.uk
Amelia Rowe, Applications Librarian, RMIT University
amelia.rowe2@rmit.edu.au
Credits:
● Opening image: NASA, Hubble Space Telescope image, Gas Clouds and Star Clusters, NGC 1850.jpg
● Image from Cutter, Charles A., 1837-1903. Rules for a Printed Dictionary Catalogue. Washington: Government Printing Office, 1875. Retrieved from HathiTrust, https://catalog.hathitrust.org/Record/009394960
● Frank Sinatra and Martha Argerich album cover and Above the Law (Segal) DVD cover thumbnails from
Amazon.com
● Artur Rubinstein album cover thumbnail from Discogs.com
● Above the Law (Yuen) DVD thumbnail from Internet Movie Database

PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 

Clusters from outer space: Primo Deduping and FRBRizing in Context and Reality

  • 1. Clusters from Outer Space Primo Deduping and FRBRizing in Context and Reality Laura Akerman, Nathalie Schulz, Amelia Rowe With help from Lukas Koster IGELU Annual Meeting September 12 2017 St. Petersburg, Russia
  • 2. 1. Why do librarians bring things together? It’s called “collocation”...
  • 3. Cutter’s rules for a dictionary catalog, 1875
  • 4. Functional Requirements for Bibliographic Records, 1991 The study uses an entity analysis technique that begins by isolating the entities that are the key objects of interest to users of bibliographic records. The study then identifies the characteristics or attributes associated with each entity and the relationships between entities that are most important to users in formulating bibliographic searches, interpreting responses to those searches, and “navigating” the universe of entities described in bibliographic records. IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K . G. Saur München 1998
  • 5. It’s all about what users do ● using the data to find materials that correspond to the user’s stated search criteria (e.g., in the context of a search for all documents on a given subject, or a search for a recording issued under a particular title); ● using the data retrieved to identify an entity (e.g., to confirm that the document described in a record corresponds to the document sought by the user, or to distinguish between two texts or recordings that have the same title); ● using the data to select an entity that is appropriate to the user’s needs (e.g., to select a text in a language the user understands, or to choose a version of a computer program that is compatible with the hardware and operating system available to the user); ● using the data in order to acquire or obtain access to the entity described (e.g., to place a purchase order for a publication, to submit a request for the loan of a copy of a book in a library’s collection, or to access online an electronic document stored on a remote computer). IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K . G. Saur München 1998
  • 6. FRBR Work work: a distinct intellectual or artistic creation.* ● An abstract entity - no one material item to point to ● Recognized in realizations or expressions; ● Work is the commonality of content between and among various expressions (example: Homer’s Iliad) ● Sometimes difficult to define boundaries; differences may be cultural. IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur München 1998
  • 7. FRBR Expression expression: the intellectual or artistic realization of a work in the form of alphanumeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms ● Any change in intellectual or artistic content constitutes a new expression ● Change in form (e.g. from alphanumeric to spoken word) - new expression ● Changes in physical form (e.g. typeface) are not a new expression ● Example of a new expression - a translation ● My own “layman’s” term would be “version”
  • 8. 2. How do librarians bring things together? Technology made this change...
  • 9. Card Catalog - linear arrangement ● Various ways of organizing cards, but the principle of bringing together the various versions of a work. ● “Deduping” could be adding call numbers for print and microform to the same card. A. L. A. rules for filing catalog cards 1942. “Second printing, with corrections, April, 1943” https://catalog.hathitrust.org/Record/002433836
  • 10. Here you see in (b) alternative rule, something like the origin of “uniform title” concept - organizing all translations via a heading for the original title and language.
  • 11. California Digital Library “dedup” algorithm “DLA merges book format records through a complex algorithm that assigns numeric "weights" for matches on different parts of the bibliographic record. When the total of these weights reaches a certain level, the records are considered to be sufficiently alike to warrant bringing them together as a single database record. If the total weight does not reach this level, the records are not merged. Not all data elements have to match exactly for the records to be merged. The use of weighting means that some variation between the records can be tolerated, as long as the overall score is high enough to be considered a match.” Coyle, Karen. Technical Report No. 6 RULES FOR MERGING MELVYL(R) RECORDS* Revised June 1992 (copy provided privately). See also Coyle, Karen, and Linda Gallaher-Brown. "Record matching: an expert algorithm." ASIS'85: Proceedings of the American Society for Information Science (ASIS) 48th Annual Meeting. Vol. 22. 1985.
  • 12. Other approaches ● VTLS Cataloging system based on FRBR entities https://www.slideshare.net/VisionaryTechnology/vtls-8-years-experience-with- frbr-rda-4755109 ● WorldCat Work Descriptions: http://www.oclc.org/developer/develop/linked- data/worldcat-entities/worldcat-work-entity.en.html
  • 13. 3. Ex Libris’s dedup and FRBR algorithms in Primo
  • 14. Primo Dedup ... ● Derived from California Digital Library algorithm. ● Roughly equivalent to FRBR “Expression” level - edition of a book, director’s cut of a movie, recording of a symphony by a particular orchestra on a certain date ● Should bring together issuances of same content in different formats - print, electronic, microform, etc. (manifestations)
  • 15. Primo Dedup merged record ● Provides a merged record PNX - selecting one description out of the “dups”, then adding from all the records: ○ local fields, ○ holdings/items from all the records. ● Primo’s selection of “preferred record” is based on the “delivery category” assigned by the Primo norm rules. Current hierarchy is: ○ SFX resource ○ Electronic resource ○ Metalib resource ○ Physical item
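The delivery-category hierarchy above can be sketched as a simple ranked choice. The category names follow the slide; the function and record shapes are illustrative assumptions, not Primo's actual implementation.

```python
# Illustrative sketch (not Primo's code): choose the preferred "merge record"
# by ranking each dup's delivery category against the hierarchy above.
PREFERENCE = [
    "SFX resource",
    "Electronic resource",
    "Metalib resource",
    "Physical item",
]


def preferred_record(records):
    """Return the record whose delivery category ranks highest in PREFERENCE."""
    return min(records, key=lambda r: PREFERENCE.index(r["delivery_category"]))
```

Under this sketch, an SFX record would always win over a physical record as the base of the merged PNX, which is exactly the behavior the Emory example below runs into with brief SFX serial records.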
  • 16. Dedup - matching up “dups” ● Assign a “score” based on full or partial matching of selected fields, as indicated in the “dedup” section of the PNX (created by normalization rules) ● Same fields, but different rules for serials, for articles, and for everything else ● If the score meets the target number, it’s a match. ● The Primo ingest pipe calculates match scores for every incoming record and assigns a match ID associated with matching records. It also removes deleted records from a match ID cluster, and adds or removes records from a match ID if their score changes. ● If changes are made to the dedup normalization rules, the records need to be reprocessed (renormalization pipe or reload from source) for the changes to take effect. ● A “force dedup” setting on a renormalization pipe might be needed if you tinkered with
  • 17. File CDLMatchingProfile can be edited <handler id="CDLID"> <fieldID>f1,f2,f3,f4</fieldID> <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name> <arguments> <argument name="recID_match">+200</argument> <argument name="recID_recIDInvalid_match">+100</argument> <argument name="recIDInvalid_match">+50</argument> <argument name="recID_mismatch">-470</argument> <argument name="recID_recIDInvalid_mismatch">-50</argument> <argument name="ISBN_match">+85</argument> <argument name="ISBN_ISSN_match">+30</argument> <argument name="ISSN_ISSN_match">+10</argument> <argument name="ISSN_ISBN_mismatch">-225</argument> https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/050Matching_Records_in_the_Serials_and_Non -Serials_Dedup_Algorithm/020Customizing_the_Dedup_Algorithms Path: primo/p4_1/ng/primo/home/profile/publish/publish/production/conf/
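A minimal sketch of the weighted-match scoring behind this profile: each field comparison adds or subtracts points, and two records dedup only when the total clears a threshold. The record-ID and ISBN weights echo the excerpt above; the threshold value and record shapes are hypothetical, and the real algorithm compares many more fields.

```python
# Illustrative weighted-match scoring (not Primo's actual code).
# Weights for recID and ISBN follow the CDLMatchingProfile excerpt;
# THRESHOLD is a made-up target score for the sketch.
WEIGHTS = {
    "recID_match": 200,
    "recID_mismatch": -470,
    "ISBN_match": 85,
}
THRESHOLD = 275  # hypothetical target score


def dedup_score(a, b):
    """Score two records: matching evidence raises the score, conflicts lower it."""
    score = 0
    if a.get("rec_id") and b.get("rec_id"):
        score += (WEIGHTS["recID_match"] if a["rec_id"] == b["rec_id"]
                  else WEIGHTS["recID_mismatch"])
    if a.get("isbn") and b.get("isbn") and a["isbn"] == b["isbn"]:
        score += WEIGHTS["ISBN_match"]
    return score


def is_dup(a, b):
    return dedup_score(a, b) >= THRESHOLD
```

The point of the weighting is tolerance: a missing or slightly different field does not by itself block a merge, but a hard conflict (like two different record IDs) carries a large enough penalty to sink the total.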
  • 18. Normalization Rules can be modified
  • 19. Dedup Test analyzes why 2 records do or don’t dedup
  • 20.
  • 21. 5. Dedup at Emory Libraries (Laura) ● When we first implemented Primo in 2008-9, we experimented with FRBR but decided it was too confusing for users. But we wanted dedup. ● The intent of dedup was to bring print, microform, electronic etc. versions of the same content together. ● Our big concern at implementation was that we were creating very brief records for electronic serials from SFX, and when they became the “merge record”, our lovely full print CONSER serial records disappeared. Our solution at that time was to add 856 URLs to the serial records, making the print record “electronic” to Primo’s norm rules, which put it on equal footing in the choice of merge record. This was too much manual work. ● With Alma, things are better for e-serials; Community Zone e-journal records are fuller, so we can choose fuller records for e-serials in Alma. ● From time to time when we have had dedup problems, Ex Libris support staff have suggested we just use FRBR instead, but we have re-evaluated it and decided “no”.
  • 22. The algorithm isn’t friendly to rare book cataloging. The first edition and some of the rare editions of this book were deduping together. Why? Dates...
  • 23. Solution? Exclude entire library or location where the rare stuff lives (Screenshot of norm rules)
  • 24. Rare books in microform collection still dedup
  • 25. 245 10 |a Libellus |h [microform] / |c F. Barholomei de Vsingn Agustiniani de falsis prophetis tam in persona quã doctrina vitandis a fidelibus. De recta et mũda predicatiõe euãgelij & quibus conformiter illud debeat predicari. ... 264 _1 |a Erphurdie [i.e. Erfurt] : |b [Matthes Maler], |c 1525. 300 __ |a 79 pages (4to) ; |c cm. 336 __ |a text |b txt |2 rdacontent 337 __ |a microform |b h |2 rdamedia 338 __ |a microfiche |b he |2 rdacarrier 500 __ |a Signatures: A-K4. 500 __ |a Title within ornamental border. 510 4_ |a Panzer (Annales typographici) |c VI: 503, 63 510 4_ |a Kuczyński |c 2681 245 10 |a Libellus |h [microform] / |c F. Bartholomei de Vsingen Augustiniani de Merito bonorum operum. In quo veris argumentis respondet ad instructionem fratris Mechlerij Franciscani de bonis operibus. quam inscribit christianã. ... 264 _1 |a Erphurdie [i.e. Erfurt] : |b [Mathes Maler], |c 1525. 300 __ |a 70 pages (4to) ; |c cm. 336 __ |a text |b txt |2 rdacontent 337 __ |a microform |b h |2 rdamedia 338 __ |a microfiche |b he |2 rdacarrier 500 __ |a Signatures: A-I4. 500 __ |a Title within ornamental border. 510 4_ |a Panzer (annales typographici) |c VI: 503, 62
  • 26. Other side effects - Our digitized books from the Rose Library Special collections (not in Alma) no longer dedup with the source physical book records from Alma - even though we retained the record ID in the digital metadata.
  • 28.
  • 29. PNX
  • 30. Why? No identifiers in the separate records that could break the dedup. 245 (Title) subfield $n (number of part) or $p (part) for the volume number doesn’t have enough weight to lower the score enough.
  • 31. Solution - not nice Add the MMSID for each Alma record for the 12 volumes to the Dedup rule so it will get a t99 “do not dedup” value.
  • 32. Same title same year (work in progress?...) Both published 1999. Same composer and work (Chopin, Piano Concertos nos. 1 & 2) Artists (Arthur Rubinstein, Martha Argerich) are not part of the dedup algorithm! ● Ideas: Add mapping of 024 or 028 or 037 - (publisher numbers, repeatable, not consistently formatted, not “universal”) as Universal ID (F1) ● Support suggested: Add record ID to F1 (Universal ID) as the last “or” choice, to subtract points/prevent dedups The American movie starring Steven Seagal and the Chinese-language movie directed by Corey Yuen with the same title were issued in 2009. I couldn’t find a thumbnail of our copy of the Yuen movie, which is a videodisc.
  • 33. 6. More Dedup problems at RMIT University (Amelia) Genki ● Two records whose titles differ only by a single number ● The number displayed in roman numerals I and II ● Primo was deduping the records and only displaying title metadata related to Genki II ● Users couldn’t find Genki I
  • 34. Screenshot of the DeDup test in Primo BO. This is how we identified the title field was matching.
  • 35. Solution = Changed roman numerals in the title (245 $a) to a numerical representation. For example: 246 $a Genki 2
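A small hypothetical helper, not part of Primo or the RMIT workflow, showing the kind of preprocessing this fix amounts to: replacing a trailing roman numeral in a title with an arabic number so the two titles no longer normalize to the same dedup key.

```python
import re

# Hypothetical preprocessing sketch: only i/v/x are handled, which covers
# volume-style numerals like I, II, III, IV.
ROMAN = {"i": 1, "v": 5, "x": 10}


def roman_to_int(numeral):
    """Convert a simple roman numeral (i/v/x only) to an integer."""
    s = numeral.lower()
    total = 0
    for ch, nxt in zip(s, list(s[1:]) + [None]):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one (e.g. IV).
        total += -value if nxt is not None and ROMAN[nxt] > value else value
    return total


def arabicize_title(title):
    """Replace a trailing roman numeral (I, II, IV, ...) with digits."""
    return re.sub(
        r"\b([ivx]+)\s*$",
        lambda m: str(roman_to_int(m.group(1))),
        title,
        flags=re.IGNORECASE,
    )
```

With this, "Genki II" becomes "Genki 2" while titles that merely end in a letter like "i" as part of a word (e.g. "Genki" itself) are left alone, because the word-boundary anchor only matches a standalone numeral.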
  • 36. More Dedup problems (Amelia) Dedup only happens within a pipe
  • 37. 7. Primo “FRBR clustering” (Nathalie) ● Simpler algorithm ● Uses author-title (or title only) keys to create clusters of records for a work. ● In the FRBRization part of a pipe, if a match is found based on the keys the record is added to the same FRBR group.
  • 38. FRBR matching FRBR vector (simplified explanation) K1 - Author part key (Fields 100 or 110 or 111 OR 700, 710, 711) K2 - Title only key (Field 130) K3 - Title part key (Not Serials: 240 and 245; Serials: 240 or if does not exist 245) ● Not all subfields are used. ● Normalization to remove punctuation, change to lowercase, etc. ● K1 and K3 are combined for matching, K2 is not.
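The key-building steps above can be sketched as follows. The normalization details (exactly which punctuation is stripped, how whitespace is collapsed) and the function shapes are assumptions for illustration, not Primo's exact rules; the tilde-joined "title~author" form follows the PNX examples later in the deck.

```python
import re
import string

# Illustrative sketch of FRBR key construction (not Primo's actual rules):
# normalize author (K1) and title (K3) strings, then combine them into
# a single match key. K2 (uniform-title-only) keys are matched separately.


def normalize(value):
    """Lowercase, drop punctuation, collapse runs of whitespace."""
    value = value.lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", value).strip()


def frbr_match_key(author, title):
    """Combine the K3 (title) and K1 (author) parts into one match key."""
    return f"{normalize(title)}~{normalize(author)}"
```

Because matching rides entirely on these normalized strings, any variation the normalization does not smooth over (a typo, "&" vs "and", a different author heading) yields a different key and splits the cluster, which is exactly what the Bodleian examples that follow show.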
  • 39. FRBR problems (Nathalie, Bodleian Libraries) ● Records that you want to cluster, that don’t ● Records that cluster, that you don’t want to ● Sort order within clusters (Examples are from http://solo.bodleian.ox.ac.uk - which has FRBR turned on, but not dedup)
  • 40. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 1
  • 41. FRBR problems (Nathalie) FRBR section of the PNX records Print record <k3>$$Kjournal of women politics and policy$$AT</k3> Key used for matching: none Electronic records <k2>$$Kjournal of women politics & policy online$$ATO</k2> <k3>$$Kjournal of women politics and policy$$AT</k3> Key used for matching: journal of women politics & policy online
  • 42. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 2
  • 43. FRBR problems (Nathalie) FRBR section of the PNX records 9th and 10th editions: <k1>$$Kroberts harry$$AA</k1> <k3>$$Kriley on business interruption insurance$$AT</k3> Key used for matching: riley on business interruption insurance~roberts harry 7th and 8th editions: <k1>$$Kcloughton david$$AA</k1> <k1>$$Kriley denis$$AA</k1> <k3>$$Kriley on business interruption insurance$$AT</k3> Keys used for matching: riley on business interruption insurance~cloughton david riley on business interruption insurance~riley denis
  • 44. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 3
  • 45. FRBR problems (Nathalie) FRBR section of the PNX records Print record - incorrect metadata! (24514 $aThree sisters) <k1>$$Kcaldwell lucy 1981$$AA</k1> <k3>$$Ke sisters$$AT</k3> Electronic Record <k1>$$Kcaldwell lucy 1981$$AA</k1> <k3>$$Kthree sisters$$AT</k3>
  • 46. FRBR problems (Nathalie) ● Records that cluster, that you don’t want to ○ This is subjective! ○ The normalization rules can be used to exclude records from clustering by assigning “<t>99</t>” ● Oxford case-study ○ Excluded from clustering: printed maps, printed music, sound recordings, video recordings, computer software, and printed books prior to 1830. ○ Individual records can also be excluded by adding a local field to the Aleph record (which is used by the normalization rules).
  • 47. FRBR problems (Nathalie) ● Sort order within clusters ○ Set in the Back Office. ● Oxford case-study ○ At Oxford we have chosen relevance, as that works best for people doing known-item searches: the result they want will usually be the first record in the cluster. ○ However, Date-newest would be preferable in some situations (e.g. multiple editions of a textbook) ○ Sometimes the most “relevant” record is not what you would expect ….
  • 50. 8. FRBR problems (Amelia) FRBR clustering unexpectedly not occurring - for example, because of minor differences in cataloging
  • 51. Solution (to be implemented) Add transformations to the Normalization rules - FRBR section (thank you, Nathalie, for the solution to this problem)
  • 52. More FRBR problems (Amelia) Tecnica dei modelli ● Fashion series split into 3 volumes ● Each volume has its own Alma record ● Primo was clustering the records and only displaying the $n information for volume 3 in the search results ● Users couldn’t find volumes 1 and 2
  • 53. Preventing FRBR (Amelia) Solution: Add t=99 for records with the series title 240 $a Tecnica dei modelli
  • 54. Other FRBR problems (Amelia) ● User understanding ○ How much do users understand about clustering? ○ How much do they need to know? ● Staff training requirements ○ How much do staff understand about clustering? ○ How much do they need to know? ■ Enough to help the users
  • 55. DeDup: Classic Primo and New Primo Above: Screenshot of deduped item in Classic UI Below: Screenshot of deduped item in New UI
  • 56. FRBR : Classic Primo and New Primo Above: Screenshot of clustered item in Classic UI Below: Screenshot of clustered item from New UI
  • 57. Summary of issues with Primo ● 245 $n and $p not given enough weight ● Inability to DeDup or Cluster across all collections (example: Alma and PCI) ● Matching depends on textual strings in the metadata - this can have errors or legitimate variations ● Deduping should not happen for rare book cataloging ● Lack of control on choice of the “merged record” for Deduping ● Lack of reliable identifiers in records especially for media…. ● Lack of control...
  • 58. The Future... ● New field approved to be added to MARC for work identifiers (URIs): 758 ● Linked Data! If you define an Entity… it must have an Identifier (URI: URL or URN). ● RDA/FRBR “Work” vs BIBFRAME “Work” (RDA Expression?) ● Not clear where the overlaps or agreements are in version 2.0 ● BIBFRAME still being refined
  • 59. Questions: How might we address problems with deduping and FRBR clustering? Should the algorithms be modified? Should Work and Expression identifiers be generated on-the-fly in Alma and Primo, or be generated once, be stored and be editable? Is Primo Dedup merged display best for users? What other approaches might work better?
  • 60. Contacts: Laura Akerman, Discovery Systems and Metadata Librarian, Emory University liblna@emory.edu Nathalie Schulz, Systems Analyst, Bodleian Libraries, University of Oxford Nathalie.Schulz@bodleian.ox.ac.uk Amelia Rowe, Applications Librarian, RMIT University amelia.rowe2@rmit.edu.au
  • 61. Credits: ● Opening image: NASA, Hubble Space Telescope image, Gas Clouds and Star Clusters, NGC 1850.jpg ● Image from Cutter, Charles A., 1837-1903, Rules for a printed dictionary catalogue. Washington: Government Printing Office, 1875, retrieved from HathiTrust, https://catalog.hathitrust.org/Record/009394960 ● Frank Sinatra and Martha Argerich album cover and Above the Law (Seagal) DVD cover thumbnails from Amazon.com ● Artur Rubinstein album cover thumbnail from Discogs.com ● Above the Law (Yuen) DVD thumbnail from Internet Movie Database

Editor's Notes

  1. (Laura) The origin of this talk is that I started receiving a spate of dedup issues reported by other librarians at Emory University and thought it would be an interesting topic. But I wanted to take a higher-level view of the process and wanted to include FRBR, which we don’t have experience with. So I put out a call to collaborate on the Primo list and was delighted to find great collaborators in Nathalie Schulz from the Bodleian Libraries, University of Oxford, and Amelia Rowe from RMIT University in Melbourne, Australia.
  2. Before we get into the juicy problems, I will start off with a little background - bear with it… Librarians have been bringing descriptions together for a very long time to assist users to find what they want.
  3. Bringing all versions of a work together before online catalogs involved arranging cards for different versions to be found together in the card catalog, as well as on the shelf, due to the classification numbers assigned to books. Works were generally to be found/identified in the card catalog by author and title, but if there was any ambiguity, a uniform title could be constructed which would be unique in combination with the author name. Certain special authors might have a more elaborate arrangement into sections by language, special sections for complete and selected works, compilations by form (e.g. “Poetical works”). This led to different sorts of uniform titles.
  4. In 1991 the International Federation of Library Associations published Functional Requirements for Bibliographic Records. This was a result of intense committee work to develop metadata requirements for libraries at the national level. The group analyzed both user tasks that needed support, and a definition of conceptual entities that lay behind those tasks and their relationships. This was really a new conceptualization of description for discovery of information resources.
  5. Some of the things we think users do include: determining if an electronic version of the same text and edition that exists in print is available through the library - or vice versa (some users prefer print); determining if a particular described sound recording contains a particular song; finding a specific rare printing of a book described in a specialized bibliography
  6. So in this presentation it is good to look at the FRBR Work and Expression entities and the differences between them. Work is abstract and something that could have grey areas - I am thinking of some manuscripts and serially published things... It’s a mental idea of all the versions of what we think of as being “the same work”.
  7. This is a bit of ancient history for most libraries today, but in a dictionary catalog, alphabetical arrangement of titles (within an author section of the catalog if there was an author main entry) would bring together different editions of books.
8. Merging and deduping of records for the “same content” became necessary when large-scale union catalogs, incorporating records from many sources, became possible with library automation and online catalogs. In an email to me, Karen Coyle, one of the authors of the algorithm, pointed out that author names were not as reliable when it was developed, due to a mix of old and new cataloging rules (AACR1 and AACR2), and that times have changed. This algorithm was the source for the dedup algorithm in Primo.
9. I’ve not had the time to do deep research into other system models but wanted to note these. VTLS is now owned by Innovative Interfaces; their literature speaks of users being able to search once and retrieve all related versions of a work, including those with variant titles and different languages. OCLC has been developing algorithms to identify “work entities” and associate them with clusters of records; these identifiers are available under an Open Data License and can be found in the linked data section of WorldCat records.
10. So this is not a full tutorial, but I’m just going to hit some highlights about how dedup works in Primo. The approach is based on the California Digital Library’s approach to merging “duplicate records” from its database. Only instead of weeding out duplicates (which you should do in your ILS before records get to Primo!), the idea is to combine descriptions of essentially the same content that may have different formats - e.g. electronic, microform, print; CD vs. streaming audio; etc.
11. Here’s where the tricky part comes in. There really isn’t an exact science behind assigning a score to each pair of records and calling it a match if it meets a certain threshold. That was developed through trial and error by the California Digital Library.
12. This is just a small part of the algorithm which, if you have server access to Primo, you can find and edit. You can see that it assigns point scores for various kinds of matches in data elements between two records. I think we have tried editing it a couple of times but have not had much success - if others have, I would be interested to hear your experiences. As far as I know, this file is not protected from being overwritten by updates.
13. Here you have a typical normalization rule for dedup. Field “F5” contains a title. The condition says this rule is only for “non-serials”. It takes the 245 field, including title, subtitle, number and part. Then it heavily normalizes the string to remove initial articles and various punctuation - brackets, ampersands, etc. There’s another rule for F5 for serials, which operates on the 022 field subfield z (invalid ISSN). These rules can be modified - at your own risk!
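To make that concrete, here is a minimal Python sketch of the kind of key-building such a rule performs. The article list, the punctuation set, and the function name are my own illustrative assumptions, not Primo's actual F5 rule.

```python
import re

# Hypothetical sketch of dedup-key title normalization: strip an
# initial English article, lowercase, and drop punctuation such as
# brackets and ampersands. The article list and regexes are
# illustrative assumptions, not Primo's actual F5 rule.
ARTICLES = ("the ", "a ", "an ")

def normalize_title(title: str) -> str:
    key = title.lower().strip()
    for article in ARTICLES:
        if key.startswith(article):
            key = key[len(article):]  # remove the initial article
            break
    key = re.sub(r"[\[\]&.,:;'\"!?()\-]", " ", key)  # strip punctuation
    return " ".join(key.split())                     # collapse whitespace

print(normalize_title("The [complete] poems & songs"))  # complete poems songs
```

Two records whose titles differ only in punctuation or a leading article then produce the same key and can be compared as potential duplicates.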
14. The DEDUP Test utility is a wonderful tool for understanding the complex process by which the dedup stage of record loading determines whether two records match and assigns them a dedupmrg number and match ID - or not. This happens in two stages. First, title, date and record identifier get checked; basically, if the record IDs don’t match but the title and date do, the pair goes on to full comparison, where “points” are assigned for full or partial matches, or subtracted for non-matches. Notice that the short title matches here, which gets 450 points.
  15. Notice that the long title doesn’t match completely, but enough words match so that it still gets 400 points. More about this later...
16. I regret I don’t have a screenshot of the deduped record here, but imagine all of these records deduping together. The date matching actually allows a couple of years’ variance: 25 points are subtracted for the lack of an exact match within two years - not enough to prevent the deduping in some cases.
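The two-stage scoring those slides describe can be sketched like this. The point values echo the examples above (450 for a short-title match, 400 for a near-match on the long title, a 25-point penalty when dates differ by up to two years), but the threshold, the 0.75 overlap cutoff, and the record shape are my own illustrative assumptions, not the real CDL/Primo algorithm.

```python
# Toy scorer echoing the point values mentioned on these slides.
# THRESHOLD, the overlap cutoff, and the +500/+100/-200 values are
# invented for illustration only.
THRESHOLD = 775  # illustrative match threshold

def word_overlap(t1: str, t2: str) -> float:
    """Jaccard overlap of the word sets of two normalized titles."""
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def score_pair(rec_a: dict, rec_b: dict) -> int:
    score = 0
    if rec_a["short_title"] == rec_b["short_title"]:
        score += 450
    if rec_a["long_title"] == rec_b["long_title"]:
        score += 500
    elif word_overlap(rec_a["long_title"], rec_b["long_title"]) >= 0.75:
        score += 400  # "enough words match"
    gap = abs(rec_a["date"] - rec_b["date"])
    if gap == 0:
        score += 100
    elif gap <= 2:
        score -= 25   # near miss: penalized, but often not fatally
    else:
        score -= 200
    return score

a = {"short_title": "columbia years",
     "long_title": "columbia years 1943 1952 complete recordings vol 7",
     "date": 1993}
b = {"short_title": "columbia years",
     "long_title": "columbia years 1943 1952 complete recordings vol 10",
     "date": 1994}
match = score_pair(a, b) >= THRESHOLD  # 450 + 400 - 25 = 825 -> True
```

Here, as with the multi-volume sets described below, two records that differ only in a volume number and a year still clear the threshold and dedup together.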
17. I tried various less drastic changes, but they weren’t working. Our Special Collections library (now called The Stuart A. Rose Library for Manuscripts and Rare Books) was impatient - a class was going to study this autobiography and its editions, and they couldn’t abide Primo clumping them together (along with a lot of other rare editions). This is just one example of many. So we ultimately added a rule to give all records with an item in that library a “do not dedup” value of t=99. Later, we did the same for the special collections location in the Theology library.
  18. But rare books in microform or electronic collections are still clumping...
  19. We want our rare books that we have digitized to dedup with the digital version, but they no longer do so - this is a tradeoff we’ve not found a solution for.
20. Twelve fabulous recordings of “Ol’ Blue Eyes” in our collection. Here’s a somewhat fuzzy image of the cover of Vol. 7, noting the songs “Night and Day”, “But Beautiful”, “The Song is You”, and “What’ll I Do?”
21. When I searched for “Frank Sinatra the Columbia years” in our Primo, I got two results - the record for the set, and a record for “Vol. 10 the Complete Recordings”. What happened to volume 7? If we look at the item details, we can see what happened: Primo deduped all the volumes, so only the contents note of vol. 10 displays.
  22. Viewing the PNX in the PNX viewer and clicking on “Match ID” confirms this
23. The first problem I neutralized in Production by adding “t=99” to these record IDs. Someone reported the second one - the videos - while I was at this conference. There are publisher identifier numbers in tags 024, 028 or 037 that could be added, but those fields are repeatable, there may be inconsistencies in the formatting, and the numbers aren’t universal, so I’m nervous about using them. The suggestion to add the Alma record ID to F1, the Universal ID field to which the 010 (LC card number) is mapped, might break this dedup - but how many others that we do want to happen would it also break? We will test these approaches but are not optimistic. Now I’m going to turn this over to Amelia Rowe, who’ll tell you about more fascinating dedup problems at RMIT University.
24. Items not deduping between collections/pipes is the greatest cause of confusion for our users. We can teach them about dedup and clustering, but if they see behaviour that doesn’t match what they’ve been taught, they get confused and think something is wrong. Pipe-related notes: because dedup/FRBR is done at the pipe level, there is always content that isn’t deduping/FRBRizing as the end user would expect. At RMIT we ingest resources from a variety of locations (with 10 active pipes); some resources may be available in multiple pipes, and/or in PCI. When some records don’t dedup but others do, this causes confusion, especially for staff. The screenshot shows a record that is the same in our research repository and our Research Bank, yet the two don’t dedup because they are in different pipes. In this instance the TN_rmit_res33432 record is an RMIT-originating record that has found its way into the PCI. Note: users don’t see the record ID; I have used a bookmarklet to display it for my own purposes. There is no fix/solution for this.
  25. Records are assigned to FRBR clusters in Primo as part of the pipe. Keys based on the author(s)/titles are compared with other records and if a match is found the record is then added to the same FRBR group. A record can only be part of one FRBR group.
26. There are three different types of keys in the FRBR vector: the Author part, which uses the “main entry” (1XX fields) and, if this is not present, the added entry author fields (7XX); the Title only key, a uniform title from field 130; and the Title part key, the uniform title from field 240 and the title from 245 (except for serials that have a 240). There are other fields included in the normalization rules for when there is no 240 or 245 field, but as records are rejected from Primo if there is no 245$a, these will rarely be used. Not all subfields are used; e.g. subfield $l (Language) in the 240 field is not used, which allows the original and translations of a work to cluster together. The Author part keys and Title part keys are combined to make strings for matching, while the Title only key is matched on its own. There is a detailed explanation on the Ex Libris Customer Centre at: https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/040FRBRization/010The_FRBR_Vector
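As a rough sketch of that key-building (with the record represented as a flat tag-to-string dict, subfield handling elided, and the "K2" label borrowed loosely from the next slides rather than taken from Ex Libris documentation):

```python
def norm(s: str) -> str:
    """Simplified key normalization: lowercase and collapse whitespace."""
    return " ".join(s.lower().split())

def frbr_keys(record: dict) -> dict:
    """Build matchable FRBR keys from a flat tag-to-string record.

    Illustrative sketch only: the title-only key (from 130, here "K2")
    is matched on its own, while the author part (1XX, falling back to
    7XX) and title part (240, falling back to 245) are matched only as
    a combined author|title string.
    """
    keys = {}
    if "130" in record:
        keys["K2"] = norm(record["130"])
    author = record.get("100") or record.get("110") or record.get("700")
    title = record.get("240") or record.get("245")
    if author and title:
        keys["author_title"] = norm(author) + "|" + norm(title)
    return keys

# A record with no author fields and no 130 yields no matchable key,
# so it cannot join a cluster:
print(frbr_keys({"245": "Constitutional and administrative law"}))  # {}
```

This is the situation on the next two slides: a print record with only a 245 produces nothing the matcher can use, while records carrying a uniform title still cluster via the title-only key.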
  27. Sometimes records simply do not have enough information to create matching keys. In this case the print record does not have any author information so there is only a title key. The two online titles that cluster have the same uniform title.
28. The print record only has a Title part key, and this is not used for FRBR matching on its own. The electronic records have K2 fields, which are used on their own for matching. Note: the keys used for matching are stored in the p_frbr_keys table.
  29. If the author changes between editions (which often happens with legal works), the keys won’t match and so they do not cluster.
  30. The 9th and 10th editions have a 100 field for “Roberts, Harry”. The 7th and 8th editions have 700 fields for “Cloughton, David” and “Riley, Denis”.
  31. In this example the reason for the records not clustering is not immediately apparent.
  32. Primo can only work with the metadata as found in the records - if this is incorrect, as in this case (non-filing indicator), the records will not cluster.
33. The University of Oxford has had Primo since 2008, but until mid-2011, when we moved to Aleph, there was also a separate OPAC. Moving to “Primo only” meant staff in the libraries started to look more closely at the clustering. As part of a review we trialled (in a test version) turning off clustering. After staff testing and usability testing, the decision was made to go with partial clustering, and there have not been any calls to change this. The normalization rules to exclude clustering make use of fields from the Aleph records, both standard FMT fields and local RTP (Record Type) fields, which are how we identify most of the pre-1830 books. We also have a local “SOL” field that we can use to exclude individual records from clustering.
  34. When we were reviewing clustering at Oxford, we had the default sort order set to “Date newest”. Some of the complaints that people had about clustering were because the specific record they wanted could be hard to find within a cluster. Changing to sorting by “relevance” helped with this. However, there are times when the “relevant” record is unexpected.
  35. In this example, Primo is considering the Spanish translation to be the most relevant and is presenting it as the “top” record in the cluster
36. See the screenshot for an example of two records not deduping (record 1 and record 3) because of the ampersand in the title. Record 1 is the 5th and 3rd editions FRBRized; record 3 is the 2nd and 4th editions FRBRized. Users would expect all of these records to be found under one record.
37. In this example FRBR is behaving correctly; however, the confusion it caused meant we had to change the system's behaviour for our users.
38. Note: setting t=99 is a solution we typically try to avoid, as it risks creating very complicated frbr:t normalization rules. Typically we try instead to edit the cataloguing to prevent dedup.
39. While those of us who work closely with Primo have a concept of FRBR - what it is and how it works in Primo - the majority of library staff and our users do not have this understanding, which makes the display in Primo confusing. Have you tried explaining FRBR to your staff or users? I’ve explained it many times to staff and there is still not a clear understanding of what it is within Primo. To help overcome this confusion, staff at RMIT are working on an online IST (in-service training) module specifically related to dedup and FRBR, to help educate our staff, who can in turn help the users.
40. Sharing how dedup and clustering are presented in the New vs. Classic Primo UI: in some instances the New UI is more user-friendly in the way it presents the records, e.g. deduped print and electronic material appear in the same full record instead of in separate tabs.
41. In the Classic view, “versions” was easily lost in the top right-hand corner of the record; now it is part of the record’s availability. Note: we are thinking of changing the terminology from “versions” to “editions and formats”, in the hope of using terminology that better explains the functionality to our end users.