Clusters from Outer Space
Primo Deduping and FRBRizing in Context and Reality
Laura Akerman, Nathalie Schulz, Amelia Rowe
With help from Lukas Koster
IGELU Annual Meeting, September 12, 2017, St. Petersburg, Russia
1. Why do librarians bring things together?
It’s called “collocation”...
Cutter’s rules for a dictionary catalog, 1875
Functional Requirements for Bibliographic Records, 1991
The study uses an entity analysis technique that begins by isolating the entities that are the key objects of interest to users of bibliographic records. The study then identifies the characteristics or attributes associated with each entity and the relationships between entities that are most important to users in formulating bibliographic searches, interpreting responses to those searches, and “navigating” the universe of entities described in bibliographic records.
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
It’s all about what users do
● using the data to find materials that correspond to the user’s stated search criteria (e.g., in the context of a search
for all documents on a given subject, or a search for a recording issued under a particular title);
● using the data retrieved to identify an entity (e.g., to confirm that the document described in a record corresponds to
the document sought by the user, or to distinguish between two texts or recordings that have the same title);
● using the data to select an entity that is appropriate to the user’s needs (e.g., to select a text in a language the user
understands, or to choose a version of a computer program that is compatible with the hardware and operating
system available to the user);
● using the data in order to acquire or obtain access to the entity described (e.g., to place a purchase order for a
publication, to submit a request for the loan of a copy of a book in a library’s collection, or to access online an
electronic document stored on a remote computer).
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
FRBR Work
work: a distinct intellectual or artistic creation.*
● An abstract entity - no one material item to point to
● Recognized in realizations or expressions;
● Work is the commonality of content between and among various expressions (example: Homer’s Iliad)
● Sometimes difficult to define boundaries; differences may be cultural.
IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur, München, 1998.
FRBR Expression
expression: the intellectual or artistic realization of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms
● Any change in intellectual or artistic content constitutes a new expression
● Change in form (e.g. from alphanumeric to spoken word) - new expression
● Changes in physical form (e.g. typeface) are not a new expression
● Example of a new expression: a translation
● My own “layman’s” term would be “version”
2. How do librarians bring things together?
Technology made this change...
Card Catalog - linear arrangement
● Various ways of organizing cards, but the principle of bringing together the various versions of a work.
● “Deduping” could be adding call numbers for print and microform to the same card.
A.L.A. Rules for Filing Catalog Cards, 1942. “Second printing, with corrections, April, 1943.” https://catalog.hathitrust.org/Record/002433836
Here you see in (b) an alternative rule, something like the origin of the “uniform title” concept - organizing all translations via a heading for the original title and language.
California Digital Library “dedup” algorithm
“DLA merges book format records through a complex algorithm that assigns numeric "weights" for
matches on different parts of the bibliographic record. When the total of these weights reaches a certain
level, the records are considered to be sufficiently alike to warrant bringing them together as a single
database record. If the total weight does not reach this level, the records are not merged.
Not all data elements have to match exactly for the records to be merged. The use of weighting means
that some variation between the records can be tolerated, as long as the overall score is high enough to
be considered a match.”
Coyle, Karen. Technical Report No. 6: Rules for Merging MELVYL® Records. Revised June 1992 (copy provided privately).
See also Coyle, Karen, and Linda Gallaher-Brown. “Record matching: an expert algorithm.” ASIS ’85: Proceedings of the American Society for Information Science (ASIS) 48th Annual Meeting, vol. 22, 1985.
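The weighted-match idea quoted above can be sketched in a few lines. The field names, weights, and threshold below are invented for illustration; the real MELVYL rules score many more parts of the record and give partial-match credit.

```python
# Minimal sketch of a CDL-style weighted record-matching algorithm.
# Field names, weights, and the threshold are illustrative only.

MATCH_WEIGHTS = {"title": 40, "date": 20, "publisher": 15, "pagination": 15}
MISMATCH_PENALTIES = {"title": -100, "date": -30}
THRESHOLD = 60  # records merge only if the total weight reaches this level


def match_score(rec_a: dict, rec_b: dict) -> int:
    """Sum positive weights for matching fields and penalties for conflicts.

    A field missing from either record contributes nothing, which is how
    weighting lets some variation between records be tolerated.
    """
    score = 0
    for field, weight in MATCH_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # absent data neither helps nor hurts
        if a == b:
            score += weight
        else:
            score += MISMATCH_PENALTIES.get(field, 0)
    return score


def should_merge(rec_a: dict, rec_b: dict) -> bool:
    """Merge only when the overall score reaches the threshold."""
    return match_score(rec_a, rec_b) >= THRESHOLD
```

With these toy weights, two records agreeing on title and date just reach the threshold, while a date conflict drags the score well below it.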
Other approaches
● VTLS Cataloging system based on FRBR entities: https://www.slideshare.net/VisionaryTechnology/vtls-8-years-experience-with-frbr-rda-4755109
● WorldCat Work Descriptions: http://www.oclc.org/developer/develop/linked-data/worldcat-entities/worldcat-work-entity.en.html
3. Ex Libris’s dedup and FRBR algorithms in Primo
Primo Dedup ...
● Derived from California Digital Library algorithm.
● Roughly equivalent to FRBR “Expression” level - edition of a book, director’s
cut of a movie, recording of a symphony by a particular orchestra on a certain
date
● Should bring together issuances of same content in different formats - print,
electronic, microform, etc. (manifestations)
Primo Dedup merged record
● Provides a merged PNX record - selecting one description out of the “dups”, then adding from all the records:
○ local fields,
○ holdings/items.
● Primo’s selection of “preferred record” is based on the “delivery category” assigned by the Primo
norm rules. Current hierarchy is:
○ SFX resource
○ Electronic resource
○ Metalib resource
○ Physical item
Dedup - matching up “dups”
● Assign a “score” based on full or partial matching of selected fields, as indicated in the
“dedup” section of the PNX (created by normalization rules)
● Same field, different rules for serials, for articles, and for everything else
● If score meets target number, it’s a match.
● The Primo ingest pipe calculates match scores for every incoming record and assigns a
match ID associated with matching records. It also removes deleted records from a
match ID cluster, and adds or removes records to a match ID if their score changes.
● If changes are made to the dedup normalization rules, the records would need to be
updated (renormalization pipe or reload from source) to change.
● “Force dedup” setting on a renormalization pipe might be needed if you tinkered with the dedup normalization rules.
File CDLMatchingProfile can be edited
<handler id="CDLID">
<fieldID>f1,f2,f3,f4</fieldID>
<name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name>
<arguments>
<argument name="recID_match">+200</argument>
<argument name="recID_recIDInvalid_match">+100</argument>
<argument name="recIDInvalid_match">+50</argument>
<argument name="recID_mismatch">-470</argument>
<argument name="recID_recIDInvalid_mismatch">-50</argument>
<argument name="ISBN_match">+85</argument>
<argument name="ISBN_ISSN_match">+30</argument>
<argument name="ISSN_ISSN_match">+10</argument>
<argument name="ISSN_ISBN_mismatch">-225</argument>
https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/050Matching_Records_in_the_Serials_and_Non-Serials_Dedup_Algorithm/020Customizing_the_Dedup_Algorithms
Path: primo/p4_1/ng/primo/home/profile/publish/publish/production/conf/
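One design point worth noting in the handler above: the mismatch penalties (e.g. recID_mismatch at -470) are much larger than the match bonuses, so a single conflicting identifier can veto an otherwise plausible dedup. A toy sketch of that behavior, modeling only record ID and ISBN with the weights quoted above (the real CDLIDComparator also scores ISSNs and “invalid” identifier variants):

```python
# Toy illustration of the asymmetric identifier weights quoted above.
# Only recID and ISBN are modeled; this is not the real comparator.

RECID_MATCH, RECID_MISMATCH = +200, -470
ISBN_MATCH = +85


def id_score(a: dict, b: dict) -> int:
    """Score the identifier portion of a record pair.

    A recID conflict (-470) outweighs a recID match (+200) and an ISBN
    match (+85) combined, so one conflicting ID vetoes the dedup.
    """
    score = 0
    if a.get("recID") and b.get("recID"):
        score += RECID_MATCH if a["recID"] == b["recID"] else RECID_MISMATCH
    if a.get("isbn") and b.get("isbn") and a["isbn"] == b["isbn"]:
        score += ISBN_MATCH
    return score
```

Two records sharing an ISBN but carrying different record IDs score 85 - 470 = -385: the penalty dominates, and the pair will not dedup.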
Normalization Rules can be modified
Dedup Test analyzes why 2 records do or don’t dedup
5. Dedup at Emory Libraries (Laura)
● When we first implemented Primo in 2008-9, we experimented with FRBR but decided it was too
confusing for users. But we wanted dedup.
● Intent of dedup was to bring print, microform, electronic etc. versions of the same content together.
● Our big concern at implementation: we were creating very brief records for electronic serials from SFX, and they became the “merge record,” so our lovely full CONSER print serial records disappeared. Our solution at that time was to add 856 URLs to the serial records, making the print record “electronic” to Primo’s norm rules, which put it on equal footing in the choice of merge record. This was too much manual work.
● With Alma, things are better for e-serials; Community Zone e-journal records are fuller, so we can
choose fuller records for e-serials in Alma.
● From time to time when we have had dedup problems, Ex Libris support staff have suggested we
just use FRBR instead, but we have re-evaluated it and decided “no”.
The algorithm isn’t friendly to rare book cataloging.
The first edition and some of the rare editions of this book were deduping together. Why? Dates...
Solution? Exclude the entire library or location where the rare stuff lives.
(Screenshot of norm rules)
Rare books in microform collection still dedup
245 10 |a Libellus |h [microform] / |c F.
Barholomei de Vsingn Agustiniani de falsis prophetis tam
in persona quã doctrina vitandis a fidelibus. De recta et
mũda predicatiõe euãgelij & quibus conformiter illud
debeat predicari. ...
264 _1 |a Erphurdie [i.e. Erfurt] : |b
[Matthes Maler], |c 1525.
300 __ |a 79 pages (4to) ; |c cm.
336 __ |a text |b txt |2 rdacontent
337 __ |a microform |b h |2
rdamedia
338 __ |a microfiche |b he |2
rdacarrier
500 __ |a Signatures: A-K4.
500 __ |a Title within ornamental
border.
510 4_ |a Panzer (Annales
typographici) |c VI: 503, 63
510 4_ |a Kuczyński |c 2681
245 10 |a Libellus |h [microform] / |c F.
Bartholomei de Vsingen Augustiniani de Merito
bonorum operum. In quo veris argumentis respondet
ad instructionem fratris Mechlerij Franciscani de
bonis operibus. quam inscribit christianã. ...
264 _1 |a Erphurdie [i.e.
Erfurt] : |b [Mathes Maler], |c 1525.
300 __ |a 70 pages (4to) ; |c
cm.
336 __ |a text |b txt |2
rdacontent
337 __ |a microform |b h |2
rdamedia
338 __ |a microfiche |b he |2
rdacarrier
500 __ |a Signatures: A-I4.
500 __ |a Title within
ornamental border.
510 4_ |a Panzer (annales
typographici) |c VI: 503, 62
Other side effects
Our digitized books from the Rose Library Special Collections (not in Alma) no longer dedup with the source physical book records from Alma - even though we retained the record ID in the digital metadata.
Media problems
PNX
Why?
No identifiers in the separate records that could break the dedup.
245 (Title) subfield $p (part) or $n (number) for the volume number doesn’t have enough weight to lower the score sufficiently.
Solution - not nice
Add the MMSID for each Alma record for the 12 volumes to the dedup rule so it will get a t99 “do not dedup” value.
Same title same year (work in progress?...)
Both published 1999. Same composer and work (Chopin, Piano Concertos nos. 1 & 2).
Artists (Arthur Rubinstein, Martha Argerich) are not part of the dedup algorithm!
● Ideas: add a mapping of 024, 028, or 037 (publisher numbers - repeatable, not consistently formatted, not “universal”) as Universal ID (F1)
● Support suggested: add the record ID to F1 (Universal ID) as the last “or” choice, to subtract points and prevent dedups
The American movie directed by Steven Seagal and the Chinese-language movie directed by Corey Yuen with the same title were issued in 2009. I couldn’t find a thumbnail of our copy of the Yuen movie, which is a videodisc.
6. More Dedup problems at RMIT University
(Amelia)
Genki
● Two records whose titles differ by a single number
● The number is displayed in Roman numerals (I and II)
● Primo was deduping the records and only displaying title metadata related to Genki II
● Users couldn’t find Genki I
Screenshot of the DeDup test in Primo BO. This is how we identified that the title field was matching.
Solution: changed the Roman numerals in the title (245 $a) to a numerical representation.
For example: 246 $a Genki 2
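A fix like the Genki one can also be scripted when many titles are affected. The helper below is hypothetical (not a Primo norm rule): it simply rewrites a trailing Roman numeral (I through X) in a title string to an Arabic digit, so “Genki I” and “Genki II” yield distinct, unambiguous titles.

```python
import re

# Hypothetical cleanup helper: rewrite a trailing Roman numeral (I-X)
# in a title to its Arabic equivalent. Not part of Primo itself.

ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5,
         "vi": 6, "vii": 7, "viii": 8, "ix": 9, "x": 10}


def arabize_title(title: str) -> str:
    """Replace a trailing Roman-numeral token with an Arabic digit.

    Unknown tokens are left unchanged; numerals embedded inside a word
    (e.g. the final "i" of "Genki") are not touched.
    """
    def repl(match: "re.Match") -> str:
        token = match.group(1).lower()
        return str(ROMAN.get(token, match.group(1)))

    return re.sub(r"\b([ivx]+)\s*$", repl, title, flags=re.IGNORECASE)
```

The word-boundary anchor matters: without it, the final “i” of “Genki” itself would be rewritten.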
More Dedup problems (Amelia)
Dedup only happens within a single pipe
7. Primo “FRBR clustering” (Nathalie)
● Simpler algorithm
● Uses author-title (or title only) keys to create clusters of records for a work.
● In the FRBRization part of a pipe, if a match is found based on the keys the
record is added to the same FRBR group.
FRBR matching
FRBR vector (simplified explanation)
K1 - Author part key (Fields 100 or 110 or 111 OR 700, 710, 711)
K2 - Title only key (Field 130)
K3 - Title part key (Non-serials: 240 and 245; serials: 240, or 245 if no 240 exists)
● Not all subfields are used.
● Normalization to remove punctuation, change to lowercase, etc.
● K1 and K3 are combined for matching, K2 is not.
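The key construction above can be approximated as follows. The normalization and field selection are simplified (real Primo rules pick specific MARC subfields and apply more transformations), and the “title~author” key shape mirrors the PNX examples on the following slides.

```python
import re

# Simplified sketch of FRBR match-key construction: K1 from author
# headings, K3 from the title, normalized and combined as "title~author".
# Real Primo rules use specific MARC subfields and more normalization.


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def frbr_match_keys(authors: list, title: str) -> set:
    """Combine each author key (K1) with the title key (K3)."""
    k3 = normalize(title)
    return {f"{k3}~{normalize(author)}" for author in authors}
```

Run against the Riley example above, the 9th/10th editions produce only “riley on business interruption insurance~roberts harry”, while the 7th/8th editions produce keys ending in “~cloughton david” and “~riley denis” - no key in common, hence no cluster.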
FRBR problems (Nathalie, Bodleian Libraries)
● Records that you want to cluster, that don’t
● Records that cluster, that you don’t want to
● Sort order within clusters
(Examples are from http://solo.bodleian.ox.ac.uk - which has FRBR turned on, but
not dedup)
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 1
FRBR problems (Nathalie)
FRBR section of the PNX records
Print record
<k3>$$Kjournal of women politics and policy$$AT</k3>
Key used for matching: none
Electronic records
<k2>$$Kjournal of women politics & policy online$$ATO</k2>
<k3>$$Kjournal of women politics and policy$$AT</k3>
Key used for matching: journal of women politics & policy online
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 2
FRBR problems (Nathalie)
FRBR section of the PNX records
9th and 10th editions:
<k1>$$Kroberts harry$$AA</k1>
<k3>$$Kriley on business interruption insurance$$AT</k3>
Key used for matching: riley on business interruption insurance~roberts harry
7th and 8th editions:
<k1>$$Kcloughton david$$AA</k1>
<k1>$$Kriley denis$$AA</k1>
<k3>$$Kriley on business interruption insurance$$AT</k3>
Keys used for matching: riley on business interruption insurance~cloughton david
riley on business interruption insurance~riley denis
FRBR problems (Nathalie)
● Records that you want to cluster, that don’t. Example 3
FRBR problems (Nathalie)
FRBR section of the PNX records
Print record - incorrect metadata! (245 14 $a Three sisters: the nonfiling indicator of 4 strips the first four characters, producing “e sisters”)
<k1>$$Kcaldwell lucy 1981$$AA</k1>
<k3>$$Ke sisters$$AT</k3>
Electronic Record
<k1>$$Kcaldwell lucy 1981$$AA</k1>
<k3>$$Kthree sisters$$AT</k3>
FRBR problems (Nathalie)
● Records that cluster, that you don’t want to
○ This is subjective!
○ The normalization rules can be used to exclude records from clustering by assigning
“<t>99</t>”
● Oxford case-study
○ Excluded from clustering: printed maps, printed music, sound recordings, video recordings,
computer software, and printed books prior to 1830.
○ Individual records can also be excluded by adding a local field to the Aleph record (which is
used by the normalization rules).
FRBR problems (Nathalie)
● Sort order within clusters
○ Set in the Back office.
● Oxford case-study
○ At Oxford we have chosen relevance, as that works best for people doing known-item searches: the result they want will usually be the first record in the cluster.
○ However, Date-newest would be preferable in some situations (e.g. multiple editions of a textbook)
○ Sometimes the most “relevant” record is not what you would expect ….
FRBR problems (Nathalie)
FRBR problems (Nathalie)
8. FRBR problems (Amelia)
FRBR clustering unexpectedly not occurring - for example, because of minor differences in cataloging
Solution (to be implemented)
Add transformations to Normalization rules - FRBR Section
(thank you, Nathalie, for the solution to this problem)
More FRBR problems (Amelia)
Tecnica dei modelli
● Fashion series split into 3 volumes
● Each volume has its own Alma record
● Primo was clustering the records and only displaying the $n information for
volume 3 in the search results
● Users couldn’t find volumes 1 and 2
Solution: add t=99 for records with the series title 240 $a Tecnica dei modelli
Preventing FRBR (Amelia)
Other FRBR problems (Amelia)
● User understanding
○ How much do users understand about clustering?
○ How much do they need to know?
● Staff training requirements
○ How much do staff understand about clustering?
○ How much do they need to know?
■ Enough to help the users
Above: Screenshot of deduped item in Classic UI
Below: Screenshot of deduped item in New UI
DeDup : Classic Primo and New Primo
FRBR : Classic Primo and New Primo
Above: Screenshot of clustered item in Classic UI
Below: Screenshot of clustered item from New UI
Summary of issues with Primo
● 245 $n and $p not given enough weight
● Inability to DeDup or Cluster across all collections (example: Alma and PCI)
● Matching depends on textual strings in the metadata - this can have errors or
legitimate variations
● Deduping should not happen for rare book cataloging
● Lack of control on choice of the “merged record” for Deduping
● Lack of reliable identifiers in records especially for media….
● Lack of control...
The Future...
● New field approved to be added to MARC for work identifiers (URIs): 758
● Linked Data! If you define an Entity… it must have an Identifier (URI: URL or URN).
● RDA/FRBR “Work” vs BIBFRAME “Work” (RDA Expression?)
● Not clear where the overlaps or agreements are in version 2.0
● BIBFRAME still being refined
Questions:
How might we address problems with deduping and FRBR clustering?
Should the algorithms be modified?
Should Work and Expression identifiers be generated on the fly in Alma and Primo, or be generated once, stored, and editable?
Is Primo Dedup merged display best for users? What other approaches might
work better?
Contacts:
Laura Akerman, Discovery Systems and Metadata Librarian, Emory University
liblna@emory.edu
Nathalie Schulz, Systems Analyst, Bodleian Libraries, University of Oxford
Nathalie.Schulz@bodleian.ox.ac.uk
Amelia Rowe, Applications Librarian, RMIT University
amelia.rowe2@rmit.edu.au
Credits:
● Opening image: NASA, Hubble Space Telescope image, Gas Clouds and Star Clusters, NGC 1850.jpg
● Image from Cutter, Charles A., 1837-1903. Rules for a Printed Dictionary Catalogue. Washington: Government Printing Office, 1875. Retrieved from HathiTrust, https://catalog.hathitrust.org/Record/009394960
● Frank Sinatra and Martha Argerich album cover and Above the Law (Segal) DVD cover thumbnails from
Amazon.com
● Artur Rubinstein album cover thumbnail from Discogs.com
● Above the Law (Yuen) DVD thumbnail from Internet Movie Database

PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 

Clusters from outer space: Primo Deduping and FRBRizing in Context and Reality

  • 1. Clusters from Outer Space Primo Deduping and FRBRizing in Context and Reality Laura Akerman, Nathalie Schulz, Amelia Rowe With help from Lukas Koster IGELU Annual Meeting September 12 2017 St. Petersburg, Russia
  • 2. 1. Why do librarians bring things together? It’s called “collocation”...
  • 3. Cutter’s rules for a dictionary catalog, 1875
  • 4. Functional Requirements for Bibliographic Records, 1991 The study uses an entity analysis technique that begins by isolating the entities that are the key objects of interest to users of bibliographic records. The study then identifies the characteristics or attributes associated with each entity and the relationships between entities that are most important to users in formulating bibliographic searches, interpreting responses to those searches, and “navigating” the universe of entities described in bibliographic records. IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K . G. Saur München 1998
  • 5. It’s all about what users do ● using the data to find materials that correspond to the user’s stated search criteria (e.g., in the context of a search for all documents on a given subject, or a search for a recording issued under a particular title); ● using the data retrieved to identify an entity (e.g., to confirm that the document described in a record corresponds to the document sought by the user, or to distinguish between two texts or recordings that have the same title); ● using the data to select an entity that is appropriate to the user’s needs (e.g., to select a text in a language the user understands, or to choose a version of a computer program that is compatible with the hardware and operating system available to the user); ● using the data in order to acquire or obtain access to the entity described (e.g., to place a purchase order for a publication, to submit a request for the loan of a copy of a book in a library’s collection, or to access online an electronic document stored on a remote computer). IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K . G. Saur München 1998
  • 6. FRBR Work work: a distinct intellectual or artistic creation.* ● An abstract entity - no one material item to point to ● Recognized in realizations or expressions; ● Work is the commonality of content between and among various expressions (example: Homer’s Iliad) ● Sometimes difficult to define boundaries; differences may be cultural. IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records. K. G. Saur München 1998
  • 7. FRBR Expression expression: the intellectual or artistic realization of a work in the form of alphanumeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms ● Any change in intellectual or artistic content constitutes a new expression ● Change in form (e.g. from alphanumeric to spoken word) - new expression ● Changes in physical form (e.g. typeface) are not a new expression ● Example of a new expression - a translation ● My own “layman’s” term would be “version”
  • 8. 2. How do librarians bring things together? Technology made this change...
  • 9. Card Catalog - linear arrangement ● Various ways of organizing cards, but the principle of bringing together the various versions of a work. ● “Deduping” could be adding call numbers for print and microform to the same card. A. L. A. rules for filing catalog cards 1942. “Second printing, with corrections, April, 1943” https://catalog.hathitrust.org/Record/002433836
  • 10. Here you see in (b) alternative rule, something like the origin of “uniform title” concept - organizing all translations via a heading for the original title and language.
  • 11. California Digital Library “dedup” algorithm “DLA merges book format records through a complex algorithm that assigns numeric "weights" for matches on different parts of the bibliographic record. When the total of these weights reaches a certain level, the records are considered to be sufficiently alike to warrant bringing them together as a single database record. If the total weight does not reach this level, the records are not merged. Not all data elements have to match exactly for the records to be merged. The use of weighting means that some variation between the records can be tolerated, as long as the overall score is high enough to be considered a match.” Coyle, Karen. Technical Report No. 6 RULES FOR MERGING MELVYL(R) RECORDS* Revised June 1992 (copy provided privately). See also Coyle, Karen, and Linda Gallaher-Brown. "Record matching: an expert algorithm." ASIS'85: Proceedings of the American Society for Information Science (ASIS) 48th Annual Meeting. Vol. 22. 1985.
  • 12. Other approaches ● VTLS Cataloging system based on FRBR entities https://www.slideshare.net/VisionaryTechnology/vtls-8-years-experience-with- frbr-rda-4755109 ● WorldCat Work Descriptions: http://www.oclc.org/developer/develop/linked- data/worldcat-entities/worldcat-work-entity.en.html
  • 13. 3. Ex Libris’s dedup and FRBR algorithms in Primo
  • 14. Primo Dedup ... ● Derived from California Digital Library algorithm. ● Roughly equivalent to FRBR “Expression” level - edition of a book, director’s cut of a movie, recording of a symphony by a particular orchestra on a certain date ● Should bring together issuances of same content in different formats - print, electronic, microform, etc. (manifestations)
  • 15. Primo Dedup merged record ● Provides a merged record PNX - selecting one description out of the “dups”, then adding from all the records: ○ local fields, ○ holdings/items from all the records. ● Primo’s selection of “preferred record” is based on the “delivery category” assigned by the Primo norm rules. Current hierarchy is: ○ SFX resource ○ Electronic resource ○ Metalib resource ○ Physical item
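The delivery-category hierarchy above can be sketched as a simple ranked choice. The category names follow the slide; the function and record shapes are illustrative assumptions, not Primo's actual implementation.

```python
# Illustrative sketch (not Primo's code): choose the preferred "merge record"
# by ranking each dup's delivery category against the hierarchy above.
PREFERENCE = [
    "SFX resource",
    "Electronic resource",
    "Metalib resource",
    "Physical item",
]


def preferred_record(records):
    """Return the record whose delivery category ranks highest in PREFERENCE."""
    return min(records, key=lambda r: PREFERENCE.index(r["delivery_category"]))
```

Under this sketch, an SFX record would always win over a physical record as the base of the merged PNX, which is exactly the behavior the Emory example below runs into with brief SFX serial records.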
  • 16. Dedup - matching up “dups” ● Assign a “score” based on full or partial matching of selected fields, as indicated in the “dedup” section of the PNX (created by normalization rules) ● Same fields, but different rules for serials, for articles, and for everything else ● If the score meets the target number, it’s a match. ● The Primo ingest pipe calculates match scores for every incoming record and assigns a match ID associated with matching records. It also removes deleted records from a match ID cluster, and adds or removes records from a match ID if their score changes. ● If changes are made to the dedup normalization rules, the records need to be reprocessed (renormalization pipe or reload from source) for the changes to take effect. ● A “force dedup” setting on a renormalization pipe might be needed if you tinkered with
  • 17. File CDLMatchingProfile can be edited <handler id="CDLID"> <fieldID>f1,f2,f3,f4</fieldID> <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name> <arguments> <argument name="recID_match">+200</argument> <argument name="recID_recIDInvalid_match">+100</argument> <argument name="recIDInvalid_match">+50</argument> <argument name="recID_mismatch">-470</argument> <argument name="recID_recIDInvalid_mismatch">-50</argument> <argument name="ISBN_match">+85</argument> <argument name="ISBN_ISSN_match">+30</argument> <argument name="ISSN_ISSN_match">+10</argument> <argument name="ISSN_ISBN_mismatch">-225</argument> https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/050Matching_Records_in_the_Serials_and_Non -Serials_Dedup_Algorithm/020Customizing_the_Dedup_Algorithms Path: primo/p4_1/ng/primo/home/profile/publish/publish/production/conf/
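A minimal sketch of the weighted-match scoring behind this profile: each field comparison adds or subtracts points, and two records dedup only when the total clears a threshold. The record-ID and ISBN weights echo the excerpt above; the threshold value and record shapes are hypothetical, and the real algorithm compares many more fields.

```python
# Illustrative weighted-match scoring (not Primo's actual code).
# Weights for recID and ISBN follow the CDLMatchingProfile excerpt;
# THRESHOLD is a made-up target score for the sketch.
WEIGHTS = {
    "recID_match": 200,
    "recID_mismatch": -470,
    "ISBN_match": 85,
}
THRESHOLD = 275  # hypothetical target score


def dedup_score(a, b):
    """Score two records: matching evidence raises the score, conflicts lower it."""
    score = 0
    if a.get("rec_id") and b.get("rec_id"):
        score += (WEIGHTS["recID_match"] if a["rec_id"] == b["rec_id"]
                  else WEIGHTS["recID_mismatch"])
    if a.get("isbn") and b.get("isbn") and a["isbn"] == b["isbn"]:
        score += WEIGHTS["ISBN_match"]
    return score


def is_dup(a, b):
    return dedup_score(a, b) >= THRESHOLD
```

The point of the weighting is tolerance: a missing or slightly different field does not by itself block a merge, but a hard conflict (like two different record IDs) carries a large enough penalty to sink the total.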
  • 18. Normalization Rules can be modified
  • 19. Dedup Test analyzes why 2 records do or don’t dedup
  • 20.
  • 21. 5. Dedup at Emory Libraries (Laura) ● When we first implemented Primo in 2008-9, we experimented with FRBR but decided it was too confusing for users. But we wanted dedup. ● The intent of dedup was to bring print, microform, electronic etc. versions of the same content together. ● Our big concern at implementation was that we were creating very brief records for electronic serials from SFX, and when they became the “merge record”, our lovely full print CONSER serial records disappeared. Our solution at that time was to add 856 URLs to the serial records, making the print record “electronic” to Primo’s norm rules, which put it on equal footing in the choice of merge record. This was too much manual work. ● With Alma, things are better for e-serials; Community Zone e-journal records are fuller, so we can choose fuller records for e-serials in Alma. ● From time to time when we have had dedup problems, Ex Libris support staff have suggested we just use FRBR instead, but we have re-evaluated it and decided “no”.
  • 22. The algorithm isn’t friendly to rare book cataloging. The first edition and some of the rare editions of this book were deduping together. Why? Dates...
  • 23. Solution? Exclude entire library or location where the rare stuff lives (Screenshot of norm rules)
  • 24. Rare books in microform collection still dedup
  • 25. 245 10 |a Libellus |h [microform] / |c F. Barholomei de Vsingn Agustiniani de falsis prophetis tam in persona quã doctrina vitandis a fidelibus. De recta et mũda predicatiõe euãgelij & quibus conformiter illud debeat predicari. ... 264 _1 |a Erphurdie [i.e. Erfurt] : |b [Matthes Maler], |c 1525. 300 __ |a 79 pages (4to) ; |c cm. 336 __ |a text |b txt |2 rdacontent 337 __ |a microform |b h |2 rdamedia 338 __ |a microfiche |b he |2 rdacarrier 500 __ |a Signatures: A-K4. 500 __ |a Title within ornamental border. 510 4_ |a Panzer (Annales typographici) |c VI: 503, 63 510 4_ |a Kuczyński |c 2681 245 10 |a Libellus |h [microform] / |c F. Bartholomei de Vsingen Augustiniani de Merito bonorum operum. In quo veris argumentis respondet ad instructionem fratris Mechlerij Franciscani de bonis operibus. quam inscribit christianã. ... 264 _1 |a Erphurdie [i.e. Erfurt] : |b [Mathes Maler], |c 1525. 300 __ |a 70 pages (4to) ; |c cm. 336 __ |a text |b txt |2 rdacontent 337 __ |a microform |b h |2 rdamedia 338 __ |a microfiche |b he |2 rdacarrier 500 __ |a Signatures: A-I4. 500 __ |a Title within ornamental border. 510 4_ |a Panzer (annales typographici) |c VI: 503, 62
  • 26. Other side effects - Our digitized books from the Rose Library Special collections (not in Alma) no longer dedup with the source physical book records from Alma - even though we retained the record ID in the digital metadata.
  • 28.
  • 29. PNX
  • 30. Why? No identifiers in the separate records that could break the dedup. 245 (Title) subfield $n (number of part) or $p (part) for the volume number doesn’t have enough weight to lower the score enough.
  • 31. Solution - not nice Add the MMSID for each Alma record for the 12 volumes to the Dedup rule so it will get a t99 “do not dedup” value.
  • 32. Same title same year (work in progress?...) Both published 1999. Same composer and work (Chopin, Piano Concertos nos. 1 & 2) Artists (Arthur Rubinstein, Martha Argerich) are not part of the dedup algorithm! ● Ideas: Add mapping of 024 or 028 or 037 - (publisher numbers, repeatable, not consistently formatted, not “universal”) as Universal ID (F1) ● Support suggested: Add record ID to F1 (Universal ID) as the last “or” choice, to subtract points/prevent dedups The American movie starring Steven Seagal and the Chinese-language movie directed by Corey Yuen with the same title were issued in 2009. I couldn’t find a thumbnail of our copy of the Yuen movie, which is a videodisc.
  • 33. 6. More Dedup problems at RMIT University (Amelia) Genki ● Two records whose titles differ only by a single number ● The number displayed in roman numerals I and II ● Primo was deduping the records and only displaying title metadata related to Genki II ● Users couldn’t find Genki I
  • 34. Screenshot of the DeDup test in Primo BO. This is how we identified the title field was matching.
  • 35. Solution = Changed roman numerals in the title (245 $a) to a numerical representation. For example: 246 $a Genki 2
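A small hypothetical helper, not part of Primo or the RMIT workflow, showing the kind of preprocessing this fix amounts to: replacing a trailing roman numeral in a title with an arabic number so the two titles no longer normalize to the same dedup key.

```python
import re

# Hypothetical preprocessing sketch: only i/v/x are handled, which covers
# volume-style numerals like I, II, III, IV.
ROMAN = {"i": 1, "v": 5, "x": 10}


def roman_to_int(numeral):
    """Convert a simple roman numeral (i/v/x only) to an integer."""
    s = numeral.lower()
    total = 0
    for ch, nxt in zip(s, list(s[1:]) + [None]):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one (e.g. IV).
        total += -value if nxt is not None and ROMAN[nxt] > value else value
    return total


def arabicize_title(title):
    """Replace a trailing roman numeral (I, II, IV, ...) with digits."""
    return re.sub(
        r"\b([ivx]+)\s*$",
        lambda m: str(roman_to_int(m.group(1))),
        title,
        flags=re.IGNORECASE,
    )
```

With this, "Genki II" becomes "Genki 2" while titles that merely end in a letter like "i" as part of a word (e.g. "Genki" itself) are left alone, because the word-boundary anchor only matches a standalone numeral.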
  • 36. More Dedup problems (Amelia) Dedup only happens within a pipe
  • 37. 7. Primo “FRBR clustering” (Nathalie) ● Simpler algorithm ● Uses author-title (or title only) keys to create clusters of records for a work. ● In the FRBRization part of a pipe, if a match is found based on the keys the record is added to the same FRBR group.
  • 38. FRBR matching FRBR vector (simplified explanation) K1 - Author part key (Fields 100 or 110 or 111 OR 700, 710, 711) K2 - Title only key (Field 130) K3 - Title part key (Not Serials: 240 and 245; Serials: 240 or if does not exist 245) ● Not all subfields are used. ● Normalization to remove punctuation, change to lowercase, etc. ● K1 and K3 are combined for matching, K2 is not.
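The key-building steps above can be sketched as follows. The normalization details (exactly which punctuation is stripped, how whitespace is collapsed) and the function shapes are assumptions for illustration, not Primo's exact rules; the tilde-joined "title~author" form follows the PNX examples later in the deck.

```python
import re
import string

# Illustrative sketch of FRBR key construction (not Primo's actual rules):
# normalize author (K1) and title (K3) strings, then combine them into
# a single match key. K2 (uniform-title-only) keys are matched separately.


def normalize(value):
    """Lowercase, drop punctuation, collapse runs of whitespace."""
    value = value.lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", value).strip()


def frbr_match_key(author, title):
    """Combine the K3 (title) and K1 (author) parts into one match key."""
    return f"{normalize(title)}~{normalize(author)}"
```

Because matching rides entirely on these normalized strings, any variation the normalization does not smooth over (a typo, "&" vs "and", a different author heading) yields a different key and splits the cluster, which is exactly what the Bodleian examples that follow show.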
  • 39. FRBR problems (Nathalie, Bodleian Libraries) ● Records that you want to cluster, that don’t ● Records that cluster, that you don’t want to ● Sort order within clusters (Examples are from http://solo.bodleian.ox.ac.uk - which has FRBR turned on, but not dedup)
  • 40. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 1
  • 41. FRBR problems (Nathalie) FRBR section of the PNX records Print record <k3>$$Kjournal of women politics and policy$$AT</k3> Key used for matching: none Electronic records <k2>$$Kjournal of women politics & policy online$$ATO</k2> <k3>$$Kjournal of women politics and policy$$AT</k3> Key used for matching: journal of women politics & policy online
  • 42. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 2
  • 43. FRBR problems (Nathalie) FRBR section of the PNX records 9th and 10th editions: <k1>$$Kroberts harry$$AA</k1> <k3>$$Kriley on business interruption insurance$$AT</k3> Key used for matching: riley on business interruption insurance~roberts harry 7th and 8th editions: <k1>$$Kcloughton david$$AA</k1> <k1>$$Kriley denis$$AA</k1> <k3>$$Kriley on business interruption insurance$$AT</k3> Keys used for matching: riley on business interruption insurance~cloughton david riley on business interruption insurance~riley denis
  • 44. FRBR problems (Nathalie) ● Records that you want to cluster, that don’t. Example 3
  • 45. FRBR problems (Nathalie) FRBR section of the PNX records Print record - incorrect metadata! (24514 $aThree sisters) <k1>$$Kcaldwell lucy 1981$$AA</k1> <k3>$$Ke sisters$$AT</k3> Electronic Record <k1>$$Kcaldwell lucy 1981$$AA</k1> <k3>$$Kthree sisters$$AT</k3>
  • 46. FRBR problems (Nathalie) ● Records that cluster, that you don’t want to ○ This is subjective! ○ The normalization rules can be used to exclude records from clustering by assigning “<t>99</t>” ● Oxford case-study ○ Excluded from clustering: printed maps, printed music, sound recordings, video recordings, computer software, and printed books prior to 1830. ○ Individual records can also be excluded by adding a local field to the Aleph record (which is used by the normalization rules).
  • 47. FRBR problems (Nathalie) ● Sort order within clusters ○ Set in the Back Office. ● Oxford case-study ○ At Oxford we have chosen relevance, as that works best for people doing known-item searches: the result they want will usually be the first record in the cluster. ○ However, Date-newest would be preferable in some situations (e.g. multiple editions of a textbook) ○ Sometimes the most “relevant” record is not what you would expect ….
  • 50. 8. FRBR problems (Amelia) FRBR clustering unexpectedly not occurring - for example, because of minor differences in cataloging
  • 51. Solution (to be implemented) Add transformations to the Normalization rules - FRBR section (thank you, Nathalie, for the solution to this problem)
  • 52. More FRBR problems (Amelia) Tecnica dei modelli ● Fashion series split into 3 volumes ● Each volume has its own Alma record ● Primo was clustering the records and only displaying the $n information for volume 3 in the search results ● Users couldn’t find volumes 1 and 2
  • 53. Preventing FRBR (Amelia) Solution: Add t=99 for records with the series title 240 $a Tecnica dei modelli
  • 54. Other FRBR problems (Amelia) ● User understanding ○ How much do users understand about clustering? ○ How much do they need to know? ● Staff training requirements ○ How much do staff understand about clustering? ○ How much do they need to know? ■ Enough to help the users
  • 55. DeDup: Classic Primo and New Primo Above: Screenshot of deduped item in Classic UI Below: Screenshot of deduped item in New UI
  • 56. FRBR : Classic Primo and New Primo Above: Screenshot of clustered item in Classic UI Below: Screenshot of clustered item from New UI
  • 57. Summary of issues with Primo ● 245 $n and $p not given enough weight ● Inability to DeDup or Cluster across all collections (example: Alma and PCI) ● Matching depends on textual strings in the metadata - this can have errors or legitimate variations ● Deduping should not happen for rare book cataloging ● Lack of control on choice of the “merged record” for Deduping ● Lack of reliable identifiers in records especially for media…. ● Lack of control...
  • 58. The Future... ● New field approved to be added to MARC for work identifiers (URIs): 758 ● Linked Data! If you define an Entity… it must have an Identifier (URI: URL or URN). ● RDA/FRBR “Work” vs BIBFRAME “Work” (RDA Expression?) ● Not clear where the overlaps or agreements are in version 2.0 ● BIBFRAME still being refined
  • 59. Questions: How might we address problems with deduping and FRBR clustering? Should the algorithms be modified? Should Work and Expression identifiers be generated on-the-fly in Alma and Primo, or be generated once, be stored and be editable? Is Primo Dedup merged display best for users? What other approaches might work better?
  • 60. Contacts: Laura Akerman, Discovery Systems and Metadata Librarian, Emory University liblna@emory.edu Nathalie Schulz, Systems Analyst, Bodleian Libraries, University of Oxford Nathalie.Schulz@bodleian.ox.ac.uk Amelia Rowe, Applications Librarian, RMIT University amelia.rowe2@rmit.edu.au
  • 61. Credits: ● Opening image: NASA, Hubble Space Telescope image, Gas Clouds and Star Clusters, NGC 1850.jpg ● Image from Cutter, Charles A., 1837-1903, Rules for a printed dictionary catalogue. Washington: Government Printing Office, 1875, retrieved from HathiTrust, https://catalog.hathitrust.org/Record/009394960 ● Frank Sinatra and Martha Argerich album cover and Above the Law (Seagal) DVD cover thumbnails from Amazon.com ● Artur Rubinstein album cover thumbnail from Discogs.com ● Above the Law (Yuen) DVD thumbnail from Internet Movie Database

Editor's Notes

  1. (Laura) The origin of this talk is that I started receiving a spate of dedup issues reported by other librarians at Emory University and thought it would be an interesting topic. But I wanted to take a higher-level view of the process and wanted to include FRBR, which we don’t have experience with. So I put out a call to collaborate on the Primo list and was delighted to find great collaborators in Nathalie Schulz from the Bodleian Libraries, University of Oxford, and Amelia Rowe from RMIT University in Melbourne, Australia.
  2. Before we get into the juicy problems, I will start off with a little background - bear with it… Librarians have been bringing descriptions together for a very long time to assist users to find what they want.
  3. Bringing all versions of a work together before online catalogs involved arranging cards for different versions to be found together in the card catalog, as well as on the shelf, due to the classification numbers assigned to books. Works were generally to be found/identified in the card catalog by author and title, but if there was any ambiguity, a uniform title could be constructed which would be unique in combination with the author name. Certain special authors might have a more elaborate arrangement into sections by language, special sections for complete and selected works, compilations by form (e.g. “Poetical works”). This led to different sorts of uniform titles.
  4. In 1991 the International Federation of Library Associations published Functional Requirements for Bibliographic Records. This was a result of intense committee work to develop metadata requirements for libraries at the national level. The group analyzed both user tasks that needed support, and a definition of conceptual entities that lay behind those tasks and their relationships. This was really a new conceptualization of description for discovery of information resources.
  5. Some of the things we think users do include: determining if an electronic version of the same text and edition that exists in print is available through the library - or vice versa (some users prefer print); determining if a particular described sound recording contains a particular song; finding a specific rare printing of a book described in a specialized bibliography
  6. So in this presentation it is good to look at the FRBR Work and Expression entities and the differences between them. Work is abstract and something that could have grey areas - I am thinking of some manuscripts and serially published things... It’s a mental idea of all the versions of what we think of as being “the same work”.
  7. This is a bit of ancient history for most libraries today, but in a dictionary catalog, alphabetical arrangement of titles (within an author section of the catalog if there was an author main entry) would bring together different editions of books.
8. Merging and deduping of records for the “same content” became necessary when large-scale union catalogs, incorporating records from many sources, became possible with library automation and online catalogs. In an email to me, Karen Coyle, one of the authors of the algorithm, pointed out that author names were not as reliable when it was developed, due to a mix of old and new cataloging rules (AACR1 and AACR2), and that times have changed. This algorithm was the source for the dedup algorithm in Primo.
9. I’ve not had the time to do deep research into other system models but wanted to note these. VTLS is now owned by Innovative Interfaces; their literature speaks of users being able to search once and retrieve all related versions of a work, including those with variant titles and different languages. OCLC has been developing algorithms to identify “work entities” and associate them with clusters of records; these identifiers are available under an Open Data License and can be found in the linked data section of WorldCat records.
10. So this is not a full tutorial, but I’m just going to hit some highlights about how dedup works in Primo. The approach is based on the California Digital Library’s approach to merging “duplicate records” from its database. Only instead of weeding out duplicates (which you should do in your ILS before records get to Primo!), the idea is to combine descriptions of essentially the same content that may have different formats - e.g. electronic, microform, print; CD vs. streaming audio; etc.
11. Here’s where the tricky part comes in. There really isn’t an exact science behind assigning a score to each pair of records and calling it a match if it meets a certain threshold. That was developed through trial and error by the California Digital Library.
12. This is just a small part of the algorithm which, if you have server access to Primo, you can find and edit. You can see that it assigns point scores for various kinds of matches in data elements between two records. I think we have tried editing it a couple of times but have not had much success - if others have, I would be interested to hear your experiences. As far as I know, this file is not protected from being overwritten by updates.
13. Here you have a typical normalization rule for dedup. Field “F5” contains a title. The condition says this rule is only for “non-serials”. It takes the 245 field, including title, subtitle, number and part. Then it heavily normalizes the string to remove initial articles and various punctuation - brackets, ampersands, etc. There’s another rule for F5 for serials, which operates on the 022 field subfield z (invalid ISSN). These rules can be modified - at your own risk!
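To make that concrete, here is a minimal Python sketch of the kind of key-building such a rule performs. The article list, the punctuation set, and the function name are my own illustrative assumptions, not Primo's actual F5 rule.

```python
import re

# Hypothetical sketch of dedup-key title normalization: strip an
# initial English article, lowercase, and drop punctuation such as
# brackets and ampersands. The article list and regexes are
# illustrative assumptions, not Primo's actual F5 rule.
ARTICLES = ("the ", "a ", "an ")

def normalize_title(title: str) -> str:
    key = title.lower().strip()
    for article in ARTICLES:
        if key.startswith(article):
            key = key[len(article):]  # remove the initial article
            break
    key = re.sub(r"[\[\]&.,:;'\"!?()\-]", " ", key)  # strip punctuation
    return " ".join(key.split())                     # collapse whitespace

print(normalize_title("The [complete] poems & songs"))  # complete poems songs
```

Two records whose titles differ only in punctuation or a leading article then produce the same key and can be compared as potential duplicates.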
14. The DEDUP Test utility is a wonderful tool for understanding the complex process by which the dedup stage of record loading determines whether two records match and assigns them a dedupmrg number and match ID - or not. This happens in two stages. First, title, date and record identifier get checked; basically, if the record IDs don’t match but the title and date do, the pair goes on to full comparison, where “points” are assigned for full or partial matches, or subtracted for non-matches. Notice that the short title matches here, which gets 450 points.
  15. Notice that the long title doesn’t match completely, but enough words match so that it still gets 400 points. More about this later...
16. I regret I don’t have a screenshot of the deduped record here, but imagine all of these records deduping together. The date matching actually allows a couple of years’ variance: 25 points are subtracted for the lack of an exact match within two years - not enough to prevent the deduping in some cases.
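The two-stage scoring those slides describe can be sketched like this. The point values echo the examples above (450 for a short-title match, 400 for a near-match on the long title, a 25-point penalty when dates differ by up to two years), but the threshold, the 0.75 overlap cutoff, and the record shape are my own illustrative assumptions, not the real CDL/Primo algorithm.

```python
# Toy scorer echoing the point values mentioned on these slides.
# THRESHOLD, the overlap cutoff, and the +500/+100/-200 values are
# invented for illustration only.
THRESHOLD = 775  # illustrative match threshold

def word_overlap(t1: str, t2: str) -> float:
    """Jaccard overlap of the word sets of two normalized titles."""
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def score_pair(rec_a: dict, rec_b: dict) -> int:
    score = 0
    if rec_a["short_title"] == rec_b["short_title"]:
        score += 450
    if rec_a["long_title"] == rec_b["long_title"]:
        score += 500
    elif word_overlap(rec_a["long_title"], rec_b["long_title"]) >= 0.75:
        score += 400  # "enough words match"
    gap = abs(rec_a["date"] - rec_b["date"])
    if gap == 0:
        score += 100
    elif gap <= 2:
        score -= 25   # near miss: penalized, but often not fatally
    else:
        score -= 200
    return score

a = {"short_title": "columbia years",
     "long_title": "columbia years 1943 1952 complete recordings vol 7",
     "date": 1993}
b = {"short_title": "columbia years",
     "long_title": "columbia years 1943 1952 complete recordings vol 10",
     "date": 1994}
match = score_pair(a, b) >= THRESHOLD  # 450 + 400 - 25 = 825 -> True
```

Here, as with the multi-volume sets described below, two records that differ only in a volume number and a year still clear the threshold and dedup together.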
17. I tried various less drastic changes, but they weren’t working. Our Special Collections library (now called The Stuart A. Rose Library for Manuscripts and Rare Books) was impatient - a class was going to study this autobiography and its editions, and they couldn’t abide Primo clumping them together (along with a lot of other rare editions). This is just one example of many. So we ultimately added a rule to give all records with an item in that library a “do not dedup” value of t=99. Later, we did the same for the special collections location in the Theology library.
  18. But rare books in microform or electronic collections are still clumping...
  19. We want our rare books that we have digitized to dedup with the digital version, but they no longer do so - this is a tradeoff we’ve not found a solution for.
20. Twelve fabulous recordings of “Ol’ Blue Eyes” in our collection. Here’s a somewhat fuzzy image of the cover of Vol. 7, noting the songs “Night and Day”, “But Beautiful”, “The Song is You”, and “What’ll I Do?”
21. When I searched for “Frank Sinatra the Columbia years” in our Primo, I got two results - the record for the set, and a record for “Vol. 10 the Complete Recordings”. What happened to volume 7? If we look at the item details, we can see what happened: Primo deduped all the volumes, so only the contents note of vol. 10 displays.
  22. Viewing the PNX in the PNX viewer and clicking on “Match ID” confirms this
23. The first problem I neutralized in Production by adding “t=99” to these record IDs. Someone reported the second one - the videos - while I was at this conference. There are publisher identifier numbers in tags 024, 028 or 037 that could be added, but those fields are repeatable, there may be inconsistencies in the formatting, and the numbers aren’t universal, so I’m nervous about using them. The suggestion to add the Alma record ID to F1, the Universal ID field to which the 010 (LC card number) is mapped, might break this dedup - but how many others that we do want to happen would it also break? We will test these approaches but are not optimistic. Now I’m going to turn this over to Amelia Rowe, who’ll tell you about more fascinating dedup problems at RMIT University.
24. Items not deduping between collections/pipes is the greatest cause of confusion for our users. We can teach them about dedup and clustering, but if they see behaviour that doesn’t match what they’ve been taught, they get confused and think something is wrong. Pipe-related notes: because dedup/FRBR is done at the pipe level, there is always content that isn’t deduping/FRBRizing as the end user would expect. At RMIT we ingest resources from a variety of locations (with 10 active pipes); some resources may be available in multiple pipes, and/or in PCI. When some records don’t dedup but others do, this causes confusion, especially for staff. The screenshot shows a record that is the same in our research repository and our Research Bank, yet the two don’t dedup because they are in different pipes. In this instance the TN_rmit_res33432 record is an RMIT-originating record that has found its way into the PCI. Note: users don’t see the record ID; I have used a bookmarklet to display it for my own purposes. There is no fix/solution for this.
  25. Records are assigned to FRBR clusters in Primo as part of the pipe. Keys based on the author(s)/titles are compared with other records and if a match is found the record is then added to the same FRBR group. A record can only be part of one FRBR group.
26. There are three different types of keys in the FRBR vector: the Author part, which uses the “main entry” (1XX fields) and, if this is not present, the added entry author fields (7XX); the Title only key, a uniform title from field 130; and the Title part key, the uniform title from field 240 and the title from 245 (except for serials that have a 240). There are other fields included in the normalization rules for when there is no 240 or 245 field, but as records are rejected from Primo if there is no 245$a, these will rarely be used. Not all subfields are used; e.g. subfield $l (Language) in the 240 field is not used, which allows the original and translations of a work to cluster together. The Author part keys and Title part keys are combined to make strings for matching, while the Title only key is matched on its own. There is a detailed explanation on the Ex Libris Customer Centre at: https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Technical_Guide/040FRBRization/010The_FRBR_Vector
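As a rough sketch of that key-building (with the record represented as a flat tag-to-string dict, subfield handling elided, and the "K2" label borrowed loosely from the next slides rather than taken from Ex Libris documentation):

```python
def norm(s: str) -> str:
    """Simplified key normalization: lowercase and collapse whitespace."""
    return " ".join(s.lower().split())

def frbr_keys(record: dict) -> dict:
    """Build matchable FRBR keys from a flat tag-to-string record.

    Illustrative sketch only: the title-only key (from 130, here "K2")
    is matched on its own, while the author part (1XX, falling back to
    7XX) and title part (240, falling back to 245) are matched only as
    a combined author|title string.
    """
    keys = {}
    if "130" in record:
        keys["K2"] = norm(record["130"])
    author = record.get("100") or record.get("110") or record.get("700")
    title = record.get("240") or record.get("245")
    if author and title:
        keys["author_title"] = norm(author) + "|" + norm(title)
    return keys

# A record with no author fields and no 130 yields no matchable key,
# so it cannot join a cluster:
print(frbr_keys({"245": "Constitutional and administrative law"}))  # {}
```

This is the situation on the next two slides: a print record with only a 245 produces nothing the matcher can use, while records carrying a uniform title still cluster via the title-only key.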
  27. Sometimes records simply do not have enough information to create matching keys. In this case the print record does not have any author information so there is only a title key. The two online titles that cluster have the same uniform title.
28. The print record only has a Title part key, and this is not used for FRBR matching on its own. The electronic records have K2 fields, which are used on their own for matching. Note: the keys used for matching are stored in the p_frbr_keys table.
  29. If the author changes between editions (which often happens with legal works), the keys won’t match and so they do not cluster.
  30. The 9th and 10th editions have a 100 field for “Roberts, Harry”. The 7th and 8th editions have 700 fields for “Cloughton, David” and “Riley, Denis”.
  31. In this example the reason for the records not clustering is not immediately apparent.
  32. Primo can only work with the metadata as found in the records - if this is incorrect, as in this case (non-filing indicator), the records will not cluster.
33. The University of Oxford has had Primo since 2008, but until mid-2011, when we moved to Aleph, there was also a separate OPAC. Moving to “Primo only” meant staff in the libraries started to look more closely at the clustering. As part of a review we trialled (in a test version) turning off clustering. After staff testing and usability testing, the decision was made to go with partial clustering, and there have not been any calls to change this. The normalization rules to exclude clustering make use of fields from the Aleph records, both standard FMT fields and local RTP (Record Type) fields, which are how we identify most of the pre-1830 books. We also have a local “SOL” field that we can use to exclude individual records from clustering.
  34. When we were reviewing clustering at Oxford, we had the default sort order set to “Date newest”. Some of the complaints that people had about clustering were because the specific record they wanted could be hard to find within a cluster. Changing to sorting by “relevance” helped with this. However, there are times when the “relevant” record is unexpected.
  35. In this example, Primo is considering the Spanish translation to be the most relevant and is presenting it as the “top” record in the cluster
36. See the screenshot for an example of two records not deduping (record 1 and record 3) because of the ampersand in the title. Record 1 is the 5th and 3rd editions FRBRized; record 3 is the 2nd and 4th editions FRBRized. Users would expect all of these records to be found under one record.
37. In this example FRBR is behaving correctly; however, the confusion it caused meant we had to change the system's behaviour for our users.
38. Note: setting t=99 is a solution we typically try to avoid, as it risks creating very complicated frbr:t normalization rules. Typically we try instead to edit the cataloguing to prevent dedup.
39. While those of us who work closely with Primo have a concept of FRBR - what it is and how it works in Primo - the majority of library staff and our users do not have this understanding, which makes the display in Primo confusing. Have you tried explaining FRBR to your staff or users? I’ve explained it many times to staff and there is still not a clear understanding of what it is within Primo. To help overcome this confusion, staff at RMIT are working on an online IST (in-service training) module specifically related to dedup and FRBR, to help educate our staff, who can in turn help the users.
40. Sharing how dedup and clustering are presented in the New vs. Classic Primo UI: in some instances the New UI is more user-friendly in the way it presents the records, e.g. deduped print and electronic material appear in the same full record instead of in separate tabs.
41. In the Classic view, “versions” was easily lost in the top right-hand corner of the record; now it is part of the record’s availability. Note: we are thinking of changing the terminology from “versions” to “editions and formats”, in the hope of using terminology that better explains the functionality to our end users.