Bowen: Assistant Dean for Information Management Services, chair of the XC organization
Hickey: Chief Scientist with OCLC, instrumental in FRBR work
Salaba: Associate Professor, served on the IFLA FRBR working group
Zhang: Professor, worked with Salaba on an IMLS-funded FRBR project; published Implementing FRBR in Libraries with Salaba (2009)
See the handout for more information.
This session is grounded in the wide range of work done by the panelists and by others not at this session:
Dramatic changes in library metadata, LC's commitment to RDA, and a new data model (LC Bibliographic Framework Transition update – Marriott Grand Salon AC, 10:30)
OCLC's and XC's work on new discovery and metadata management platforms
Research at universities like Kent State and development of international vocabularies (VIAF)
Exploratory activities by the Digital Public Library of America
Central questions of this session:
What role does FRBR play in these organizations and activities?
How does FRBR fit with current and future work in library metadata?
Additional question: how important is this data model as our notions of authorship, document versions, and publishing platforms change?
We want to start this session with a quick exploration of some of the metadata issues that libraries encounter as we explore new models, including FRBR and linked open data (LoD).
We explored metadata quality issues that arose when applying the FRBR model to a selected set of records.
The focus of this research is on getting a grasp on the 'real life' of metadata in libraries.
Our research had two primary questions:
First: What metadata quality problems arise in the application of FRBRization algorithms?
Second: How do computational and expert approaches compare with regard to FRBRization?
The second question is very large; we expect that our other panelists will have some interesting perspectives on it.
Let's start with a quick reintroduction of the FRBR entities.
Our focus is on Group 1 entities, and really on work-set generation.
We found that identification of work sets was the most intensive task; identification of expressions and manifestations within a work set was more manageable.
Introduce the hierarchy. Talk about some interesting issues with FRBR, e.g.:
Highly theoretical
Code-neutral – not cataloging rules
System-neutral
Good for designing information systems
Collates results – groups like records (FRBR book 5)
Question: Is FRBR a discovery-only grouping or a permanent metadata model shift? This question has implications in the linked data world – as we turn to new bibliographic standards (e.g., LC's Bibliographic Framework), is FRBR the right direction, and (our question) what metadata issues enable or complicate this?
We were coming from a metadata quality perspective and wondered what positive and negative factors might impact FRBRization. If FRBR is only a discovery service, then these issues are not as significant; but if the library world is moving to a new metadata storage model that uses FRBR relationships, then these metadata problems may be very significant.
Our methodology focused on understanding metadata quality by comparing expert manual and automated approaches.
We quantified differences in outcomes and explored qualitative differences in the metadata records.
1. Why Mark Twain? We needed a limited data set to enable expert comparison via a book-in-hand approach.
Describe the worksheet.
Variation in spreadsheet metadata made it difficult to group work sets.
Turned to a book-in-hand approach for 5 titles (approx. 100 books): title page (recto and verso), table of contents, text analysis, introductions.
Grouped records into work sets using metadata; created a matrix of unique identifiers.
Tell the story of metadata discovered in introductions – information not in the MARC record.
My review found 410 records grouped into 147 work sets with two or more expressions, plus 420 records in work sets with only one expression.
The numbers don't add up to 848 because we did not review lost, missing, or checked-out items.
The largest work sets were The Adventures of Huckleberry Finn and The Adventures of Tom Sawyer.
Review the metadata used.
These are three records from our set.
These records are for a critical edition of Huckleberry Finn – a book with critical essays along with the Huckleberry Finn text.
Issue 1 – This points to the need to represent whole/part or manifestation-to-multiple-work relationships, but there is no metadata across all three records that helps do this (a 240 field appears in only one record).
Issue 2 – Metadata inconsistency (a 240 in one record but not the others) led to grouping differences depending on approach – differences in record length and composition; only one record mentions containing the full text of Huckleberry Finn; invalid unique identifiers in record 1 (Dyn number).
For example, The Adventures of Huckleberry Finn has been rewritten in many ways: abridgements, additions (the raft chapter), changing voice (Huck Finn over Tom Sawyer).
Neider made the most extensive edit: he cut 8,300 words and added 5,700.
IFLA says: "Revisions, abridgements, enlargements, and translations of earlier text are viewed as expressions of the same work. But, a work that has been modified with a significant degree of independent intellectual or artistic effort is viewed as a new work."
Is this book in particular a new expression or a wholly new work? – especially when Neider says he wants to bring out the voice of Huck Finn and subdue Tom Sawyer.
Clearly, this is not in the MARC record.
Erik picks up.
With the expert review under our belt, we selected a specific automated approach to identify work sets.
We used the FRBR work-set algorithm by Hickey & Toves (2009) as our foundation.
It uses four keys, the first of which is a combination of author/title.
Side note – in our record set there were few uniform titles, and we found that OCLC number matches were very low, so we focused on the first key.
In direct string matching we found that small issues such as extra spaces, punctuation, or inclusion of extra words were considerable problems (think about all of the titles of Huckleberry Finn).
We were looking for a looser method by which we could compare these strings.
Inspired by Google Refine and the work of Seth van Hooland.
Worked with key collision to normalize strings.
Used Levenshtein edit distance algorithms in Google Refine and found that this helped group titles into work sets.
Expanded on this with custom Python scripts that implemented key collision and Levenshtein distance for more comprehensive analysis.
Study 1: Compared the OCLC FRBR key against a key-collision version using edit distance (found that the key-collision version was significantly better – ASIS&T paper).
Study 2: Evaluating metadata quality through expert and automatic comparison – the focus of our talk today.
Based on OCLC's FRBR work-set algorithm (Hickey and Toves, 2009); implemented with Google Refine and Python.
Merged keys with the cataloger's spreadsheet and applied key collision to 3 of the 5 keys:
Convert alphabetic characters to lower case
Remove all non-alphanumeric characters
Sort words alphanumerically
Remove duplicate words
Compared differences between keys of associated work sets using the SequenceMatcher class of Python's difflib library and the ratio method of the Levenshtein Python library.
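The key-collision steps above can be sketched in a few lines of Python (the function name and example titles are ours, not from the study's scripts):

```python
import re

def key_collision(text):
    """Normalize a title/author string into a collision key:
    lowercase, strip non-alphanumeric characters, then sort the
    remaining words and drop duplicates."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)   # remove punctuation, etc.
    words = sorted(set(text.split()))          # sort + dedupe
    return " ".join(words)

# Differently punctuated and encoded titles collapse to the same key
print(key_collision("Adventures of Huckleberry Finn :"))   # adventures finn huckleberry of
print(key_collision("ADVENTURES, of Huckleberry Finn"))    # adventures finn huckleberry of
```

Because the words are sorted and deduplicated, the key is insensitive to word order, punctuation, and repeated terms – which is exactly why it tolerates the title variation described above.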
The focus of our work is on work-set identification.
We found that because our selected records were largely similar with regard to form and publishing venue, we did not get a good distribution of expression and manifestation data.
Two findings were interesting:
E-books had repurposed records that lacked updated leaders.
The print-book assumption is valid for our data but could be problematic for more diverse data sets.
Our goal in using similarity matching was to improve our automated matches.
Other research on similarity scores for larger texts indicates that a similarity score greater than 0.6 shows high overlap.
We found that because we were dealing with much shorter strings, this was not the case.
We also found that variation in titles (some long, some short) and in how they were encoded produced some very low similarity scores for true matches.
This chart shows the distribution of similarity scores for all records for two keys – 245a and our heavily normalized key. As you can see, a processed score pushes similarity up – this is not surprising. The surprising aspect: more fields are introduced in the FRBR key.
Average length differences show the effect of longer strings (245a average length difference was 17.09; FRBR key average length difference was 40.14).
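The similarity scores discussed here can be computed with Python's standard-library difflib (a sketch of one of the two methods the study used; the other was the third-party Levenshtein library's ratio method):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Identical strings score 1.0; short titles that differ by only a
# word or two can still fall well below the 0.6 threshold reported
# for longer texts, because each differing character weighs more.
print(similarity("a dogs tale", "a dogs tale"))    # 1.0
print(similarity("a dogs tale", "a horses tale"))  # well below 1.0
```

This length sensitivity is why thresholds calibrated on full-text comparison do not transfer directly to short MARC title strings.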
This chart shows the impact of string normalization – we were able to boost many of the results from above 0.8 up to 1.0.
Incidentally, we also found a very high overlap between expert and automated techniques for 100% matches (515 for expert, 521 for automated).
The automated process found relationships that the cataloger had questions about (Huckleberry Finn differences).
Around 80% overlap, however, we found false positives – the process became a needle-in-a-haystack problem.
In fact, the average similarity score for expert-matched titles using the OCLC key was 0.909 (as opposed to 0.884 for 245ab and 0.686 for the entire title), but this hides the real distribution and the associated metadata problems.
For example, 0.465 – this is the same work, but differences in how the records were cataloged (the 245ab fields) prevented detection.
On the other hand, the 0.909 pair shows that even a high similarity score cannot always be trusted – high confidence, but not the same titles.
Similarity is interesting as a confidence level; this differs from full-text comparison because we have far fewer words to work with.
I know we are short on time – a few of our high-level findings:
Metadata quality is an important issue, but syntax and ISBD adherence are less important than we thought.
Anecdotal observations of record length and the lack of description for complex work relationships were significant.
The application of the FRBR model (or any new relationship model) is expensive and requires some advanced techniques – the questions of permanence and confidence were central for us (e.g., do you do this once? Do you trust it implicitly? What is the hybrid approach?)
Carolyn: 30 hours. Computer: 5 minutes.
Scale is an issue: 848 records turn into roughly 350,000 comparisons.
Pre-clustering and canopying address this – a common problem in computer science.
Advances in linked data models mean that FRBRization may simply be part of the interesting set of data used. I expect our panelists might have some thoughts on what role these FRBR relationships might play in record encoding and use in new metadata structures.
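The "roughly 350,000" figure is just the number of unordered record pairs in an exhaustive all-against-all comparison, n(n-1)/2 for n records – a quick check:

```python
from math import comb

n = 848                 # records in the Mark Twain set
pairs = comb(n, 2)      # n * (n - 1) / 2 unordered pairs
print(pairs)            # 359128, i.e. roughly 350,000 comparisons
```

Quadratic growth in pairs is exactly why pre-clustering or canopying is needed before pairwise similarity scoring at larger scales.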
FRBR indicates a hierarchical view – we found that relationships can be more complex, and a node view can help.
We found that the lack of metadata consistency was solvable with enough processing, but missing metadata made it impossible to determine some relationships.
It seems somewhat clear that we are still creating metadata with specific uses in mind – what happens with new formats and new publishing models?
Is FRBR worth it?
Current Research on and Use of FRBR in Libraries
Current Research on and Use of FRBR in Libraries
ALA Annual Conference 2012, June 24, 8:00 am
#alctsan12
Panelists
Jennifer Bowen – University of Rochester, River Campus Libraries
Thomas Hickey – OCLC
Carolyn McCallum – Wake Forest University, Z. Smith Reynolds Library
Erik Mitchell – University of Maryland iSchool
Athena Salaba – Kent State University, School of Library and Information Science
Yin Zhang – Kent State University, School of Library and Information Science
FRBRizing Mark Twain
Carolyn McCallum, Wake Forest University
Dr. Erik Mitchell, University of Maryland iSchool
ALA Annual Conference, June 24, 2012
Metadata Quality and FRBRization of MARC
What metadata quality issues are of concern in the migration of metadata to new models (e.g., FRBR)?
How do computational techniques compare to expert cataloger FRBR application?
Group 1 Entities
Work: a distinct intellectual or artistic creation; abstract.
Expression: the intellectual or artistic realization of a work; abstract.
Manifestation: the physical embodiment of an expression of a work.
Item: a single exemplar of a manifestation.
Group 1 Entities Hierarchy
A work is realized through an expression, which in turn is embodied in a manifestation, which is exemplified by an item.
FRBR
In 1998, Functional Requirements for Bibliographic Records: Final Report was published by IFLA (International Federation of Library Associations and Institutions).
A new conceptual model to represent the "bibliographic universe", based on an entity-relationship model.
User tasks are identified (find, identify, select, and obtain).
Comprised of 10 entities, arranged into 3 groups.
Research Methodology
Use a selected author's MARC records (Mark Twain)
Have an expert cataloger assess work-set relationships
Compare the expert assessment with computational FRBR algorithms
Record Evaluation
848 MARC records for Mark Twain (1835–1910) from ZSR Library's catalog (PS1300–1348)
Pseudonyms
Multiple publications by and about him
These create complex work-set relationships and reveal FRBRization issues
Cataloger Assessment
[Worksheet excerpt; columns: ID, Notes, Worksets, Title, FRBR Key. Two example rows:]
Row 1 – ID 77491 (CIRC – checked out). Worksets: 4700, 77491. Title: =245 10$aAdventures of Huckleberry Finn :$bincluding the omitted, long, brilliant raft chapter, with the final "Tom Sawyer" section, abridged /$cby Mark Twain ; edited with an introduction by Charles Neider. FRBR key: twain, mark1835 1910/ventures of huckleberry finn :including the omitted, long, brilliant raft chapter, with the final "Tom Sawyer" section, abridged /
Row 2 – ID 14169 (RARE, PS1305 A1 1885). Notes: green cloth binding w/black & gold stamping; published by Charles L. Webster and Company; 1st ed. w/o cancels, i.e. 3rd state of t.p., 4th state p. 293; "With One Hundred and Seventy-four Illustrations" on t.p.; frontispiece includes plate of Mark Twain's bust w/his signature; opposite frontispiece one Huck Finn w/shotgun and rabbit; has TEYTHF 104424. Worksets: 56932, 103713, 209036, 14123, 14225, 17328, 219513, 12581, 20431, 285512, 1226354, 213068, 14169, 13933, 33902, 2128369, 2128371, 2128372, 2128373, 2128546, 2129785, 2079693, 2128352. Title: =245 10$aAdventures of Huckleberry Finn (Tom Sawyers comrade) ...$cby Mark Twain [pseud.] with one hundred and seventy-four illustrations. FRBR key: twain, mark1835 1910/ventures of huckleberry finn (tom sawyers comrade)
Expert Cataloger Assessment
Expert cataloger review:
410 records – 147 total work-sets with 2 or more expressions
420 work-sets with only one expression
Largest work-sets:
The Adventures of Huckleberry Finn (26 records)
The Adventures of Tom Sawyer (14 records)
Metadata most helpful in grouping work-sets: title, author, and the combination of both
Rereading the Work and Expression sections of IFLA's FRBR report
Key Issue – Determining Work Boundaries
Expressions: revisions, abridgements, enlargements, translations
Works: rewritings; adaptations for children or from one literary or art form to another; parodies; abstracts; digests; summaries
Identifying Works Using Title/Author
OCLC FRBR work-set keys (Hickey and Toves, 2009):
twain, mark | 1835 1910/adventures of huckleberry finn :including the omitted, long, brilliant raft chapter, with the final "Tom Sawyer" section, abridged /
Other approaches:
Unique record identifiers
Other descriptive metadata fields
A field for cataloger assignment of work-sets
Evaluating Similarity in Works
Score 0.465
  Key 1: 1835 1910 albert an and appreciation bigelow by dean howells introduction mark paine speeches twain twains william with
  Key 2: 1835 1910 mark speeches twain twains
Score 0.706
  Key 1: 1835 1910 adventures ago comrade fifty finn forty huckleberry mark mississippi of sawyers scene the time to tom twain valley years
  Key 2: 1835 1910 adventures comrade finn huckleberry mark of sawyers tom twain
Score 0.902
  Key 1: 1601 1835 1910 as by conversation date fireside in it mark of social the time tudors twain was
  Key 2: 1601 1835 1910 as at conversation fireside in it mark of or social the time tudors twain twains was
Score 0.909
  Key 1: 1835 1910 a dogs mark tale twain
  Key 2: 1835 1910 a horses mark tale twain
Score 0.909
  Key 1: 1835 1910 a drama mark sawyer tom twain
  Key 2: 1835 1910 abroad mark sawyer tom twain
Metadata Quality Assessment
Main quality issue: lack of description for complex relationships
Model application is expensive in both computer and expert effort
New data models (LoD) are changing our view of MARC metadata