Successfully reported this slideshow.
Your SlideShare is downloading. ×

Automating Controlled Vocabulary Reconciliation

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 19 Ad

More Related Content

Viewers also liked (20)

Similar to Automating Controlled Vocabulary Reconciliation (20)

Advertisement

Recently uploaded (20)

Automating Controlled Vocabulary Reconciliation

  1. 1. Automating Controlled Vocabulary Reconciliation Anna Neatrour Metadata Librarian Jeremy Myntti Interim Head, Digital Library Services
  2. 2. Summary  Metadata inconsistency  Overview of vendor authority process  Further work with Open Refine  Next steps 2 http://www.utahindians.org
  3. 3. Inconsistency Gosiute Indians Goshute Indians Navajo Indians Navaho Indians Salt Lake Salt Lake City Salt Lake City (Utah) Bishop, Dail Stapley Bishop, Dale Stapely Bishop, Dale Stapley Beckwith, Frank A. (1876-1951) Beckwith, Frank Asahel (1876- 1951) Beckwith, Frank A. Beckwith, Frank A. (1876-1951) Beckwith, Frank Asahel (1876-1951) Beckwith, Frank Asahel, 1876-1951 3 Woven basket or jug; http://content.lib.utah.edu/cdm/ref/collection/ UU_Photo_Archives/id/13887
  4. 4. Project Timeline 4 June-Sept. 2012 – Define project Oct. 2012 – May 2013 – Testing June 2013 – Contracted with Backstage Library Works June 2013-Feb. 2014 – Continued testing Feb.-May 2014 – 17 collections processed June-Aug. 2014 – Manual review (intern) April 2015-today – Explore OpenRefine
  5. 5. Methodology 5 <title>A group of St. George (Sibwit) Paiutes and Wickiups (cedar)</title> <subjec>Paiute Indians; Ute Indians--History; Wickiups; Indians of North America--Dwellings;</subject> <covspa>Utah;</covspa> <descri>A group of people sitting and standing in front of a brush shelter;<descri> <publis>Digitized by: J. Willard Marriott Library, University of Utah;</publis> <type>Image;StillImage;</type> <format>image/jpeg;</format> http://content.lib.utah.edu/cdm/ref/collection/uaida/id/14697
  6. 6. Backstage: statistics and reports 6 Unmatched report Change report
  7. 7. Backstage: standardization Capitalization, Punctuation, and Updated Authorized Access Points Forests and Forestry – Utah forests and forestry -- Utah Forest lands - Utah Forests and forestry--Utah 7 A group of Navajos at Navajo Mountain government school; http://content.lib.utah.edu/cdm/ref/collection/ uaida/id/43551
  8. 8. Backstage: problems encountered Missing MARC tags  Names treated as topical headings and vice versa  Provo => Provisional IRA Data in wrong fields  Date: Price Hiram, 1814- 1901 Incorrect match  Local names matching wrong records  Johnson, Abe is not Johnson, F. T. 8 Walker War Map 1853-1854; http://content.lib.utah.edu/cdm/ref/collection/ uaida/id/15474
  9. 9. Intern review and clean-up 9
  10. 10. OpenRefine project ◦ Used UAIDA as a pilot, since it had the greatest number of unmatched names due to the size of the collection (over 8,000 items) ◦ 529 unmatched names after Backstage process 10 Navajo woman weaving, http://content.lib.utah.edu/cdm/ref/collection/uaida/id/45379
  11. 11. OpenRefine: two approaches  Reconciliation process developed by Jenn Wright and Matt Carruthers, University of Michigan Library, https://github.com/mcarruthers /LCNAF-Named-Entity- Reconciliation  Reconciliation process developed by Roderic Page, http://iphylo.blogspot.com/2 013/04/reconciling-author- names-using-open.html 11 A group of Navajo children and teenagers, http://content.lib.utah.edu/cdm/ref/collection/uaida/id/43285
  12. 12. OpenRefine: differences in results  Both processes found name matches through searching VIAF. ◦ Wright and Carruthers’ process looked for a matching LC authority record in the VIAF cluster  81 records were matched, 132 were false matches, and 312 number had no match ◦ Page’s process matched names to authors in a more general fashion  70 records were matched, 37 were false matches, and 449 had no match. 12
  13. 13. OpenRefine: manual work  Check matches against collection and discard false matches 13
  14. 14. OpenRefine: updating UAIDA  We updated an additional 455 records with updated names.  405 matches were from both processes, 38 were unique to Wright and Carruthers and 5 were matched by the Page process. 14 Eight Hopi Baskets, http://content.lib.utah.edu/cdm/ref/collection/uaida/id/45009
  15. 15. Open Refine: student work  Fall 2015 – student ran additional unmatched items from other collections through OpenRefine with Wright & Carruthers process  Metadata librarian currently reviewing student work and updating collections 15
  16. 16. Next Steps Create local and regional controlled vocabularies 16
  17. 17. Next Steps: Reconcile across more collections  CONTENTdm metadata exported in SOLR  Easier to get list of personal names across all collections  Explore other reconciliation methods 17
  18. 18. Next Steps URIs in Digital Collections Metadata, MWDL (Primo), and DPLA 18 http://content.lib.utah.edu/cdm/ref/ collection/uaida/id/43183
  19. 19. Questions? 19 Anna Neatrour | anna.neatrour@utah.edu Metadata Librarian Jeremy Myntti | jeremy.myntti@utah.edu Interim Head, Digital Library Services Forthcoming article:, Use Existing Data First: Reconcile Metadata Before Creating New Controlled Vocabularies. Journal of Library Metadata. http://dx.doi.org/10.1080/19386389.2015.1099989

Editor's Notes

  • I’m Anna Neatrour, Metadata librarian at the University of Utah Marriott Library. I’m also presenting on behalf of Jeremy Myntti who is Interim Head of Digital Library Services.
  • I’ll provide an overview of the metadata problems we had, what the vendor supplied authority service did, go over further reconciliation work, and talk about our plans for the future. Throughout this presentation you’ll see examples of items from the Utah American Indian Digital Archive, which was one of the collections we processed.

  • We’ve been creating digital library collections for a long time. Our existing records had a great deal of inconsistency both for personal names and subjects. Having six different ways of expressing the name of one person is really bad, and leads to problems for users in faceting and discovery.
  • Project started in 2012, and we’re just moving on to a different stage of it today. 17 of our collections were processed by Backstage Library Works, and we did additional review and post processing.
  • Processing digital collections XML files similar to existing processes developed at Backstage for MARC records for names and subjects. Had previously done other projects at our library where we changed raw CONTENTdm metadata.

    Backstage did their automated authority control on the CONTENTdm desc.all file, which is the way metadata is stored internally in the system. Different from the file you get if you export metadata as a collection manager in CONTENTdm.
    If you have hosted contentdm you cannot do this.

    Skip-----------

    Extracting data from CONTENTdm
    Stop updates to collection and make it read-only
    Make copy of desc.all metadata file for backup.
    Run desc.all file through AC processing from Backstage
    Replace desc.all file on CONTENTdm server
    Run the full collection index
    Remove read-only status from collection

  • Get reports on matches, no matches, and changes report back from Backstage.

    In matched headings report, it also included URIs for id.loc.gov and VIAF in some cases, so we have URIs for items we can then use if we want to express our information as linked data in the future.


    Skip-----------------

    For UAIDA:
    Creator/Contributor names (7033)
    10% changed (669)
    48% matched 3342)

    Subjects (98931)
    21% changed (21072)
    76% matched (75471)
  • In addition to matching authorities, Backstage helped with fixing punctuation problems. Here’s an example of three poorly punctuated subject heading variants that got fixed.

    Skip-------

    Space double dash space
    Word em dash
    Single dash
    Convert all to double dash
    Each collection may be different, so need to watch out for which ways to standardize (single dash may not need to be converted in some collections)

    Capitalizing every word
    Capitalizing nothing
    Convert to the correct capitalization according to LCSH

    Older forms of names used (pre-RDA)
    Cross references used rather than authorized access point
    This happened a lot because of training issues. Many students for a few years didn’t realize the correct form of an access point that should be used from the 1xx in the auth

    “access points” = heading
  • Backstage makes certain assumptions about the headings
    Part of that is shunting data into a topical subject heading
    as opposed to a geographic or personal or corporate name headings

    See sample of matching issues on this slide:

    -skip-

    So “Provo” goes in as a 650 field; our system performs the match as a
    Subject heading; so we find an authority where the 110 field is:
    Provisional IRA
    And the 410 field is:
    Provo

    With “Cars” we searched this as a generic name heading, and lopped off
    The date of 2002, so we found the conference heading instead

    We found that we need to be careful with single-word headings in CDM
    In fact, our recommendation might be to not search those at all or,
    if we do, then to report them rather than update them
  • Backstage made 85,000 changes in our digital collections. There were 200 problems in the data that they changed that were fixed after review
    Need for lots of manual review for some collections
    Intern reviewed collections for 3 months – fixed nearly 2000 mistakes (mostly from metadata rather than Authority Control process

    Process pointed out need for training to encourage more consistency. (use correct access points, standardize punctuation, spacing, capitalization, date formats, etc., field usage, NACO or local auths?
  • Used Utah American Indian Digital Archive first since it was a large collection. Wanted to do further reconciliation for the names Backstage wasn’t able to match and see if we could enhance the collections even more.
  • We tested out two different approaches to matching personal names to see what would work best.
  • Found better results with matching just against LC name authorities, as would be expected since the collection we were working on was a regional Utah collection.
  • We have no Carlos Santana materials in this collection! Manually reviewed matches in Open Refine.
  • Updated the desc.all file locally and reindexed the collection to get these changes live.

    At the end of this we combined all the matched names, and did a further update of the Utah American Indian Digital Archive desc.all collection.

    Skip-------

    Wright and Carruthers process had 262 undo/redo actions in the open refine project
    Page’s reconciliation process resulted in 424 undo/redo actions
  • Updating manually if only a few changes, but we can also script against the desc.all as we did with UAIDA if any of the other collections have extensive updates.
  • Create NACO records for notable people
    Investigate local controlled vocabularies for more regional personal names
  • A few weeks ago, our metadata for our CONTENTdm collections was dumped into SOLR.
    Expect it will be much easier to work with.
    Building unified list of creators and contributors.
    Want to explore additional means of reconciliation.
  • Put URIs in digital collections metadata, test how they appear in our repository, in MWDL, and DPLA.
    We have been putting geonames URIs in our metadata, but not personal name URIs yet.
    Could easily add the URIs at a future date when our repository can do more with them.
    Have several collections now where we have confidence in our metadata being cleaned up and matched to authorities, so this is a great base as we want to explore using Linked Data more.
  • We have an article coming out in the Journal of Library Metadata that goes into greater detail about what I presented here today. Happy to take your questions.

×