Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian

In the January 1994 issue of The Cornell Veterinarian editor Maurice E. White wrote:

THIS is the last issue of "The Cornell Veterinarian". The "Cornell Vet" has a proud history, dating back to June, 1911... (p.1)

This presentation will describe Cornell University Library efforts to provide an "afterlife" to The Cornell Veterinarian by leveraging a number of disparate initiatives and metadata sources. While attempting to build article-level linking to full text in HathiTrust (functionality currently unavailable), the library uncovered limitations in the metadata captured during the scanning process. The speaker will delineate these metadata findings and provide strategies (some scalable, others highly labor intensive) for gathering the metadata needed to create direct links to articles found in HathiTrust.

Presenter:
Steven Folsom
Cornell University
Steven Folsom is a metadata librarian overseeing the creation and management of metadata for various Cornell University Library digital platforms. He strategizes on the integration of metadata across systems with the ultimate goal of improving discovery and access of information resources.

Usage Rights

CC Attribution-NonCommercial-NoDerivs License

Presentation Transcript

  • Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian Photo credit: http://www.walls.com/ Steven Folsom, NASIG Annual Conference 2014
  • Cornell Library Digital Consulting and Production Services
      • A single point of service for those wishing to create digital collections
      • A virtual group that spans multiple departments within the Library (Digital Scholarship and Preservation Services, Cornell Library IT, and Metadata Librarians from Library Technical Services)
      • Approaches digital collection building holistically, and addresses the entire life cycle management of a project
  • The Cornell Veterinarian Project Participants
    Client:
      • Cornell Flower-Sprecher Veterinary Library
    DCAPS Involvement:
      • Jaron Porciello, Digital Scholarship Initiatives Coordinator
      • Michelle Paolillo, Project Manager/Business Analyst (CUL’s HathiTrust Liaison)
      • John Cline, Cornell Library Programmer
      • Steven Folsom, Metadata Librarian
  • HathiTrust Digital Library
      • Digital library consisting of content from the Google Books project, Internet Archive digitization initiatives, and content digitized locally by libraries
      • Committed to preserving content with stable access and distributed/coordinated cost of storage
      • Centralized technical framework that allows for the creation of tools and services
  • The Cornell Veterinarian
  • The Challenge
  • Hathi Volume Interface
  • Google Books: Contributions from Cornell Library
      • Participation in the Google Books Library Project since 2008
      • Google focuses on materials that they have not already digitized
      • Using OCLC holdings information, they compose a Cornell candidate list
  • HathiTrust Data API
  • Hathi METS File
  • METS File Continued
  • Hathifiles
      • Tab-delimited full files of the Hathi Digital Library, plus incremental updates (the full file is currently over 2.5 GB uncompressed)
      • Light bibliographic data
      • Includes some administrative metadata, e.g. rights information and the originating institution for the scanned copy
  • Select Hathifile Record Elements
      • Hathi Volume ID: mdp.39015076694507
      • Access: allow [Notes on mapping for rights attributes where contextual user data would affect access]
      • Rights: pd [public domain]
      • HathiTrust record number: 000529434
      • Enumeration/Chronology: v.33 no.11 1900
      • Source: MIU
      • Title: The Chicago medical times
      • OCLC number: 1554176
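Because a Hathifile is plain tab-delimited text, the rows for a single title can be pulled out with very little code. The following is a minimal sketch, not the project's actual script. The column names and their order in `ASSUMED_COLUMNS` are assumptions made for illustration (check them against the field list published with the Hathifiles), `volumes_for_record` is a hypothetical helper, and the sample row simply reuses the values from the record above.

```python
import csv
import io

# Hypothetical column layout: these names and their order are assumptions
# for illustration; verify them against the published Hathifiles field list.
ASSUMED_COLUMNS = ["volume_id", "access", "rights", "record_number",
                   "enum_chron", "source", "title", "oclc_number"]


def volumes_for_record(lines, record_number, columns=ASSUMED_COLUMNS):
    """Yield one dict per tab-delimited row whose record number matches."""
    # QUOTE_NONE keeps quotation marks inside titles from confusing the parser.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(columns, row))
        if record.get("record_number") == record_number:
            yield record


# Sample row built from the record elements listed on the slide.
sample = io.StringIO(
    "mdp.39015076694507\tallow\tpd\t000529434\tv.33 no.11 1900\tMIU\t"
    "The Chicago medical times\t1554176\n"
)
matches = list(volumes_for_record(sample, "000529434"))
```

Because the function is a generator over any iterable of lines, a multi-gigabyte file can be streamed from an open file handle rather than loaded into memory.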
  • HathiTrust Bibliographic API
      • Meant for retrieving information about small numbers of items at a time
      • Returns bibliographic, rights, and volume information when given one or more standard identifiers (ISBN, LCCN, OCLC, etc.); overlaps with the Hathifile data
      • Brief example: http://catalog.hathitrust.org/api/volumes/brief/oclc/424023.json
      • Full example: http://catalog.hathitrust.org/api/volumes/full/oclc/424023.json
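The two example URLs follow one pattern: level (brief or full), identifier type, identifier value, then `.json`. A minimal sketch of building and requesting such a URL is shown below. The function names `bib_api_url` and `fetch_record` are hypothetical, the URL pattern is inferred only from the two examples above, and nothing is assumed here about the shape of the JSON that comes back.

```python
import json
import urllib.request

# Base path taken from the two example URLs on the slide.
BASE = "http://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, value, level="brief"):
    """Build a Bibliographic API URL such as .../brief/oclc/424023.json."""
    if level not in ("brief", "full"):
        raise ValueError("level must be 'brief' or 'full'")
    return f"{BASE}/{level}/{id_type}/{value}.json"


def fetch_record(id_type, value, level="brief"):
    """Request one record and return the decoded JSON (needs network access)."""
    with urllib.request.urlopen(bib_api_url(id_type, value, level)) as resp:
        return json.load(resp)


brief_url = bib_api_url("oclc", "424023")
full_url = bib_api_url("oclc", "424023", level="full")
```

Since the API is meant for small numbers of items, a loop over many identifiers would be better served by the Hathifiles.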
  • Hathi Metadata Recap
      • Administrative data about scans and corresponding volumes
      • Uses Hathi IDs to link to bibliographic data
      • Bulk bibliographic data
      • Some administrative data, e.g. rights information
      • Small requests for bibliographic data retrieved using standard identifiers (ISBN, LCCN, OCLC…)
  • What we thought was the solution…
      • Use the Hathi Data API to find the table of contents for each volume
      • Gather the related OCR
      • Parse out article citation values from the OCR (hopefully in a mostly automated way)
      • Use the pagination data from the table of contents to build links by mapping to pagination in the METS files
      • What couldn’t be automated would be done manually (with the projected outcome being a citation index with Hathi URLs that could be used to build an interface or given to an index like PubMed)
  • Reality set in… Photo credit: ehive.com
  • HathiTrust OCR
  • The metadata continued to fight back… Photo credit: http://glpiggy.net/
  • PubMed Indexing and API
  • A Path for Automation
    For each citation already in PubMed for which HathiTrust has one volume:
      1. Search the PubMed <Volume> and the Hathi catalog ID for The Cornell Veterinarian (000535347) against the Hathifile to get the corresponding Hathi object ID used in the METS.
      2. Use the METS object ID and the PubMed start page (the numeric value before the "-" in each PubMed article citation) to find the matching <ORDERLABEL>, and from it the <Order> number, in the METS file.
      3. Create the URL to be added to the PubMed XML from the Hathi METS object ID and the <Order> number. The METS ID is the id in the URL and the <Order> number is the sequence (seq) number, e.g. http://babel.hathitrust.org/cgi/pt?id=coo.31924051143075;view=1up;seq=11
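Steps 2 and 3 above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the project's code: it assumes the page order and printed page label appear as `ORDER` and `ORDERLABEL` attributes on `div` elements in the METS structMap (as the METS schema defines them), all function names are hypothetical, and `SAMPLE_METS` is an invented three-page fragment rather than real HathiTrust data.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

# Invented fragment for illustration only; a real METS file is far larger.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="physical">
    <METS:div TYPE="volume">
      <METS:div ORDER="10" TYPE="page"/>
      <METS:div ORDER="11" ORDERLABEL="1" TYPE="page"/>
      <METS:div ORDER="12" ORDERLABEL="2" TYPE="page"/>
    </METS:div>
  </METS:structMap>
</METS:mets>"""


def page_to_seq(mets_xml):
    """Map each printed page label (ORDERLABEL) to its scan sequence (ORDER)."""
    mapping = {}
    for div in ET.fromstring(mets_xml).iter(METS_NS + "div"):
        label, order = div.get("ORDERLABEL"), div.get("ORDER")
        if label and order:
            # Keep the first scan seen for a label if labels repeat.
            mapping.setdefault(label, int(order))
    return mapping


def start_page(pagination):
    """Return the value before the '-' in a PubMed page range like '1-9'."""
    return pagination.split("-", 1)[0].strip()


def hathi_url(object_id, seq):
    """Build the page-level URL in the form shown on the slide."""
    return f"http://babel.hathitrust.org/cgi/pt?id={object_id};view=1up;seq={seq}"


def link_for_citation(object_id, mets_xml, pagination):
    """Return the article's URL, or None when the start page has no label."""
    seq = page_to_seq(mets_xml).get(start_page(pagination))
    return hathi_url(object_id, seq) if seq is not None else None


url = link_for_citation("coo.31924051143075", SAMPLE_METS, "1-9")
```

Returning `None` instead of guessing a sequence number makes the unmatched citations easy to collect for manual follow-up.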
  • NCBI’s LinkOut Program
      • A service that allows third parties to link specific NCBI database records to relevant web-accessible resources
      • The relevant journal/publication must already have gone through the Medline selection process
      • Document Type Definition (DTD) for contributing links in XML
  • PubMed Citation Data Requirements
      • The PubMed DTD specifies how the data should be formatted
      • Data tags (R = Required, O = Optional, O/R = Optional or Required). Required tags must be included; optional tags must be included only if the data requested appears in the print or electronic article. Optional or Required tags are dependent on the use of other tags
      • Tag names are case sensitive
  • PubMed Citation Data Elements: File Header (R), ArticleSet (R), Article (R), Journal (R), PublisherName (R), JournalTitle (R), Issn (R), Volume (O/R), Issue (O/R), PubDate (R), Year (R), Month (O/R), Season (O), Day (O), Replaces (O), ArticleTitle (O), VernacularTitle (O), FirstPage (O/R), LastPage (O), ELocationID (O/R), Language (O), AuthorList (O/R), Author (R), FirstName (O/R), MiddleName (O), LastName (O/R), Suffix (O), CollectiveName (O), Affiliation (O), Identifier (O), GroupList (O/R), Group (R), GroupName (R), IndividualName (O), PublicationType (O), ArticleIdList (O/R), ArticleId (R), History (O), Abstract (O), OtherAbstract (O), CopyrightInformation (O), ObjectList (O), Object (O), Param (O)
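A small builder shows how a handful of these tags might be assembled into citation XML. This is a sketch under assumptions: the nesting (journal-level tags and PubDate inside Journal, article-level tags directly inside Article) is inferred from the order of the element list and should be checked against the PubMed DTD, the required file header is not emitted, `build_article_set` is a hypothetical helper, and the publisher name, ISSN, and article values are placeholders rather than real data.

```python
import xml.etree.ElementTree as ET


def build_article_set(journal, articles):
    """Return an ArticleSet XML string using a subset of the listed tags."""
    article_set = ET.Element("ArticleSet")
    for art in articles:
        article = ET.SubElement(article_set, "Article")
        # Assumed nesting: journal-level tags live inside <Journal>.
        jrnl = ET.SubElement(article, "Journal")
        for tag in ("PublisherName", "JournalTitle", "Issn"):
            ET.SubElement(jrnl, tag).text = journal[tag]
        ET.SubElement(jrnl, "Volume").text = art["Volume"]
        pub_date = ET.SubElement(jrnl, "PubDate")
        ET.SubElement(pub_date, "Year").text = art["Year"]
        # Article-level tags; names are case sensitive.
        ET.SubElement(article, "ArticleTitle").text = art["ArticleTitle"]
        ET.SubElement(article, "FirstPage").text = art["FirstPage"]
    return ET.tostring(article_set, encoding="unicode")


# Placeholder values for illustration only, not real citation data.
journal_info = {"PublisherName": "Example Publisher",
                "JournalTitle": "The Cornell Veterinarian",
                "Issn": "0000-0000"}
xml_text = build_article_set(
    journal_info,
    [{"Volume": "1", "Year": "1990",
      "ArticleTitle": "Example article", "FirstPage": "1"}],
)
```

Building the tree with `ElementTree` rather than string concatenation ensures that characters such as `&` in article titles are escaped correctly.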
  • In an Ideal World… Photo credit: http://www.priefert.com/
  • The metadata that got away…
      • Pre-1945 issues not indexed by PubMed
      • Supplemental volumes*
    What we hope to do about it:
      • Manually capture the Hathi URLs for the supplemental volumes and provide them to PubMed using their linking format
      • Manually capture citation data for pre-1945 articles using the OCR files, and send it to PubMed using their indexing format
  • Project Outcomes
    Soft:
      • Better understanding of what’s possible with the Hathi APIs
      • Better understanding of PubMed’s metadata/URL contribution requirements
      • Increased desire within the Cornell Library to consider greater return on our HathiTrust investment
    Concrete:
      • The Cornell Veterinarian should soon be available via PubMed for the years already indexed
      • Manually capturing the complete backfile for The Cornell Veterinarian to contribute to PubMed
  • Future Considerations
      • Potential for improved access to other titles currently lacking full-text linking in PubMed [if in HathiTrust]
      • Investigations into other (non-)full-text indexes and full-text repositories
      • New services for interacting with the HathiTrust Digital Library
      • Potential improvements to the Hathi workflows
  • Questions? Photo credit: ehive.com