Wrangling Metadata from
HathiTrust and PubMed:
Providing Full-text Linking to The Cornell Veterinarian
Steven Folsom, NASIG Annual Conference 2014

In the January 1994 issue of The Cornell Veterinarian editor Maurice E. White wrote:

THIS is the last issue of "The Cornell Veterinarian". The "Cornell Vet" has a proud history, dating back to June, 1911... (p.1)

This presentation will describe Cornell University Library efforts to provide an "afterlife" to The Cornell Veterinarian by leveraging a number of disparate initiatives and metadata sources. While attempting to build article level linking to full-text in HathiTrust (functionality currently unavailable), limitations in the metadata captured during the scanning process were uncovered. The speaker will delineate these metadata findings and provide strategies (some scalable, others highly labor intensive) for gathering the necessary metadata for creating direct links to articles found in HathiTrust.

Presenter:
Steven Folsom
Cornell University
Steven Folsom is a metadata librarian overseeing the creation and management of metadata for various Cornell University Library digital platforms. He strategizes on the integration of metadata across systems with the ultimate goal of improving discovery and access of information resources.

  1. Wrangling Metadata from HathiTrust and PubMed: Providing Full-text Linking to The Cornell Veterinarian
     Steven Folsom, NASIG Annual Conference 2014
     Photo credit: http://www.walls.com/
  2. Cornell Library Digital Consulting and Production Services
     • A single point of service for those wishing to create digital collections
     • A virtual group that spans multiple departments within the Library (Digital Scholarship and Preservation Services, Cornell Library IT, and Metadata Librarians from Library Technical Services)
     • Approaches digital collection building holistically and addresses the entire life cycle management of a project
  3. The Cornell Veterinarian Project
     Participants
     Client:
     • Cornell Flower-Sprecher Veterinary Library
     DCAPS Involvement:
     • Jaron Porciello, Digital Scholarship Initiatives Coordinator
     • Michelle Paolillo, Project Manager/Business Analyst (CUL’s HathiTrust Liaison)
     • John Cline, Cornell Library Programmer
     • Steven Folsom, Metadata Librarian
  4. HathiTrust Digital Library
     • Digital library consisting of the Google Books project, Internet Archive digitization initiatives, and content digitized locally by libraries
     • Committed to preserving content with stable access and distributed/coordinated cost of storage
     • Centralized technical framework that allows for the creation of tools and services
  5. The Cornell Veterinarian
  6. The Challenge
  7. Hathi Volume Interface
  8. Google Books: Contributions from Cornell Library
     • Participation in the Google Books Library Project since 2008
     • Google focuses on materials that they have not already digitized
     • Using OCLC holdings information, they compose a Cornell candidate list
  9. HathiTrust Data API
  10. Hathi METS File
  11. METS File Continued
  12. Hathifiles
     • Tab-delimited full files of the Hathi Digital Library and incremental updates (the full file is currently over 2.5 GB uncompressed)
     • Light bibliographic data
     • Includes some administrative metadata, e.g. rights information and the originating institution for the scanned copy
  13. Select Hathifile Record Elements
     • Hathi Volume ID: mdp.39015076694507
     • Access: allow [Notes on mapping for rights attributes where contextual user data would affect access]
     • Rights: pd [public domain]
     • HathiTrust record number: 000529434
     • Enumeration/Chronology: v.33 no.11 1900
     • Source: MIU
     • Title: The Chicago medical times
     • OCLC number: 1554176
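
Because the full Hathifile is large and tab-delimited, it lends itself to line-by-line streaming rather than loading into memory. The following is a minimal sketch of that idea, not the project's actual code. The field names and their order in `ASSUMED_FIELDS` are hypothetical, modeled only on the eight elements listed on slide 13; the real file has more columns, so the published field list should be checked before use.

```python
import csv
import io

# Assumption: these names and this column order are illustrative only and
# mirror the elements shown on the slide, not the real Hathifile layout.
ASSUMED_FIELDS = ["htid", "access", "rights", "ht_bib_key",
                  "description", "source", "title", "oclc_num"]


def read_hathifile(stream, fields=ASSUMED_FIELDS):
    """Yield one dict per tab-delimited record, reading line by line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(fields, row))


def volumes_for_record(stream, ht_bib_key, fields=ASSUMED_FIELDS):
    """Return (volume id, enumeration) pairs for one HathiTrust record number."""
    return [(rec["htid"], rec["description"])
            for rec in read_hathifile(stream, fields)
            if rec.get("ht_bib_key") == ht_bib_key]


# One sample line built from the values on the slide.
sample = "\t".join(["mdp.39015076694507", "allow", "pd", "000529434",
                    "v.33 no.11 1900", "MIU", "The Chicago medical times",
                    "1554176"]) + "\n"
matches = volumes_for_record(io.StringIO(sample), "000529434")
print(matches)
```

For a real run, the `io.StringIO` sample would be replaced by an open file handle on the downloaded Hathifile, and the record number by the catalog ID of the journal of interest.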
  14. HathiTrust Bibliographic API
     • Meant for retrieving information about small numbers of items at a time
     • Returns bibliographic, rights, and volume information when given one or more standard identifiers (ISBN, LCCN, OCLC, etc.); overlaps with the Hathifile data
     • Brief example: http://catalog.hathitrust.org/api/volumes/brief/oclc/424023.json
     • Full example: http://catalog.hathitrust.org/api/volumes/full/oclc/424023.json
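
The two example URLs on slide 14 share one pattern: detail level, identifier type, identifier value, then `.json`. A small sketch of building such URLs follows. The function names `bib_api_url` and `fetch_volumes` are hypothetical helpers, and the URL pattern is inferred only from the two examples on the slide; `fetch_volumes` needs network access and is defined but not called here.

```python
import json
import urllib.request

# Base path taken from the example URLs on the slide.
BASE = "http://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, id_value, detail="brief"):
    """Build a single-identifier request URL ('brief' or 'full')."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BASE}/{detail}/{id_type}/{id_value}.json"


def fetch_volumes(id_type, id_value, detail="brief"):
    """Fetch and decode the JSON response (requires network access)."""
    url = bib_api_url(id_type, id_value, detail)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


brief_url = bib_api_url("oclc", "424023")
full_url = bib_api_url("oclc", "424023", detail="full")
print(brief_url)
print(full_url)
```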
  15. Hathi Metadata Recap
     • Administrative data about scans and corresponding volumes
     • Uses Hathi IDs to link to bibliographic data
     • Bulk bibliographic data
     • Some administrative data, e.g. rights information
     • Small requests for bibliographic data retrieved using standard identifiers (ISBN, LCCN, OCLC…)
  16. What we thought was the solution…
     • Use the Hathi Data API to find the table of contents for each volume
     • Gather the related OCR
     • Parse out article citation values from the OCR (hopefully in a mostly automated way)
     • Use the pagination data from the TOC to build links by mapping to pagination in the METS files
     • What couldn’t be automated would be done manually (with the projected outcome being a citation index with Hathi URLs that could be used to build an interface or given to an index like PubMed)
  17. Reality set in…
     Photo credit: ehive.com
  18. HathiTrust OCR
  19. The metadata continued to fight back…
     Photo credit: http://glpiggy.net/
  20. PubMed Indexing and API
  21. A Path for Automation
     For each citation already in PubMed for which HathiTrust has one volume:
     1. Search the PubMed <Volume> value and the Hathi catalog ID for The Cornell Veterinarian (000535347) against the Hathifile to get the corresponding Hathi object ID used in the METS.
     2. Use the METS object ID and the PubMed start page (the numeric value before the '-' in each PubMed article citation's pagination) to find the matching <ORDERLABEL> and get the <Order> number from the METS file.
     3. Create the URL to be added to the PubMed XML. The Hathi METS object ID and <Order> number are used to create the URL: the id in the URL equals the METS object ID, and the sequence number equals the <Order> number, e.g. http://babel.hathitrust.org/cgi/pt?id=coo.31924051143075;view=1up;seq=11
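
Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes that ORDER and ORDERLABEL are attributes on `div` elements in the standard METS namespace, and the `SAMPLE_METS` fragment is invented so that printed page 1 falls at scan sequence 11. The function names are hypothetical; the URL shape is copied from the example on the slide.

```python
import xml.etree.ElementTree as ET

# Standard METS namespace; assumed to apply to the HathiTrust METS files.
METS_NS = "{http://www.loc.gov/METS/}"


def start_page(pubmed_pages):
    """Return the value before the '-' in a pagination string like '1-9'."""
    return pubmed_pages.split("-", 1)[0].strip()


def order_for_page(mets_xml, page_label):
    """Return the ORDER of the first div whose ORDERLABEL matches, else None."""
    root = ET.fromstring(mets_xml)
    for div in root.iter(METS_NS + "div"):
        if div.get("ORDERLABEL") == page_label:
            return int(div.get("ORDER"))
    return None


def page_url(htid, seq):
    """Build a page-level link in the shape shown on the slide."""
    return f"http://babel.hathitrust.org/cgi/pt?id={htid};view=1up;seq={seq}"


# Invented fragment for illustration: unnumbered front matter, then page 1.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="physical">
    <METS:div TYPE="volume">
      <METS:div ORDER="10" TYPE="page"/>
      <METS:div ORDER="11" ORDERLABEL="1" TYPE="page"/>
      <METS:div ORDER="12" ORDERLABEL="2" TYPE="page"/>
    </METS:div>
  </METS:structMap>
</METS:mets>"""

seq = order_for_page(SAMPLE_METS, start_page("1-9"))
url = page_url("coo.31924051143075", seq)
print(url)
```

Note that this sketch returns the first matching label. A bound volume may repeat page labels (for example across issues or between front matter and body), so a production version would need a rule for choosing among several matches.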
  22. NCBI’s LinkOut Program
     • A service that allows third parties to link specific NCBI database records to relevant web-accessible resources
     • The relevant journal/publication must already have gone through the Medline selection process
     • Document Type Definition (DTD) for contributing links in XML
  23. PubMed Citation Data Requirements
     • The PubMed DTD specifies how the data should be formatted
     • Data tags (R = Required, O = Optional, O/R = Optional or Required). Required tags must be included; optional tags must be included only if the data requested appears in the print or electronic article. Optional or Required tags are dependent on the use of other tags
     • Tag names are case sensitive
  24. PubMed Citation Data Elements
     • File Header (R), ArticleSet (R), Article (R)
     • Journal (R), PublisherName (R), JournalTitle (R), Issn (R), Volume (O/R), Issue (O/R)
     • PubDate (R), Year (R), Month (O/R), Season (O), Day (O)
     • Replaces (O), ArticleTitle (O), VernacularTitle (O), FirstPage (O/R), LastPage (O), ELocationID (O/R), Language (O)
     • AuthorList (O/R), Author (R), FirstName (O/R), MiddleName (O), LastName (O/R), Suffix (O), CollectiveName (O), Affiliation (O), Identifier (O)
     • GroupList (O/R), Group (R), GroupName (R), IndividualName (O)
     • PublicationType (O), ArticleIdList (O/R), ArticleId (R)
     • History (O), Abstract (O), OtherAbstract (O), CopyrightInformation (O)
     • ObjectList (O), Object (O), Param (O)
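
To make the required/optional distinction concrete, here is a rough sketch that assembles a small citation record from a handful of the tags listed above. The nesting (Journal containing PubDate, which contains Year) is an assumption, since the slide lists the tags flat; the file header is omitted; and every value in `example` is a placeholder rather than real citation data. Output would need to be validated against the actual PubMed DTD.

```python
import xml.etree.ElementTree as ET

# Tags marked (R) on the slide that this sketch checks for.
REQUIRED = ("PublisherName", "JournalTitle", "Issn", "Year")


def build_article(citation):
    """Build one Article element; the nesting used here is an assumption."""
    missing = [tag for tag in REQUIRED if not citation.get(tag)]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))
    article = ET.Element("Article")
    journal = ET.SubElement(article, "Journal")
    # Optional tags are written only when a value is supplied.
    for tag in ("PublisherName", "JournalTitle", "Issn", "Volume", "Issue"):
        if citation.get(tag):
            ET.SubElement(journal, tag).text = citation[tag]
    pub_date = ET.SubElement(journal, "PubDate")
    ET.SubElement(pub_date, "Year").text = citation["Year"]
    for tag in ("ArticleTitle", "FirstPage", "LastPage"):
        if citation.get(tag):
            ET.SubElement(article, tag).text = citation[tag]
    return article


def build_article_set(citations):
    """Wrap Article elements in an ArticleSet and serialize to a string."""
    root = ET.Element("ArticleSet")
    for citation in citations:
        root.append(build_article(citation))
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; none of these are real citation data.
example = {"PublisherName": "EXAMPLE PUBLISHER",
           "JournalTitle": "The Cornell Veterinarian",
           "Issn": "0000-0000", "Volume": "1", "Year": "1950",
           "ArticleTitle": "Placeholder title", "FirstPage": "1"}
xml_out = build_article_set([example])
print(xml_out)
```

Tag names are written with exact capitalization because, as slide 23 notes, they are case sensitive.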
  25. In an Ideal World…
     Photo credit: http://www.priefert.com/
  26. The metadata that got away…
     • Pre-1945 issues not indexed by PubMed
     • Supplemental volumes*
     What we hope to do about it:
     • Manually capture the Hathi URLs for the supplemental volumes and provide them to PubMed using their linking format
     • Manually capture citation data for pre-1945 articles using the OCR files, and send it to PubMed using their indexing format
  27. Project Outcomes
     Soft:
     • Better understanding of what’s possible with the Hathi APIs
     • Better understanding of PubMed’s metadata/URL contribution requirements
     • Increased desire within the Cornell Library to consider greater return on our HathiTrust investment
     Concrete:
     • The Cornell Veterinarian should soon be available via PubMed for the years already indexed
     • Manually capturing the complete backfile for The Cornell Veterinarian to contribute to PubMed
  28. Future Considerations
     • Potential for improved access to other titles currently lacking full-text linking in PubMed [if in HathiTrust]
     • Investigations into other (non)full-text indexes and full-text repositories
     • New services for interacting with HathiTrust Digital Library
     • Potential improvements to the Hathi workflows
  29. Questions?
     Photo credit: ehive.com