Using Open Source Tools to Improve Access to Oral History Collections

Poster: Using Open Source Tools to Improve Access to Oral History Collections


Presented at the Library Technology Conference 2011 in St. Paul, MN.

Program Description: Oral history collections provide a wealth of information, yet current practices in metadata creation and access limit how much of the information in interview transcripts can be discovered. This poster describes the Miami University Libraries' current project, which uses open source software to create enhanced access to our Oral History collection. The Oral History Project at Miami University contains over 100 interviews pertaining to experiences at the University, with transcripts for over half of the interviews. The poster will describe the process of batch processing transcripts using OpenCalais, a web service that automates the creation of metadata for content using natural language processing and machine learning, and of displaying both the transcripts and the metadata in the content management system Drupal using various modules. We will discuss the results of comparing machine-generated and human-generated metadata in this project, and the benefits and concerns surrounding both methods. Future project developments will also be included.

Published in: Technology, Education

Using Open Source Tools to Improve Access to Oral History Collections
Becky Yoose, Bibliographic Systems Librarian, and Jody Perkins, Metadata Librarian
Miami University Libraries, Miami University, Oxford, OH

Miami Stories Oral History Project
  • Began in 2005, coordinated by University Archives; CONTENTdm collection maintained by Digital Initiatives
  • Current and former students, faculty, and staff, as well as friends of the University, share recollections of their Miami years
  • 100 videotaped interviews
    Average length of interview: 2 hours
    Half have been fully transcribed

OpenCalais
  • Released in 2008, used by various companies, news agencies, and publishers
  • Uses natural language processing and machine learning to extract categorized metadata (in RDF format) from full text documents
  • API, modules, applications available for different platforms

Drupal
  • Popular Open Source content management system (CMS) built with PHP
  • Used widely for web sites and blogs
  • Flexible and customizable, over 8,000 modules

Miami Stories OpenCalais Pilot

Human metadata creation workflow: Each interviewer had a cover sheet to list key terms and topics relevant to the interview. These terms were entered as keywords into item records and supplemented with FAST (Faceted Application of Subject Terminology) headings, a controlled vocabulary based on Library of Congress Subject Headings and related LC Authority files.

Human metadata creation issues: Data collected on cover sheets varied with the amount of time and number of staff available for a given interview; the sheets ranged from having no data to over 50 topics for a single interview. Name entries, even for interviewees, were inconsistent. The Libraries did not have the staff to go through 60+ interview transcripts and extract metadata manually.

Pilot project goal: Experiment with applications that automatically generate index terms from full text as a more efficient way to provide access points for this collection at the item level.

The OpenCalais API offered a number of advantages that made it ideal for this purpose. The faceting framework it employs makes use of categories that are essential to these kinds of historical collections: names, places and locations in particular.

Migrating transcripts into Drupal & batch processing using the Calais Drupal module in six [oversimplified] easy steps!
Step 1: Export XML from CONTENTdm to MySQL
Step 2: Import MySQL table using Table Wizard
Step 3: Migrate content into Drupal using Migrate
Step 4: Edit Calais node settings
        NB: We set the Relevancy Threshold to return the maximum number of terms for the project.
Step 5: Batch process transcripts using Calais
Step 6: Profit!

Observations
  • OC generates a much larger number of access points, but OC results also included a larger number of false hits/inaccuracies
  • OC categories provide a less granular browsing structure
  • Terms representing contextual and relational information are lacking in OC results
  • Certain aspects of the OC schema don't suit the content (many irrelevant categories), and there are numerous gaps when compared to the cataloger-created metadata
  • The meaning of many OC categories is ambiguous, making index terms difficult to interpret
  • Preservation and genre metadata are not captured (since OC only processes text)
  • Subject indexing seems to be a weakness of OC: it generated only a few very broad terms, though it did so with a great deal of accuracy
  • Name indexing (people, organizations, facilities and locations) seems to be a real strength of OC

Outcome
OC can provide substantial efficiencies when working with large volumes of full text, especially for collections where terms representing people, organizations, facilities and locations are deemed critical access points.

Possible Next Steps:

Data quality study: Measure data quality using established criteria for:
  • Aboutness / substantive coverage
  • Accuracy
  • Completeness
  • Context
  • Consistency
  • Interoperability
  • Usability

Integration, display, and sharing: Currently the Oral History Project is hosted on CONTENTdm; however, the Libraries are in the process of migrating several collections to DSpace. In light of this move, the metadata generated from this project, along with the videos, transcripts, and descriptive metadata, might be calling one of the following platforms "home" in the near future:
  • Drupal
  • Omeka
  • OpenWMS (…penwms/)

Want to learn more about the technical details of this project? Scan the QR code or visit for more information!

For further information
  • Miami Stories Oral History Project
  • OpenCalais
  • Drupal
  • Becky Yoose, Bibliographic Systems Librarian
  • Jody Perkins, Metadata Librarian

Sample of assigned subjects

CONTENTdm topics
  • Anti-Vietnam War protests
  • Black Student Action Committee
  • Butler Co. Sheriff
  • Faculty leaving the University
  • Faculty Senate
  • Gentle Revolution
  • Long hair and beards on men
  • Rowan Hall occupation
  • Voices of Reason

OpenCalais tags
  • War
  • The Organ (false hit)
  • Music
  • Education
  • Jim Zwerg (mis-categorized)
  • Politics
  • Religion
  • Technology (false hit)

Sample of name entries issues

Even though the following table shows some of the name entries issues from OC, it should first be noted that OC on average created more name entries than catalogers.

OpenCalais Name Entries    CONTENTdm Name Entries           OpenCalais Data Quality Issues
Charles Wilson             Wilson, Charles
Curtis Ellison             Ellison, Curtis [interviewer]    unqualified
Ed Branch                  Branch, Edgar Marquess, 1913-    OC indexed other form
Etheridge                  Etheridge, Robert                OC missed first name
Hitlers "Mein Kampf"       N/A                              OC false hit
Roland Delattre            Delattre, Roland
Roland DeLattro            N/A                              OC indexed spelling error
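
A comparison like the name entries sample could, in principle, be partly automated. The following is a minimal Python sketch of one way to do that and is not the workflow used in the pilot. The function names `normalize_name` and `compare_entries` and the normalization rules (dropping bracketed qualifiers and trailing life dates, inverting "Last, First") are assumptions made for illustration; the input lists are the entries from the poster's sample.

```python
import re


def normalize_name(entry: str) -> str:
    """Reduce a name entry to a lowercase 'first last' key (hypothetical rules)."""
    # Drop bracketed qualifiers such as "[interviewer]".
    entry = re.sub(r"\[[^\]]*\]", "", entry)
    # Drop trailing life dates such as ", 1913-" or ", 1913-2006".
    entry = re.sub(r",\s*\d{4}-(\d{4})?\s*$", "", entry.strip())
    # Invert "Last, First" into "First Last".
    if "," in entry:
        last, first = (part.strip() for part in entry.split(",", 1))
        entry = f"{first} {last}"
    # Collapse whitespace and ignore case.
    return " ".join(entry.lower().split())


def compare_entries(machine, human):
    """Split two lists of name entries into matched and unmatched keys."""
    m = {normalize_name(e) for e in machine}
    h = {normalize_name(e) for e in human}
    return {
        "matched": sorted(m & h),
        "machine_only": sorted(m - h),
        "human_only": sorted(h - m),
    }


# Entries taken from the poster's sample of name entries.
oc_names = [
    "Charles Wilson", "Curtis Ellison", "Ed Branch", "Etheridge",
    'Hitlers "Mein Kampf"', "Roland Delattre", "Roland DeLattro",
]
contentdm_names = [
    "Wilson, Charles", "Ellison, Curtis [interviewer]",
    "Branch, Edgar Marquess, 1913-", "Etheridge, Robert", "Delattre, Roland",
]

result = compare_entries(oc_names, contentdm_names)
for group, names in result.items():
    print(f"{group}: {names}")
```

Exact matching after normalization pairs only the entries that differ in word order or qualifiers. Variant forms ("Ed Branch"), missing first names ("Etheridge"), spelling errors and false hits all end up in the unmatched groups, so a person would still need to review those.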