So, hello everyone. Thank you for coming together to meet with me today. I’m going to try my best to supplement your topic this week, exchange and quality control for metadata, with some real-world examples, and then we should have plenty of time at the end for questions.
So, of course, first a brief introduction of myself. My name is Meghan Finch and I am the Metadata and Digital Media Librarian here at Wayne State. I am also a graduate of the Wayne State School of Library and Information Science. I started in this position in March of this year, so at this time I am about 8 months into the job, though I did work on metadata projects here at Wayne prior to receiving the actual title. Because I received my position fairly recently, I still have my job description as it was provided to me in the job application, so I’ve summarized that for you here. I’m basically responsible for providing access to archival and special collections and digital projects that come from those collections. I work with what was formerly the Digital Library Initiative Team, although a recent reorganization of the library system has changed that up a bit. And I specifically work with the tasks involved in automating the process of electronic theses and dissertations ingestion to both our traditional catalog and our institutional repository.
And just to give you a quick look at what I’ve done in those 8 months, I’ve got a quick list of some projects that I’ve pushed forward and some of the tools that I use to accomplish those tasks. I’ve worked with the Scholarly Communications Librarian to help automate the task of getting our electronic theses and dissertations available in our institutional repository and our catalog. I’ve worked on a project to migrate more than 12 large digital collections, one of which contains more than 100,000 records, to another platform. And more recently, I’ve been working with some oral histories that were created for the Cass Corridor. Some tools that I work with include MarcEdit, the Oxygen XML Editor and its included XSL processors, and Omeka, a collections platform I’ll be discussing a little later. And really, you would be surprised how much of my effort involves dealing with data in either Excel or FileMaker formats, with most of my time spent trying to get it out of those formats.
So, we’re discussing exchange and quality this week, and the first topic that I’d like to address is the access and exchange part of metadata work, and then that will lead us to some quality assessment. Here at Wayne State, we have two different platforms for our electronic content: we have our institutional repository and we have our digital collections. The line that has been drawn by the Wayne State Library System is that the institutional repository holds things that are created at Wayne State by those associated with Wayne State, while our digital collections are often projects that deal with things that we collaborate with others to present. I’ll start first with our institutional repository. We use Digital Commons, a suite provided by Bepress (the Berkeley Electronic Press). It’s not a free service, it’s one that we pay a subscription fee for. In that subscription fee, we get a fully hosted institutional repository solution that includes support for search engine optimization, design, and preservation.
In terms of exchange and access, Digital Commons really promotes itself as a solution for making its contents accessible. Search engines crawl all of the content in Digital Commons, and I believe that at least 70%, if I recall the latest stats correctly, of our hits to items in the IR come from Google.
Digital Commons is an OAI accessible repository. I’m sure you’ve probably heard of OAI, but just to drill it in a little more, since it is an important resource for a metadata librarian, OAI-PMH is what we are interested in here, and that’s the Open Archives Initiative Protocol for Metadata Harvesting. It’s a framework that makes it easier to exchange metadata in community shareable formats. So any item’s metadata can be harvested out of Digital Commons as oai_dc metadata, and Bepress does the work of mapping their unique metadata elements to the defined oai_dc schema that is required for OAI.
Which, to be clear, is just the standard 15 Dublin Core elements.
And we have the ability to provide access to a lot of different types of data sets and content in Digital Commons. We have our Electronic Theses and Dissertations in here, we also have some WSU Press journals in here and are looking for open access journals at Wayne to add, and we can preserve and provide access to events as well. All of them findable on the web, archived, and preserved.
My involvement with Digital Commons has been limited so far. We have had a Scholarly Communications Librarian in place for a while, and he has handled most of the communication and planning with Bepress up to now. The question of who is creating the metadata for Digital Commons depends on the type of project. If the question is, more directly, does a metadata librarian create the content for Digital Commons here at Wayne, the answer is no. For example, the students submitting their theses and dissertations enter the metadata themselves through the interface made available by ProQuest. And I did want to talk about how that system is put together, because there’s a ton of metadata exchange that goes on here that prompted some of the work that I’ve been involved in.
So, the graduate students submit their work to the ProQuest Dissertations & Theses Database through the Graduate School. The fields that they use are metadata fields defined by ProQuest and their internal schemas.
ProQuest creates a file bundle that they FTP over to us. These zip files include the PDF copy of the article, and an XML file that contains the metadata for the article, in ProQuest’s own XML schema for theses/dissertations. So we have an interesting exchange so far. The metadata is submitted by the author of the piece to ProQuest. ProQuest sends us a copy of that submission, formatted in their proprietary format. It goes from us, to ProQuest, and back to us, and we must then exchange it with Digital Commons, which means transforming ProQuest’s XML schema to Bepress’ XML schema.
So this is a project that I was able to contribute to. The University of Iowa published some XSL that they had created with the purpose of taking ProQuest XML and converting it both to Bepress XML for ingestion into Digital Commons and to MARCXML to eventually be ingested into an ILS. (I’ll share a link to those later for those interested.)
I took a look at those XSL transforms, and from these we created two major new things that improved upon the process.
The first was to write a shell script that automated the entire process. Once we receive the zip files from ProQuest, a script automatically unpacks them, does some character encoding changes that are necessary, completes the XSL transforms, and even runs MarcEdit from the command line to get the MARCXML into an actual MARC format.
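As a rough illustration of the first steps of that pipeline (unpacking a bundle and normalizing the character encoding), here is a minimal Python sketch. It is not the shell script described above. The function names, the ISO-8859-1 source encoding, and the bundle layout are all assumptions made for the example, and the XSL transforms and the MarcEdit step are only marked with a comment because they depend on external tools.

```python
import re
import zipfile
from pathlib import Path


def unpack_bundle(zip_path: Path, work_dir: Path) -> dict:
    """Unpack one bundle and return the XML and PDF files found inside it."""
    target = work_dir / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as bundle:
        bundle.extractall(target)
    return {
        "xml": sorted(target.rglob("*.xml")),
        "pdf": sorted(target.rglob("*.pdf")),
    }


def normalize_to_utf8(xml_path: Path, source_encoding: str = "iso-8859-1") -> None:
    """Rewrite an XML file as UTF-8.

    The source encoding is an assumption for this sketch; check the XML
    declaration of the files you actually receive.
    """
    text = xml_path.read_text(encoding=source_encoding)
    # Update the XML declaration (double-quoted form) so it matches the new bytes.
    text = re.sub(r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1)
    xml_path.write_text(text, encoding="utf-8")


def process_bundle(zip_path: Path, work_dir: Path) -> dict:
    """Unpack a bundle and normalize its metadata files."""
    found = unpack_bundle(zip_path, work_dir)
    for xml_path in found["xml"]:
        normalize_to_utf8(xml_path)
        # The XSL transforms and the MARCXML-to-MARC conversion would run
        # here; they are left out because they rely on external tools.
    return found
```

Calling `process_bundle(Path("bundle.zip"), Path("work"))` on a hypothetical bundle would leave UTF-8 XML files in `work/bundle/`, ready for the transform step.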
The second task was a more labor intensive one, but definitely something that you run into as you are exchanging data. When we receive the ProQuest metadata, ProQuest assigns categories to the articles that fit their own controlled subject vocabulary. Bepress, of course, has its own controlled subject vocabulary as well. We needed a way to change the subjects from ProQuest to Bepress. And so I undertook the task of creating an XML subject mapping. This map is external to the XSL, but it is called in the XSL when our script is run. It looks up each ProQuest subject number assigned, and then maps it to the closest equivalent subject in Bepress.
We try to automate everything as much as possible, but the initial building of the subject mapping was one that took time. I had to first acquire the taxonomies from both institutions, and then basically go through the ProQuest subjects one by one to map them to the available Bepress subjects. Do I think there may have been ways to automate this as well? I do, but the problem that we suffer a bit with this is that many of the subjects are not exact equivalents. Some ProQuest subjects are more specific, some less specific, and so a certain amount of human intervention was required, regardless, to make sure that these mappings made sense. I think this project is a good example of an exchange that will not be a purely OAI Dublin Core type of trade. I think you find often in this job that you aren’t necessarily working within those ideal open archive schemas to share to the world, and often you’re just trying to exchange data inside the institution from one platform to another. And though I’m aiming to address metadata exchange here, I think the subject mapping is a fine example of the quality control you may need to deal with as well. Here, my map needs accuracy. That’s what will make this quality data.
I need the values to all be exactly correct, down to the letter case, and I need things to be correctly mapped together to complete the exchange.
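To make the idea of an external subject map concrete, here is a small Python sketch. The real map is an XML file called from the XSL; its element and attribute names are not given in this talk, so the `subjectMap`/`map` layout below is hypothetical, and the two code-to-subject pairs are illustrative placeholders rather than entries copied from the actual mapping. The lookup is an exact, case-sensitive match, which is why the values in the map need to be exactly right.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for an external subject map (an assumption for this sketch).
SUBJECT_MAP_XML = """<subjectMap>
  <map proquest="0399" bepress="Library and Information Science"/>
  <map proquest="0413" bepress="Music"/>
</subjectMap>"""


def load_subject_map(xml_text: str) -> dict:
    """Read the map into a dict keyed by ProQuest subject code."""
    root = ET.fromstring(xml_text)
    return {m.attrib["proquest"]: m.attrib["bepress"] for m in root.iter("map")}


def map_subjects(codes, subject_map):
    """Return (mapped Bepress subjects, ProQuest codes with no mapping)."""
    mapped, unmapped = [], []
    for code in codes:
        target = subject_map.get(code.strip())
        if target is None:
            # Surface gaps instead of guessing; a person decides the mapping.
            unmapped.append(code)
        elif target not in mapped:
            mapped.append(target)
    return mapped, unmapped
```

Returning the unmapped codes separately keeps the human review step visible: anything that lands in that list needs a decision before ingest.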
So then, how about what we do for access and exchange with our digital collections? Currently Wayne State uses DLXS, or the Digital Library Extension Service.
DLXS is open source software offered by the University of Michigan. DLXS was conceived in 1996… Yeah. I feel like that might be all there is to say about that. I mean, in seriousness though, it was created as a suite of tools and middleware to enable institutions to share their digital collections. And U of M used this for some of their really big collections, the Making of America project and the Humanities Text Initiative. It’s also designed with EAD in mind for finding aids. University of Michigan, up to this year, has been maintaining and improving DLXS with new and better features. DLXS is capable of sharing its data through OAI. The schema is versatile and will really hold any metadata schema that you wish to implement. But the tools are still fairly old, and this year U of M announced that they will no longer be developing for it. The way that the data is stored in DLXS, it is not readily available to be crawled by search engines. Loading data, adding data, activities like that, are very difficult in DLXS.
Our DLXS implementation is used for digital collections of content not necessarily created at or by Wayne State University. For example, we host the Reuther Archive’s “Virtual Motor City” collection, which is a large collection of photojournalism originating from the Detroit News archive. We also have a digital dress series, a collaboration bringing together costume collections held at many different cultural institutions around the Detroit area, including the Detroit Historical Museum, the Henry Ford, and a few others. This also includes special collections held in the WSU library collections, but not created by WSU affiliates. For example, we have a collection of letters and ephemera by Florence Nightingale featured in our digital collections.
Who created the metadata? A lot of people.
Students, professionals at those submitting institutions, volunteers. The amount of metadata and its quality are very diverse in the digital collections.
I have a couple of examples to try to showcase the variety of our collections. The first instance is from Virtual Motor City, our largest collection. These fields are a mishmash of all sorts of “authorities.” The contents of the collection are originally from the Detroit News archive, which, if what I have heard is correct, is where the original FileMaker Pro database of the items was created. That database, along with the physical collection, moved over to the Reuther archive, where student workers were employed to select and scan images, as well as improve metadata where possible. I believe metadata was added by Reuther archivists as well. The data then headed over to the library, where the former metadata librarian also worked to improve the metadata. So, the fields you see are the result of a long process involving a lot of hands.
My next item is from the Detroit Historical Museum’s Costume Collection. I apologize that at this time I don’t have as much historical background on this collection as the VMC, but you can see the fields selected by Detroit Historical to describe this particular collection, which is one of several textile ephemera collections that are contained in Wayne State’s digital collections. They differ from the VMC in some ways, and are similar to it as well.
We have OAI open to share the data as oai_dc. We also are harvested by OCLC’s WorldCat Digital Collection Gateway, and many of our collections are featured through DALNET. Each collection has a unique set of metadata elements that are available. These are decisions that were made prior to my appointment, but I can say that currently, each collection uses the fields that were created by the submitting institutions, and those are later mapped to oai_dc for sharing through the DLXS interface.
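For anyone who has not seen what an OAI-PMH harvest looks like from the harvester’s side, here is a minimal Python sketch. `ListRecords`, `metadataPrefix`, and `set` are standard OAI-PMH request arguments, and the namespaces are the standard ones for OAI-PMH 2.0 and oai_dc. The base URL, the set name, and the sample response are made up for illustration; they are not Wayne State’s actual endpoint or data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up response, trimmed to the parts the parser below reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:vmc-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample photograph</dc:title>
          <dc:date>1955-06-01</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's OAI base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Pull the identifier and the Dublin Core fields out of each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        fields = {}
        if dc is not None:
            for element in dc:
                name = element.tag.split("}", 1)[-1]  # strip the namespace
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records
```

A real harvester would also fetch the URL and follow resumption tokens for large sets; both are left out here to keep the sketch short.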
So here is a bit of a comparison between our two platforms. Both platforms are fairly flexible in their metadata fields and offer OAI exchange. A problem that we saw, however, was that DLXS content was not being found through Google or any other web-wide search engines. This problem, along with several other issues of loading data and ease of use, combined with the knowledge that U of M would no longer be working on the platform, was enough for the library system to decide to evaluate other available platforms to replace DLXS. Right before I joined Wayne State, a team was formed to assess different platforms for digital collections to find that replacement. Criteria were developed by a team of librarians who contributed what they thought would be necessary or beneficial features, but one of the major factors that weighed in was search engine accessibility. Access and sharing were the most highly weighted criteria.
The platform that was chosen was Omeka. It’s open source, designed for Dublin Core but built to accommodate other schemas, and flexible. And, search engine crawlable.
So, with these in mind, I set out, with feedback from the Digital Migration Team, to determine what schema we would use with Omeka, what controlled vocabularies we would expect from future projects, and what we were going to do with all the data that we have from previous projects that do not comply with the controlled vocabularies selected.
Our goals here in the assessment were to select a schema and controlled vocabularies that could be used by others when harvesting our data and interpreted and mapped by their systems if needed, and also to determine which of those schemas and controlled vocabularies were going to work for our data, both what we have now and what we want for the future.
Because of the size of our collections, we couldn’t, you know, just take a day to look through each metadata record and make a decision from that. So we decided to take a sample of about 5-6 records from each collection. We tested each collection by mapping those records to Dublin Core, Qualified Dublin Core, and VRA Core.
The selection of our platform influenced what schemas we wanted to test. Omeka is open, you can put any schema you want into it with a little PHP knowledge. But it is designed with Dublin Core in mind. The element set that comes standard with Omeka is the 15 base Dublin Core elements, and one of the better plugins for Omeka is the Qualified Dublin Core extension. So, I guess you could say that we wanted Dublin Core to work, for various reasons. But we were also aware that our collections were diverse and may really have needs beyond what Dublin Core could provide.
And following this initial assessment, we then determined that any changes that we make should be made through batch processing if possible, given the size of the existing collections. So in that stage, we would need to assess tools capable of making the changes that we require.
Like I mentioned, one of the first things that we did when we started talking about migrating the existing digital collections to Omeka was to take a sample from every collection that we have, and try to crosswalk that to Dublin Core, Qualified Dublin Core, and VRA Core. This was an effort to get an idea of what type of data we have, and also a chance to get a cross section of what we might expect while reviewing the data.
What we found was that we had several kinds of issues with our metadata. In most cases, controlled vocabularies were not used. The values found in an element like “subject” varied so greatly across collections, even just the concept of what a subject was, that we felt that would need work from us to improve quality.
Another big issue that we saw with our collections was not so much about controlled vocabulary and more about values that were confusing.
And I thought I’d provide a little example of just some of the things that we found that we felt needed correction. In this example, a value for the element “donor” was entered as “see above.” You can also see the comments that my colleagues and I left there for each other regarding the value. I mean, besides the fact that a statement like “see above” has no meaning in an online database environment, I find it very fun that none of the other test records we pulled have a value at all, so it doesn’t even seem like there could be anything to “see above.” We found many instances of outdated terminology that did not translate in an online environment, descriptive titles that provided incorrect information, all sorts of stuff. I think it was at this point that we determined that it would be worth it to go through each collection before loading it to Omeka. We decided to review the values for adherence to controlled vocabularies and standards of data entry. And this comes from various sources of standards.
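The sampling and the “see above” problem lend themselves to a small scripted check. The Python sketch below is an illustration only, not the process we used: the list of placeholder values, the function names, and the record layout (a plain dict of field name to value) are all assumptions. It draws a small sample from a collection and flags values that are empty or that only make sense on a paper form.

```python
import random

# Values that carry no meaning in an online record. This list is an
# assumption for the sketch; extend it as new cases turn up in review.
PLACEHOLDER_VALUES = {"see above", "see below", "same as above", "ditto"}


def sample_records(records, size=5, seed=0):
    """Pick a small, repeatable sample of records from one collection."""
    if len(records) <= size:
        return list(records)
    return random.Random(seed).sample(list(records), size)


def flag_values(record):
    """Return (field, problem) pairs for empty or placeholder values."""
    problems = []
    for field, value in record.items():
        cleaned = (value or "").strip()
        if not cleaned:
            problems.append((field, "empty"))
        elif cleaned.lower() in PLACEHOLDER_VALUES:
            problems.append((field, "placeholder"))
    return problems
```

For a record like `{"title": "Evening dress", "donor": "see above", "subject": ""}`, `flag_values` reports the donor as a placeholder and the subject as empty. A check like this only finds candidates; deciding what the value should be is still a human task.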
One decision that we made here was to try to be flexible, because our data is so varied, because it comes from so many different environments, and because we didn’t want to be overly rigid and say “everything has to fit this,” you know, kind of like cramming other body parts into a pair of shoes because we’ve got one pair of feet. So Library of Congress Subject Headings, Art and Architecture Thesaurus, MeSH subjects, FAST subject headings, those are all open game at this point for our collections.
But in areas where it did not serve to be flexible, like the format for a full date or a geographical location, we had to make firm decisions as to what standards we intended to adhere to.
We found that Dublin Core, especially if using the larger qualified set, worked well with most of our collections. Where we did find Dublin Core lacking, however, was in its ability to refine elements of location and dates, or temporal and spatial coverage, to use the Dublin Core terms. Too many of our fields fell into the temporal or spatial coverage elements of QDC, and without being able to refine them, they started to not make sense.
VRA Core, on the other hand, had plenty of refinements for location and dates.
And what we decided on creating as our schema was a metadata application profile. You probably came across this in your readings, but to define it in brief, an application profile allows you to take the elements that you need from various metadata schemes to make a more complete record for your data. There are two really prominent models for application profiles in the field right now. Those are the Dublin Core application profile model, and METS, a container schema developed by the Library of Congress and the Digital Library Federation specifically to hold several metadata schemas.
We decided to stick with Dublin Core and Qualified Dublin Core as our main elements, and elected to use select elements from VRA Core, specifically to create more robust location and date elements.
We also decided at this point to support various controlled vocabularies for subjects. However, we’ve narrowed the date field to use W3CDTF, as recommended by Dublin Core standards. This will mean that we will be making major changes to date formatting in all of our previously acquired data. I’m hoping to be able to use some tools to automate this process, but we’ve yet to really get to that point.
Working with VMC metadata currently, I’ve also not yet made a decision on our intended geographic location data. There are several sources for standards, including ways to include latitude and longitude, so I will begin working with that once we have a collection with need for that data manipulation.
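Since the date cleanup has not been automated yet, the following Python sketch is only one possible way to approach it, under stated assumptions. W3CDTF allows year, year-month, and full-date precision (YYYY, YYYY-MM, YYYY-MM-DD), so the sketch tries a short list of input patterns and keeps whatever precision the source value actually has. The list of input patterns is an assumption; the real collections would need their own list, and the month-name patterns assume English month names. Anything that does not match is returned as `None` so that a person can look at it.

```python
from datetime import datetime

# (input pattern, W3CDTF output pattern) pairs, tried in order.
# The input patterns are assumptions made for this sketch.
FORMATS = [
    ("%Y-%m-%d", "%Y-%m-%d"),
    ("%m/%d/%Y", "%Y-%m-%d"),
    ("%B %d, %Y", "%Y-%m-%d"),
    ("%Y-%m", "%Y-%m"),
    ("%B %Y", "%Y-%m"),
    ("%Y", "%Y"),
]


def to_w3cdtf(value):
    """Return the date in W3CDTF form, or None if no pattern matches."""
    text = value.strip()
    for input_format, output_format in FORMATS:
        try:
            parsed = datetime.strptime(text, input_format)
        except ValueError:
            continue
        return parsed.strftime(output_format)
    return None
```

Values such as “circa 1955” deliberately fall through to `None`: whether an approximate date becomes a plain year, a range, or a note is a cataloging decision, not something to guess in a batch job.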
So, now I want to talk about being OAI compliant and how that works with an application profile. To be OAI compliant, you are required to have oai_dc metadata fields mapped to the data that you are sharing. These are the basic 15 Dublin Core fields without any qualifiers. But the thing is, it is not the only way to share your metadata through OAI, it’s just the required one. Through OAI you can share basically any metadata schema, including METS application profiles.
The nice feature of OAI in Omeka is that your OAI isn’t mapped until it’s on its way out. In other words, I could format all of my data as VRA Core. But in the OAI plugin, I need only map my VRA Core elements to their equivalent oai_dc elements (if they have one), and the data will be available to harvest in that format. At the same time, I can share the VRA Core as VRA Core. Or, I can add elements in Dublin Core, Qualified Dublin Core, and VRA Core, and share them all in a METS profile, and share that through OAI. So that is what I have been trying to do as we develop Omeka, our metadata schemas, and the ways that we exchange and share our data.
I feel that by providing what is required in a simple format as well as offering a more complex version of the data, we can provide greater quality to harvesters.
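The idea of mapping “on the way out” can be sketched in a few lines of Python. This is not Omeka’s OAI plugin, which is written in PHP; it is just an illustration of the logic. The stored record keeps its full element names, and a crosswalk collapses them to the 15 unqualified Dublin Core elements only at export time. The `dcterms` entries follow the standard refinements (spatial and temporal refine coverage, extent and medium refine format, alternative refines title); the `vra.location` and `vra.date` keys are hypothetical names for the VRA Core elements in the profile.

```python
# The 15 unqualified Dublin Core elements required for oai_dc.
DC15 = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Crosswalk from richer elements to their oai_dc equivalents.
# The "vra." keys are hypothetical names used only in this sketch.
CROSSWALK = {
    "dcterms.spatial": "coverage",
    "dcterms.temporal": "coverage",
    "dcterms.extent": "format",
    "dcterms.medium": "format",
    "dcterms.alternative": "title",
    "vra.location": "coverage",
    "vra.date": "date",
}


def to_oai_dc(record):
    """Collapse a stored record (element name -> list of values) to oai_dc."""
    output = {}
    for element, values in record.items():
        if element.startswith("dc.") and element[3:] in DC15:
            target = element[3:]
        else:
            target = CROSSWALK.get(element)
        if target is None:
            # No oai_dc equivalent: the value stays in the stored record
            # and can still be shared in a richer format.
            continue
        output.setdefault(target, []).extend(values)
    return output
```

Because the stored record is never altered, the same data can be exported as simple oai_dc for harvesters that only want the required format and in full for those that can use the refinements.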
Our first test collection is Virtual Motor City. We selected this collection because it is our largest and will hopefully give us a good sample of material for us to test, find issues, and resolve. The issues that we found with VMC include:
Incorrect information in the descriptive titles. Basically, the photos from the Detroit News came in envelopes, and each envelope had something written on it that was kind of a title, kind of subject driven, and often wrong. There were several photos in one envelope, so we’ve had some issues like, the envelope would say it contained “St John’s Church, interior and exterior.” On an individual photo, having that be the title, when it is only a photo of the inside, is misleading. We received an email last week from a user who found a photo of Mayor Cavanagh titled as “Governor Romney.” Unfortunately, at this point, this is really an issue that we must address record by record, rather than in a batch.
No subjects for many of the objects.
Character encoding chaos. VMC started as a FileMaker Pro database. FileMaker is a little finicky about how it is willing to export data. For the comma separated files we wanted, we had only the option of UTF-16 rather than UTF-8. So once the objects were exported from the database using an AppleScript developed by the Digital Initiatives Team, I then had to work to correct all of the character encoding issues to get it to UTF-8. For this, I used a combination of character encoding scripts in Notepad++ and Excel macros.
We’d also really like to get a handle on single controlled vocabularies for our types and formats.
Once we select the vocabularies we intend to use, we are interested in using tools like Google Refine to automatically assign and map these vocabularies using linked data sources.
And for the future, I must say that one of my pet projects would be to test out transitioning from Library of Congress Subject Headings, which are currently applied to the VMC, to FAST, or the Faceted Application of Subject Terminology, developed by OCLC in cooperation with the Library of Congress. FAST is designed to function as facets, which Omeka has a plugin to facilitate as well. OCLC has a new prototype out that allows for automatic conversion of traditional LCSH to FAST headings in MARC. That would be interesting to test, to see if we could get that to work pre-ingest into Omeka, or if we could even get a plugin working in Omeka that would break down the subjects into facets during or post-ingest, with types of data other than MARC.
Another project for the future is administrative and preservation metadata. There is not a lot of data currently available for these collections that describes where they’ve gone, what they’ve been stored in, who’s added things to the metadata record. And then we have little info for retention and preservation.
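Going back to the character encoding issue with the VMC export for a moment: the conversion itself was done with Notepad++ and Excel macros, but the same UTF-16 to UTF-8 step could be scripted. The Python sketch below is one hedged way to do it. The function name and file paths are hypothetical, and it assumes the exported file starts with a byte order mark, which is how Python’s `utf-16` codec decides the byte order.

```python
import csv


def convert_csv_utf16_to_utf8(source_path, target_path):
    """Copy a UTF-16 CSV export to a UTF-8 CSV and return the row count.

    Assumes the source file begins with a byte order mark (BOM); without
    one, Python falls back to the platform's native byte order.
    """
    rows = 0
    with open(source_path, encoding="utf-16", newline="") as source, \
            open(target_path, "w", encoding="utf-8", newline="") as target:
        writer = csv.writer(target)
        # Going through the csv module keeps quoted commas and embedded
        # line breaks inside field values intact.
        for row in csv.reader(source):
            writer.writerow(row)
            rows += 1
    return rows
```

A call such as `convert_csv_utf16_to_utf8("vmc_export.csv", "vmc_utf8.csv")` would produce a file ready for batch cleanup tools. It does not repair characters that were already damaged before the export; those still need to be found and fixed separately.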
Okay, and that’s about all that I’ve brought to talk about today. For a list of references to some of the resources I’ve mentioned today, feel free to scan the QR code here or follow the link below it to a shared Google Doc. And I’ll check now to see if anyone typed out questions while I was talking.
And then, I think we’ll open the floor for questions?
Guest Lecture: Exchange and QA for Metadata at WSU
Exchange & Quality Control for Wayne State Metadata
Meghan Finch
Metadata Librarian
Wayne State University
11/21/2011
Meghan Finch
• Metadata and Digital Media Librarian
• Started in March 2011 (8 months into the job)
• Job description:
– Managing and contributing cataloging and metadata for archival and digital projects
– Creating Encoded Archival Description (EAD) finding aids and associated style sheets for archival collections
– Working collaboratively with the Digital Library Initiative Team to describe digital objects and create metadata formats appropriate to various delivery platforms
– Participating in library teams and special projects related to cataloging, digital projects, and bibliographic control
– Converting and preparing metadata for Electronic Theses and Dissertations (ETD) for import to the digital repository
Projects I’ve worked on, Tools I use
• Electronic theses and dissertations (ETDs)
– altered existing XSLT to transform data from ProQuest to bepress for ingestion in
– contributed to shell script to automate production of bepress XML and MARC records
– working on new XSLT to convert existing MARC records to bepress
• Digital Collections migration
– assessed existing metadata for ~12 collections
– selected metadata scheme(s) for use in new platform
– adding descriptive metadata to improve collections
• Oral histories
– MARC records
– Converting files for use online
IR vs. Digital Collections
Digital Commons
• Search Engine Optimization managed by Bepress remotely
• Flexible metadata schema (Bepress created and managed)
• OAI repository
DLXS
• Is NOT crawled by search engine
• Flexible metadata schema, ugly and cruel load tables
• OAI repository
Standards and Guides I turned to
• CDPDCMBP (the most incomprehensible of all acronyms!)
• The sources of the schemas: Dublin Core, VRA Core, METS, etc.
• NISO
• Controlled vocabularies: LCSH, AAT, DCMITYPE
• Digital initiative wikis: Ball State
Process of Evaluation
• Determine standards that work for users AND work for the collections
• Take a sample of existing metadata and test out standards
• Find ways to make changes in batches, not individual records. Changes through data manipulation, not data entry.
Columns: WSULS Field; Description; CFAI Field; Notes; AFT Field; Notes; Detroit Historical Costume Field; Notes; Dorthea June Grossbart Field
• dc.contributor: Use for general people, corporations, organizations that were involved in the creation/development of the item. Fields and notes: CONTRIBUTOR, CONTRIBUTOR2, CONTRIBUTOR3
• dc.coverage: Use for general, non-specific locations, dates and times related to the resource. See dc.date, dcterms.spatial, dcterms.temporal and the VRA location elements for more specific dates and locations. Fields and notes: coverage
• dc.creator: Use for the creator(s) of the item. Agents not considered responsible for the creation of the item can be described by the dc.contributor or the more specific MARC relator codes. Fields and notes: CREATOR, CREATOR2, CREATOR3; dc:creator
• dc.date: Use for date of creation. Can be specific, range, bulk. Fields and notes: ORGDATE, orgdate2; dc:date; EARLYDATE, LATEDATE
• dc.description: Use for general descriptions of the item; an account of the content of the resource. Fields and notes: DJG_do, DJG_cl, DJG_co, DJG_dl; DESCRIPTION; dc:description; DESCRIP; DJG_ds
• dc.identifier: Use for any alpha-numeric identifier used with a physical object. Do not use with identifiers specific to the digital object. Fields and notes: identifier; dc:identifier; OBJECTID; DJG_fn
• dc.language: Use to indicate the language(s) of the resource. Not to indicate the language of the descriptive metadata. Fields and notes: language, language2
• dc.publisher: Use for the identified publisher. Fields and notes: PUBLISHER, PUBLISHER2; dc:publisher
• dc.relation: Use to describe general relations. See DCTERMS relations for more specific relation. Fields and notes: relations, relation2
• dc.rights: Use to indicate rights. Can include name of rightsholder, contact information. Fields and notes: rights; dc:rights
• dc.source: Use to describe another resource from which the source is derived (i.e. the title of the journal where the article appears, the title of a book from which the image was scanned). Fields and notes: source; dc:source; COLLECTION
• dc.subject: Use for subject terms. Keywords or controlled vocabularies. Recommended LCSH, AAT, MeSH. Fields and notes: SUBJECT1, SUBJECT2, SUBJECT3; dc:subject; dc:relation; GPARENT, PARENT, SUBJECTS; not sure if dc:relation is really a relation
• dc.title: Use for the primary title of the item. Fields and notes: TITLE; dc:title
• dc.type: Use to describe the nature or genre of the resource (i.e. photograph). Do not use for file type. Fields and notes: type, type2; dc:Item_Type
• dcterms.alternative: Use for alternative titles. Fields and notes: OBJNAME
• dcterms.extent: Use to provide physical description. Can include measurement, dimensions, page numbers, number of pieces. Fields and notes: DIMNOTES; Only one example to base on. Could be a dc.description if all occurrences of element are not heel measurements
• dcterms.isformatof
• dcterms.ispartof
• dcterms.isreferencedby
• dcterms.isreplacedby
• dcterms.isrequiredby
• dcterms.isversionof
• dcterms.medium: Use to describe the material or physical carrier of the resource. Fields and notes: DJG_fc, DJG_fi
Omeka
• Dublin Core
• Item Type Metadata
• WSU Marc Relator Codes
• WSU VRA Core
The Current, the Future
• Controlling the vocabs with:
– Google Refine
– Excel
– MARCEdit
– Linked data authorities
• Admin & preservation data
• FAST subject headings