The Core Metadata Project at the University of Wisconsin-Milwaukee aimed to harmonize legacy metadata across digital collections for improved discovery and interoperability. The project involved evaluating existing metadata practices, documenting schema mappings, identifying fields for standardization, remediating collection records, and creating documentation. The end goal was to produce metadata that better conforms to external guidelines and enables records to be more shareable through channels like the Digital Public Library of America.
Core Metadata Project: Harmonizing legacy metadata for the future
NATHAN HUMPAL, CATALOG AND METADATA LIBRARIAN
ANN HANLON, DIGITAL COLLECTIONS AND INITIATIVES
UNIVERSITY OF WISCONSIN-MILWAUKEE
Core Metadata Project
Pre-planning and evaluation
◦What Dublin Core fields did we use?
◦What field names did we use?
◦What field names were mapped to what Dublin Core elements?
Digital Collections: The Libraries at UWM have been creating digital projects since 2002 - there are now fifty-four digital collections available with more than 130,000 digital objects, and more on the way.
Scaling Up: When I arrived in 2012, we were gearing up for some major additions to our digital collections, and focusing on comprehensive, or nearly so, digitization of entire collections. We had one project underway and another in the planning stages. Both of those collections – which ultimately comprised over 85,000 digital objects combined – were image-based photography collections. In the years that followed, we also worked with Archives in a major push to digitize their oral histories and selections from their WTMJ newsfilm collection; we added a collection of Yiddish Posters, a collection focused on Latino Activism at UWM, and a collection of Chinese scrolls from our Special Collections – all collections where we created bilingual metadata records; and we added two newspaper collections – the UWM Post and an underground newspaper from the late 1960s, the Kaleidoscope. So our collection building not only scaled up, but was wide-ranging in terms of format, subject matter, and audience.
Documentation: While we had some documentation for our metadata creation, it was collection-specific and didn’t take into account the myriad kinds of collections we were creating.
Our best documentation was focused on creating geographic subject headings for the images we were digitizing from our American Geographical Society Library. This was necessary given the nature of those images, as well as the difficulty of locating and assigning accurate geographic headings. But other than that, we really were just sticking to a set of controlled vocabularies for subject headings – primarily the LOC’s Thesaurus for Graphic Materials – and reusing basic metadata templates to ensure some consistency in the way we described the original repository and location. But things like date, format, type, etc. were really all over the place across collections, even if they were consistent within collections.
Discovery Layers: One major driver toward creating more consistent documentation, and harmonizing our metadata across collections, is the proliferation of alternative discovery layers for our digital collections. For years we’ve been contributing collections to Recollection Wisconsin, which has been pretty forgiving in terms of metadata consistency. Two more recent developments have prompted us to examine our metadata practices more closely: the adoption of Ex Libris’ Primo/Alma for our ILS, with the opportunity to make our digital collections discoverable in the Primo interface, and the addition of our materials to the Digital Public Library of America. Both put new weight on fields that are useful for faceting, like Type.
Organizational Structure: Finally, as our collections grew, so did our department’s goals and mission. We updated our landing pages and we’re creating additional context for our collections, in partnership with Archives, AGSL, and Special Collections. We have begun to emphasize our external outreach as well, working through our Digital Humanities Lab to integrate use of our digital collections into the classroom and to discover new ways to use those materials for research. And we’ve also started a project simply to clean up the files created over the past fifteen years in order to better organize our documentation for the purposes of digital preservation as well as to set us up for more efficient project planning and training in the future.
Platforms: We’re also keeping an eye on developments with regard to digital asset management systems, data models, linked data, image exchange protocols, and data exchange protocols, among other things. We’re currently using CONTENTdm to host our digital collections and expect to for the near-term, but we are invested in following development of the Hyku platform – an open-source platform based on a Fedora/Hydra stack that is being developed as an out-of-the-box system by DPLA, Stanford, and DuraSpace.
Problem(s) (Ann): So that’s where we’re coming from. Conceptually, I was really focused on three issues that needed attention in order for us to continue working at scale with our digital collections and to make sure we were in shape for possible future migrations and to accommodate new avenues for discovery, not to mention new staff, new students, and other unexpected twists and turns. So my concerns were (and are):
Working with a lot of legacy metadata: fifteen years of metadata and metadata experiments, all created to serve the purposes of each collection without a consistent focus on the collections as a whole, or on how that metadata functioned in external environments
Ensuring we have consistent workflows for digital collection building and for developing metadata
And updating – or in some cases, creating – adequate documentation to ensure consistency not just within collections, but across collections; and to help with training, too.
Before starting, I wanted to get a sense of the metadata across all of the collections. I wanted to answer three basic questions:
What Dublin Core fields did we use and how often?
What field names did we use and how often?
What field names were mapped to what Dublin Core elements?
Because CONTENTdm is structured as multiple collections, it doesn't offer a very good way of answering these questions within its client. Each collection is essentially siloed from the others, so cross-collection analysis isn't really an option. Instead I had to extract the data and do a little bit of massaging to get some answers.
I built a spreadsheet with columns for collection name, field name, and Dublin Core mapping. Then I created several pivot tables to get a sense of what I was working with. I was able to determine what DC elements were used in what collections. This helped in figuring out which fields were important (for instance 'Contributors') and which were probably unimportant (for instance 'Audience').
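The pivot-table analysis described above can be sketched in plain Python. This is only an illustration of the approach, not our actual script; the collection and field names here are made up, and the real work was done in a spreadsheet.

```python
from collections import defaultdict

# Hypothetical rows from the field-mapping spreadsheet:
# (collection name, local field name, Dublin Core mapping).
rows = [
    ("Collection A", "Title",       "Title"),
    ("Collection A", "Type (DCMI)", "Type"),
    ("Collection B", "Title",       "Title"),
    ("Collection B", "Image Type",  "Type"),
    ("Collection B", "Audience",    "Audience"),
]

# Pivot 1: which collections use each DC element?
element_use = defaultdict(set)
# Pivot 2: which local field names were mapped to each DC element?
element_labels = defaultdict(set)

for collection, field_name, dc_element in rows:
    element_use[dc_element].add(collection)
    element_labels[dc_element].add(field_name)

for element in sorted(element_labels):
    print(f"{element}: used in {len(element_use[element])} collection(s) "
          f"as {sorted(element_labels[element])}")
```

A report like this surfaces exactly the two findings mentioned above: elements used in nearly every collection (important) versus elements used once or twice (probably unimportant), and DC elements that hide several different local field labels.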
We also decided to come up with some broad categories of the types of materials that a collection might have to help us organize how we treated those different material types.
What I found while looking through our past metadata was a lot of confusion between Dublin Core Type, Format, and Medium. We decided to ensure that there would be more consistency in how these elements were used and what vocabulary was used within them.
As a result I focused on organizing a lot of the subsequent documentation around DCMI Type, though we had to augment those categories with some RDA content types. We’ll get a little more into how this played out when I talk about documentation.
We wanted to stick to external guidelines as much as possible, so that we could be on the same page as other institutions.
DPLA (Recollection Wisconsin, Madison pipeline). The Recollection Wisconsin pipeline has certain requirements so that it can consistently transform data to adhere to DPLA requirements.
RDA. Looking towards RDA for guidelines for the data allows us to more closely adhere to how our physical material is being cataloged.
Dublin Core. We looked at Dublin Core’s documentation to make sure that we were meeting their expectations for what the different fields were designed for.
ContentDM. Of course, ContentDM itself creates restrictions on how we can adhere to all of the above guidelines. For instance, we can’t create particularly strong relationships within the data, especially across collections with ContentDM’s architecture. And, of course, it doesn’t allow for any linked open data content.
*Click*
That being said, we did try as hard as possible to set up an environment that would make a transition to a linked open data environment as painless as possible. We sought vocabularies that were published with LOD in mind, and kept track of the URIs for the terms we used.
Problems with maps in Primo: One of the biggest impacts that we could identify was collocating our digital maps and our physical maps in Primo, our discovery layer. In order to do that we needed a consistent way to identify maps in CONTENTdm and make sure that Primo knew about it so it could facet that material in the same way that it faceted physical maps. This was one of the big reasons that we refined DCMI types to RDA content types in some cases.
Inconsistent Type in CONTENTdm: And as mentioned earlier, we noticed a lot of inconsistency in how Type and Medium were being used. Since resolving the map issue required dealing with Type, we decided that focusing on this inconsistency was probably the most important.
For our documentation I kind of came up with a three tiered approach: General metadata guidelines about what Dublin Core terms were required, more specific requirements by type of material, and then specific application profiles for each collection. Within there, we realized that there could be a broader category before specific collections if we knew that several collections were going to have similar elements: for instance, oral histories might have generalized guidelines.
Core metadata fields.
*Open up Core Metadata document and navigate through it*
The core metadata fields document lists required fields, required-if-applicable fields, optional fields, and fields that we shouldn’t use (along with alternatives to those fields).
Required fields broken down by type.
*Open Type Document*
The required-fields-by-type document was then created so that, when you are assessing the metadata needs of a new collection, you can identify the different types of material and then create an application profile that uses the required fields for each type of material in the collection.
For instance, if we had a collection with Still Images and Text in it we’d need
Date field
Identifier field
Rights field
At least one Subject field
Title field
Type field labeled Type (DCMI), with either Still Image or Text in its contents
Description field with ‘Color/B&W’ as its label
And a Description field with ‘Extent’ as its label.
Note that these are only the required fields. So depending on the content, we might want a Creator field, a Medium field, or a Language field.
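The required-fields check for a Still Image/Text collection could be sketched like this. The field labels follow the list above; the rules themselves are an illustrative simplification of the actual documentation.

```python
# Required fields for a hypothetical Still Image + Text collection,
# following the example list above. Labels and rules are illustrative.
REQUIRED_FIELDS = [
    "Date", "Identifier", "Rights", "Subject", "Title",
    "Type (DCMI)", "Color/B&W", "Extent",
]
ALLOWED_TYPES = {"Still Image", "Text"}

def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

def valid_type(record):
    """Type (DCMI) must hold one of the allowed DCMI Type terms."""
    return record.get("Type (DCMI)") in ALLOWED_TYPES

# A sample record (values are made up):
record = {
    "Date": "1968", "Identifier": "uwm-0001", "Rights": "In Copyright",
    "Subject": "Student protests", "Title": "Rally on campus",
    "Type (DCMI)": "Still Image", "Color/B&W": "B&W", "Extent": "1 photograph",
}
print(missing_fields(record))  # an empty list means the record passes
print(valid_type(record))
```

Optional fields like Creator, Medium, or Language would simply be absent from the required list, so their presence or absence never fails the check.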
Before we started the actual remediation of data, we first needed to make sure that the fields in the different collections adhered to our guidelines. So Ann and I went through each collection’s field mappings in the Administration module and changed things around. We needed to be cognizant, of course, of what we were changing and whether that data needed to be remediated by a cataloger. So this field remediation served not just as an initial step in remediating the collections, but also as triage for how to start remediating them.
Assigning catalogers: One of the great things about this project has been the chance to really work across departments – in this case, training the original catalogers to work with our digital collections. We divided up the collections according to size so that each cataloger got approximately the same number of records to work with, though some collections are more homogeneous than other collections, so workloads inevitably varied despite our best efforts.
Example (working within ContentDM client): Everyone needed to work with the CONTENTdm desktop client to do batch updates, or item-level updates. It’s an offline client so they could make changes and Nathan and I would approve the changes before they went live.
Easy stuff included assigning a Type DCMI of Still Image to photographs – we have loads of them and it’s a pretty non-controversial designation
Hard (granular vocabulary, compound objects): Harder stuff included assigning different types to multiple parts of a compound object; and assigning format descriptions could become more complicated than our vocabulary list indicated. But it was a group effort and we tried to avoid going too far down any rabbit hole.
Type, Medium, URIs, etc.: Vocabulary lists were important, as there are many legitimate sources of controlled vocabulary, particularly for the different kinds of media we hold. For consistency we needed to agree on which term we preferred and why. That list is organized around DCMI Type to keep terms grouped logically. We initially built out the terms that we knew dominate the collections, things like “nitrate negatives” and “manuscripts”. Where we had far more variation – or growth of variety – was in the Type: Genre field.
Because of our running Vocabulary list, and the documentation we created, quality control was fairly straightforward.
First we went through the field mapping to make sure that it conformed to what we had created in the documentation.
Then we ensured that the vocabulary conformed to the vocabulary list we had created. That was pretty easy: since we had maintained a list as we went, we could create a controlled vocabulary list in CONTENTdm and check whether the field conformed to that list in each collection. It often didn’t, but that’s why we did quality control.
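The conformance pass amounts to a set difference between the values found in a field and the running vocabulary list. A minimal sketch, with made-up vocabulary terms and field values:

```python
# Illustrative slice of the running vocabulary list (organized under
# DCMI Type in the real documentation); terms here are examples only.
VOCABULARY = {"nitrate negatives", "manuscripts", "gelatin silver prints"}

def nonconforming(values, vocabulary=VOCABULARY):
    """Return the distinct field values that are not on the list."""
    return sorted(set(values) - vocabulary)

# Medium values as they might appear in one collection:
medium_values = ["nitrate negatives", "Nitrate Negative", "manuscripts",
                 "silver gelatin print"]
print(nonconforming(medium_values))
# -> ['Nitrate Negative', 'silver gelatin print']
```

Anything in the output is either a typo or capitalization variant to remediate, or a candidate term to discuss and add to the list.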
Further Remediation: So we still have more remediation to do, of course. Having finished up Medium and Type is great – and gets us pretty far, really, as these are fields that need consistency across collections and consistency with external collections as well.
Date is our next frontier. Because we’re working primarily with archival collections, we have lots of items that have unknown dates, circa dates, and date ranges. Dates have been entered inconsistently over the last fifteen years and it’s another important field for sorting, faceting, discovery, and administration. We’re struggling here with some of the limitations of CONTENTdm and with the proper ISO form for date – Nathan can talk a bit more about that, but it’s the next step in our actual remediation efforts.
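To give a flavor of what that date work involves, here is a rough sketch of normalizing a few common legacy patterns toward EDTF (ISO 8601-2) forms. The patterns and target forms are simplified assumptions for illustration, not our finished rules.

```python
import re

def normalize_date(raw):
    """Map a few common legacy date strings to EDTF-style forms.

    Illustrative only: real legacy data has far more variants,
    and unmatched strings should be flagged for human review.
    """
    s = raw.strip()
    # "circa 1920" / "ca. 1920" -> approximate year ("1920~" in EDTF)
    m = re.fullmatch(r"(?:circa|ca\.?)\s*(\d{4})", s, re.IGNORECASE)
    if m:
        return m.group(1) + "~"
    # "1920-1929" -> date interval ("1920/1929")
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", s)
    if m:
        return f"{m.group(1)}/{m.group(2)}"
    # unknown/undated -> unspecified year
    if s.lower() in {"unknown", "n.d.", "undated", ""}:
        return "XXXX"
    return s  # already usable, or left as-is for review

for raw in ["circa 1920", "1920-1929", "n.d.", "1968"]:
    print(raw, "->", normalize_date(raw))
```

Even a small script like this runs into the CONTENTdm limitations mentioned above, since the system itself doesn't validate or interpret EDTF values.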
Application profiles for more specific genre types – such as oral histories, newspapers, photograph collections, etc – are another step that we’re eager to implement. So this will be another frontier in our ongoing quest to create useful and necessary documentation based on what we’ve been able to do in terms of remediation and harmonization.
We’ve had the good fortune to have some very talented students working for us over the past few years, and one of them, Charles Hosale – who has since moved on to a position at WGBH – created an application profile template for oral history collections that we will adapt for other genres. We think genre is the most logical way to create application profiles, as most of the really significant differences in the way we structure and describe an object are based on its genre type. This is an example of the application profile Charles created…
And a little closer look to see that what we’re doing is indicating a field type, what it means, whether it’s required and repeatable and where to find the values that are allowed in this field, as well as where to map it for Dublin Core. So making our collections as predictable and consistent as possible.
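One row of a profile like that could be represented as a simple data structure carrying the attributes just listed. This is a hypothetical rendering, not Charles’s actual template; the field values are illustrative.

```python
# One illustrative application-profile entry: field label, definition,
# obligation, repeatability, source of allowed values, and DC mapping.
profile_entry = {
    "field_label": "Type (DCMI)",
    "definition": "The nature or genre of the resource.",
    "required": True,
    "repeatable": True,
    "values_from": "DCMI Type Vocabulary (local preferred-term list)",
    "dc_mapping": "Type",
}

def describe(entry):
    """Render one profile row as a human-readable summary line."""
    req = "required" if entry["required"] else "optional"
    rep = "repeatable" if entry["repeatable"] else "not repeatable"
    return (f'{entry["field_label"]} ({req}, {rep}) -> '
            f'DC {entry["dc_mapping"]}; values from {entry["values_from"]}')

print(describe(profile_entry))
```

Structuring profiles as data rather than prose is what makes the collections predictable: the same entries can drive validation, documentation, and training.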
Final thoughts: The core metadata project is ongoing. We’ve really only kicked it off, though with a lot of thought and systematic identification of the most impactful categories of data to remediate first – and that’s why we’ve focused on Type and Format initially. The focus has been on creating consistency, coherence, and conformance to standards, and applying that to fifteen years’ worth of collections. We’re looking for two main outcomes: improved metadata for our digital collections that, especially, makes them function well not only in their native system, but in other portals and platforms, and alongside other materials as well; and better, more complete documentation to ensure consistency going forward.
Questions?