7. QUALITY IS CONTEXTUAL
What is the “context” of aggregation? Specifically, DPLA’s aggregation…
• Heterogeneous
• Basic metadata
• Reliance on metadata vs. text
• Reliance on item-level metadata
8. DATA ISSUES IN DPLA
Content Issues
• Meaningless values
• Missing values
• Confusing values
• Incomplete values
Technical Issues
• Granularity
• Inappropriate values
• Lack of normalization
• Noisy data
• Lack of standards
10. DPLA & DATA QUALITY
[Pyramid diagram, from base to top:]
• Technical problems: required properties are present and semantically correct; data adheres to standards; all data is normalized in terms of punctuation, presence of noise, etc.
• Content problems: required properties have meaningful values; descriptive fields are present and have meaningful values
• Content quality: data is robust
15. EUROPEANA DQC
Data Quality Committee (DQC) formed within Europeana
• Reviewing mandatory elements
• Data checking and normalization
• Evaluation of meaningful metadata values
• Quality of content
• Coordination with other quality-related initiatives
I’m going to wrap things up by taking a look at quality from the perspective of DPLA, from the perspective of an aggregator.
I was recently at an unconference – the kind where you get together in the morning and decide what you want to talk about. Several colleagues and I proposed a session on data quality and one of them, I believe it was Mike Giarlo, proposed the following for a title:
If we all use standards, why is the data so crap in the end?
But I think that isn’t quite accurate. In fact, I think the data isn’t crap in its proper context.
Because in fact, quality is contextual
Let me show you what I mean
Here is a record in DPLA. It happens to come from UNC Chapel Hill, although I don’t mean to single them out. This record has a kind of peculiar and generic title, no date, minimal description, no subjects.
By our metrics of quality at DPLA, this record isn’t so hot.
However, in its original context, this record benefits from the fact that it is part of a finding aid. It isn’t really meant to be thought of as a single record, or at least it wasn’t created that way.
The finding aid has lots of information about this and all of the other images as a group.
But the finding aid works in its own context, when it is viewed as a description of an entire collection. When you take the record out of that context and put it in DPLA it suffers.
This is a particular kind of context-related issue, but it isn’t the only one. It’s the same for local subject headings, very granular standards, very discipline-specific standards. For example, a record for a film might have specific roles for contributors: directors, producers, costumers, actors. But in the aggregation context that nuance gets flattened down to contributor.
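To make that flattening concrete, here is a minimal sketch; the role fields and the mapping are hypothetical, not DPLA’s actual crosswalk:

    # Hypothetical example: granular film-credit roles collapsing into a
    # single generic "contributor" property during aggregation.
    source_record = {
        "director": ["Orson Welles"],
        "producer": ["Orson Welles"],
        "cinematographer": ["Gregg Toland"],
    }

    # Every role-specific field maps to the same target property, so the
    # distinction between roles is lost in the aggregated record.
    aggregated_record = {
        "contributor": [name for names in source_record.values() for name in names]
    }

    print(aggregated_record["contributor"])
    # ['Orson Welles', 'Orson Welles', 'Gregg Toland']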
If we want to improve records in DPLA, I think we need to acknowledge that it is a different context and talk about what that means. We shouldn’t imply that quality is a single standard that will fit all, but that there is a definable standard for what quality is within the DPLA context.
When partners want to have their records work well in the DPLA context, they need to know specifically what those aggregation-context quality characteristics are, as opposed to what may be a characteristic of quality at their home institution.
So if we were to start to define the aggregation context, we would say, first of all, that this is a very heterogeneous environment. Not every aggregation is heterogeneous, but DPLA is. We do have some areas we don’t collect, like scholarly journals or finding aids, but generally speaking we take in a lot of diverse content.
We also rely on fairly basic metadata. It’s probably most closely aligned with the Dublin Core terms namespace, or qualified Dublin Core, but it was not developed to capture nuanced, domain-specific information. To work well in this context, data has to survive that simplification. If a record for a biology monograph has several hundred taxonomic names indexed to it, and those can’t be mapped to something like subject, then that data is lost in the aggregation.
At DPLA we also index only metadata, not full text. In other contexts, full text may be relied on far more heavily.
Finally, as I demonstrated earlier, at DPLA we rely on item-level metadata, not contextual, collection-level metadata (although if you attend the Archival Description Working Group session, you’ll see that we might be developing some recommendations to improve that)
So when data is unsuited to the DPLA context, generally, this leads to two different kinds of problems…
The first are what I’m calling technical issues. These have to do not with the content, or the values, of the metadata being problematic, but with problems in how the metadata is implemented.
Granularity we’ve already discussed.
I’m listing “inappropriate values” as a technical concern because by this I mean using the wrong metadata property for something. For example, using a date field for a digitization date, when we interpret that as a creation date.
Lack of normalization concerns inconsistent use of metadata and vocabulary across the sets. Using two different data formats within the same set, for example, is a lack of normalization.
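As an illustration, here is a minimal sketch that surfaces mixed date formats within a set; the values and format signatures are assumptions for the example:

    import re
    from collections import Counter

    # Classify each date string by a rough format signature; a set that
    # mixes several signatures is a candidate for normalization work.
    def date_signature(value):
        if re.fullmatch(r"\d{4}", value):
            return "YYYY"
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            return "YYYY-MM-DD"
        if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", value):
            return "M/D/YYYY"
        return "other"

    dates = ["1923", "1924-05-01", "5/1/1924", "circa 1920"]
    print(Counter(date_signature(d) for d in dates))
    # Counter({'YYYY': 1, 'YYYY-MM-DD': 1, 'M/D/YYYY': 1, 'other': 1})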
Noise is basically meaningless data. This is a term coined by Diane Hillmann, Naomi Dushay, and Jon Phipps to describe values in the National Science Digital Library aggregation that were blank, carried phrases like “unknown” or “n/a”, or were just punctuation, say a double dash. These values, again, may provide some value in their context – maybe all blank values contain a double dash and that means something – but in the aggregation it was noise.
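A rough sketch of the kind of check this implies; the stoplist here is illustrative, not the NSDL team’s actual list:

    # Flag values that carry no information in the aggregation context.
    NOISE_VALUES = {"", "unknown", "n/a", "none", "--", "-"}

    def is_noise(value):
        return value.strip().lower() in NOISE_VALUES

    for v in ["Portrait of a woman", "n/a", "--", "   "]:
        print(repr(v), "->", "noise" if is_noise(v) else "ok")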
Finally, data that does not adhere to standards, in terms of both the metadata structure and the content standard, is a kind of technical issue. The data may be correct, but it is not consistent, and may even be unusable.
Content issues, on the other hand, do focus on data values. Several of these should also be credited to Hillmann, Dushay, and Phipps and their work on NSDL.
Meaningless values is something that I’ve added. This is information that doesn’t really add to the overall value of the record, such as a repetition of the provider’s name in a description field, or simply incorrect or vague information.
Missing values are a pretty obvious problem.
Confusing values often happen as a result of losing granularity from the original record. So if both the date of digitization and the date of creation are collapsed into the date field together, that information conflicts and is confusing.
Finally, incomplete values occur when records get some minimal description, but the data isn’t robust enough to really provide accurate description. This is probably the area of quality that most of us are familiar with and actually think about, because it is the most obvious.
The kinds of issues that we are surfacing in this brief introduction to the context of quality in aggregation are really related to work done a decade ago on Shareable Metadata, a term coined by Sarah Shreeves, Jenn Riley, and Liz Milewicz. They predicted pretty much all of the issues we see in aggregated data and proposed areas in which we could standardize… The authors proposed six “C”s of quality in shareable records:
Content: what information would it take to make this content understandable to anyone? Will someone from another country understand that this picture of TR, for example, is Teddy Roosevelt, or will they not recognize that acronym?
Consistency: refers to consistent use of metadata elements so that you don’t have things tagged ambiguously. For example, if you incorrectly use subject as a placeholder for publisher names, that won’t be consistent with the standard usage of that element.
Coherence: means that records are self-explanatory and complete
Context: means that any information about context that is needed to make the record understandable outside of its original collection is explicitly included in the metadata
Communication: relates to the actual interaction between those who own and organize a collection and the organization they are sharing data with. You won’t always be able to control this element, but when you can, you should include all the relevant information like the schema and vocabularies used, when the data was last updated, etc.
Finally, conformance to standards is key for sharing. Creating your own local standard may really suit your needs, but if no one else understands it, it won’t be useful.
I like to think of quality in our records as kind of a pyramid of needs. At the base are the more technical problems, like whether or not values are present and well formed. If those are taken care of, we can move on to work on content problems. Are the values in these fields meaningful? Finally, we can work on actually enhancing and improving records, based on the kinds of things Corey is analyzing: what are users really looking for and are we supporting those needs?
A big question remains though…How do we do this?
At DPLA we have a few simple processes that I’d like to go over with you
The first step in our QA process is an initial review of data in a feed we want to harvest. I typically use a couple of different strategies to try and get a good look at the data, from the basic view of the data feed in a browser (this is OAI, which, at the moment, is the easiest method for me to review data), to actually harvesting the data using Python scripts and analyzing it in OpenRefine.
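As a sketch of that first step, this is roughly how one might pull a page of records from an OAI-PMH feed into a CSV for OpenRefine; the endpoint URL is a placeholder, and a real harvest would also follow resumption tokens:

    import csv
    import requests
    import xml.etree.ElementTree as ET

    OAI_URL = "https://example.org/oai"  # placeholder endpoint
    NS = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    resp = requests.get(OAI_URL, params={"verb": "ListRecords",
                                         "metadataPrefix": "oai_dc"})
    root = ET.fromstring(resp.content)

    # Flatten each record to a row so the output loads cleanly in OpenRefine.
    with open("harvest.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["identifier", "title", "date", "subject"])
        for rec in root.iterfind(".//oai:record", NS):
            identifier = rec.find("oai:header/oai:identifier", NS)
            dc = rec.find("oai:metadata/oai_dc:dc", NS)
            if dc is None:  # deleted records carry no metadata block
                continue
            row = [identifier.text]
            for tag in ("title", "date", "subject"):
                row.append("; ".join(e.text or "" for e in dc.findall("dc:" + tag, NS)))
            writer.writerow(row)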
I have a specific series of issues that I look for, from the often problematic like geographic terms, to the required like links to the original item.
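A minimal sketch of such a checklist applied to a harvested record; the field names and rules are illustrative, and the real review covers more than this:

    # Per-record checks: a required link back to the original item,
    # a non-empty title, and a crude heuristic for packed geographic values.
    def check_record(record):
        problems = []
        if not record.get("isShownAt", "").startswith("http"):
            problems.append("missing or malformed link to the original item")
        if not record.get("title", "").strip():
            problems.append("missing title")
        if ";" in record.get("spatial", ""):
            problems.append("geographic field may hold multiple packed values")
        return problems

    print(check_record({"title": "Letter, 1923", "isShownAt": "",
                        "spatial": "Durham; North Carolina; United States"}))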
Once the data at the source appears to be in good shape, we harvest and map the records. We have set up an instance of Blacklight for doing further QA on records. This modified version of the tool allows me to see the original record side by side with the transformed one for specific record analysis, but I do have the Blacklight features of search and faceting to help with overall review.
We also created a limited number of reports, of two types. The validation reports check for some of the things we require or recommend; the results of the "report" are really just search results. Right now it is showing more than 7,000 records without a "type", but that's actually okay. That's not a required field, it's just one I like to check on.
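Since Blacklight sits on top of a Solr index, a validation report of this sort can be as simple as a saved query. A sketch, assuming a local Solr endpoint and an index field actually named "type":

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/qa/select"  # hypothetical core

    # Standard Solr syntax: negate a match-anything range query to find
    # documents where the "type" field is absent.
    params = {"q": "-type:[* TO *]", "rows": 0, "wt": "json"}
    count = requests.get(SOLR_SELECT, params=params).json()["response"]["numFound"]
    print(count, "records without a 'type' value")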
The field value reports are downloadable CSV reports. They list the DPLA id, the isShownAt URL, and all the values for whatever field is in question. These are all "providedLabel" reports, so they show you your original values, not a value that might be the result of our enrichment.
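A sketch of how such a report might be generated; the in-memory record shape here is an assumption, since the real reports are built from the mapped records in our repository:

    import csv

    # Assumed shape for a mapped record; "providedLabel" holds the
    # partner's original value, before any enrichment.
    records = [
        {"id": "abc123", "isShownAt": "https://example.org/item/1",
         "subject": [{"providedLabel": "Quilts"}, {"providedLabel": "Folk art"}]},
    ]

    with open("subject_report.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["dpla_id", "isShownAt", "providedLabel"])
        for rec in records:
            for value in rec.get("subject", []):
                writer.writerow([rec["id"], rec["isShownAt"], value["providedLabel"]])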
And that is pretty much it. There is a lot more I would like to do. These methods mostly just allow me to verify some of those base level concerns of completeness and normalization, but the more difficult technical issues like standards adherence, or the content issues are things that I really can’t evaluate.
I think a lot of you are in much the same boat. And the reason I wanted us to get together today to talk about this is because
We need more tools and standards for data quality and we need better tools and standards for data quality.
…but as with the question I started with, most of us are using standards…the question becomes more of what standard can we agree on for this particular context of aggregation?
The Europeana community is beginning to work on defining more and better data standards for their aggregation through a relatively newly formed Data Quality Committee, which I have had the privilege to be a part of. Their remit is to review the following:
Mandatory metadata elements for ingestion of data adhering to the Europeana Data Model or EDM
The Committee is investigating whether the current mandatory elements for EDM are relevant and sufficient. It is also proposing methods to make legacy data compliant with the agreed list of mandatory elements.
The work will include recommendations on measures of completeness for descriptive metadata (i.e., not content) based on the presence or absence of fields, not their values (which is the topic of ‘meaningful metadata values’ below).
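A presence/absence completeness measure can be very simple; here is a sketch with an illustrative field list (the DQC’s actual measures are still being defined):

    # Completeness as the fraction of a chosen field list that is present
    # and non-empty; the field list is an example, not an EDM requirement.
    DESCRIPTIVE_FIELDS = ["title", "description", "subject", "date", "creator"]

    def completeness(record):
        present = sum(1 for field in DESCRIPTIVE_FIELDS if record.get(field))
        return present / len(DESCRIPTIVE_FIELDS)

    print(completeness({"title": "Quilt", "subject": ["Folk art"]}))  # 0.4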
Data checking and normalization
The Committee is also looking into ways and rules to normalize metadata. This includes the use of vocabulary-based values or normalized values.
They are also making recommendations for tools and services to validate or detect anomalies in EDM
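To give a flavor of the kind of anomaly detection meant here, a sketch that flags date values with implausible years; the cutoffs are arbitrary for the example:

    import re

    # Flag date values whose four-digit year falls outside a plausible
    # range, or that contain no recognizable year at all.
    def anomalous_year(value, low=1000, high=2025):
        match = re.search(r"\b(\d{4})\b", value)
        if not match:
            return True
        return not (low <= int(match.group(1)) <= high)

    for v in ["1923-05-01", "9999", "undated"]:
        print(v, "->", "anomalous" if anomalous_year(v) else "ok")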
Meaningful metadata values (in the context of use)
The Committee is looking into ways of recommending meaningful metadata values (where 'meaningful' needs to be defined in the context of use) and indicators to measure improvements.
This work includes measures for information value of statements (informativeness, degree of multilinguality…)
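One simple indicator of the sort described, sketched here over an assumed JSON-LD-like record shape, is a count of distinct language tags on a record’s literals:

    # Degree of multilinguality as the number of distinct language tags;
    # the record structure below is an assumption for the example.
    record = {
        "title": [{"@value": "Carte de Paris", "@language": "fr"},
                  {"@value": "Map of Paris", "@language": "en"}],
        "description": [{"@value": "A 1789 street map.", "@language": "en"}],
    }

    languages = {literal["@language"]
                 for values in record.values()
                 for literal in values
                 if literal.get("@language")}
    print("distinct languages:", len(languages))  # distinct languages: 2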
The last two areas of the remit are less focused on metadata itself:
Quality of the content (digital media) itself
Coordination with other quality-related initiatives
At DPLA we have had a lot of internal conversations about quality. I’ve had conversations with a lot of you about the quality of your own data. But it isn’t enough to have these ad hoc conversations.
We need more data quality tools and standards in our network.
We need better definitions of what quality really means and how to achieve it.
So this is why we put this session together today: Let’s talk about it. I want to hear from you about your challenges, and your needs related to data quality and I want us to start working on solutions together.