DATA QUALITY AT THE SCALE OF AGGREGATION
IF WE ALL USE STANDARDS, WHY IS THE DATA SO CRAP IN THE END?
QUALITY IS CONTEXTUAL
QUALITY IS CONTEXTUAL
What is the “context” of aggregation? Specifically, DPLA’s aggregation…
• Heterogeneous
• Basic metadata
• Reliance on metadata vs. text
• Reliance on item-level metadata
DATA ISSUES IN DPLA
Content Issues
• Meaningless values
• Missing values
• Confusing values
• Incomplete values
Technical Issues
• Granularity
• Inappropriate values
• Lack of normalization
• Noisy data
• Lack of standards
SHARING METADATA
Content
Consistency
Coherence
Context
Communication
Conformance to standards
…but which “standard”
DPLA & DATA QUALITY
• Data is robust
• Descriptive fields are present and have meaningful values
• Required properties have meaningful values
• Data adheres to standards
• All data is normalized in terms of punctuation, presence of noise, etc.
• Required properties are present and semantically correct
Pyramid levels (base to top): Technical problems, Content problems, Content quality
DPLA DATA QUALITY WORKFLOW
Initial Analysis
QA in Blacklight
Visual review in test portal site
WE NEED MORE.
WE NEED BETTER.
EUROPEANA DQC
Data Quality Committee (DQC) formed within Europeana
• Reviewing mandatory elements
• Data checking and normalization
• Evaluation of meaningful metadata values
• Quality of content
• Coordination with other quality-related initiatives
DPLA QUALITY INITIATIVES
WE NEED MORE.
WE NEED BETTER.
LET’S TALK.

Data Quality at the Scale of Aggregation

Editor's Notes

  1. I’m going to wrap things up by taking a look at quality from the perspective of DPLA, from the perspective of an aggregator. I was recently at an unconference, the kind where you get together in the morning and decide what you want to talk about. Several colleagues and I proposed a session on data quality and one of them, I believe it was Mike Giarlo, proposed the following for a title:
  2. If we all use standards, why is the data so crap in the end? But I think that isn’t quite accurate. In fact, I think the data isn’t crap in its proper context.
  3. Because, in fact, quality is contextual. Let me show you what I mean.
  4. Here is a record in DPLA. It happens to come from UNC Chapel Hill, although I don’t mean to single them out. This record has a kind of peculiar and generic title, no date, minimal description, no subjects. By our metrics of quality at DPLA, this record isn’t so hot.
  5. However, in its original context, this record benefits from the fact that it is part of a finding aid. It isn’t really meant to be thought of as a single record, or at least it wasn’t created that way.
  6. The finding aid has lots of information about this and all of the other images as a group. But the finding aid works in its own context, when it is viewed as a description of an entire collection. When you take the record out of that context and put it in DPLA, it suffers. This is a particular kind of context-related issue, but it isn’t the only one. It’s the same for local subject headings, very granular standards, and very discipline-specific standards. For example, a record for a film might have specific roles for contributors: directors, producers, costumers, actors. But in the aggregation context that nuance gets flattened down to “contributor.”
  7. If we want to improve records in DPLA, I think we need to acknowledge that it is a different context and talk about what that means. We shouldn’t imply that quality is a single standard that will fit all, but that there is a definable standard for what quality is within the DPLA context. When partners want to have their records work well in the DPLA context, they need to know specifically what those aggregation-context quality characteristics are, as opposed to what may be a characteristic of quality at their home institution. So if we were to start to define the aggregation context, we would say, first of all, that this is a very heterogeneous environment. Not every aggregation is heterogeneous, but DPLA is. We do have some areas we don’t collect, like scholarly journals or finding aids, but generally speaking we take in a lot of diverse content. We also rely on fairly basic metadata. It’s probably most closely aligned with the Dublin Core terms namespace, or qualified Dublin Core, but it was not developed to capture nuanced, domain-specific information. To work well in this context, data has to work well within the simplified context. If a record for a biology monograph has several hundred taxonomic names indexed to it, and those can’t be mapped to something like subject, well then that data is lost in the aggregation. At DPLA we also index only metadata, not full text. In other contexts full text may be relied on far more heavily. Finally, as I demonstrated earlier, at DPLA we rely on item-level metadata, not contextual, collection-level metadata (although if you attend the Archival Description Working Group session, you’ll see that we might be developing some recommendations to improve that).
  8. So when data is unsuited to the DPLA context, generally, this leads to two different kinds of problems… The first are what I’m calling technical issues. These have to do not with the content, or the values, of the metadata being problematic, but with the implementation having issues. Granularity we’ve already discussed. I’m listing “inappropriate values” as a technical concern because by this I mean using the wrong metadata property for something. For example, using a date field for a digitization date, when we interpret that as a creation date. Lack of normalization concerns inconsistent use of metadata and vocabulary across the sets. Using two different data formats within the same set, for example, is a lack of normalization. Noise is basically meaningless data. This is a term coined by Diane Hillmann, Naomi Dushay, and Jon Phipps to describe values in the National Science Digital Library aggregation that were blank, or carried phrases like “unknown” or “n/a,” or were just punctuation (dash dash, say). These values, again, may provide some value in their context (maybe all blank values contain dash dash and that means something), but in the aggregation it was noise. Finally, data that does not adhere to standards, both in terms of the metadata structure and the content standard, is a kind of technical issue. The data may be correct, but it is not consistent or may even be unusable. Content issues, on the other hand, do focus on data values. Several of these should also be credited to Hillmann, Dushay, and Phipps and their work on NSDL. Meaningless values is something that I’ve added. This is information that does not really add to the overall value of the record, such as a repetition of the provider’s name in a description field, or simply incorrect or vague information. Missing values are a pretty obvious problem. Confusing values often happen as a result of losing granularity from the original record. So if both the date of digitization and the date of creation are collapsed into the date field together, that information conflicts and is confusing. Finally, incomplete values occur when records get some minimal description, but the data isn’t robust enough to really provide an accurate description. This is probably the area of quality that most of us are familiar with and actually think about, because it is the most obvious.
  9. The kinds of issues that we are surfacing in this brief introduction to the context of quality in aggregation are really related to work done a decade ago on Shareable Metadata, a term coined by Sarah Shreeves, Jenn Riley, and Liz Milewicz. They predicted pretty much all of the issues we see in aggregated data and proposed areas in which we could standardize… The authors proposed six “C”s of quality in shareable records. Content: what information would it take to make this content understandable to anyone? Will someone from another country understand that this picture of TR, for example, is Teddy Roosevelt, or will they not recognize that acronym? Consistency: refers to consistent use of metadata elements so that you don’t have things tagged ambiguously. For example, if you incorrectly use subject as a placeholder for publisher names, that won’t be consistent with the standard usage of that element. Coherence: means that records are self-explanatory and complete. Context: means that any information about context that is needed to make the record understandable outside of its original collection is explicitly included in the metadata. Communication: relates to the actual interaction between those who own and organize a collection and the other organization they are sharing data with. You won’t always be able to control this element, but when you can, you should include all the relevant information like schema and vocabularies used, when the data was last updated, etc. Finally, conformance to standards is key for sharing. Creating your own local standard may really suit your needs, but if no one else understands it, it won’t be useful.
  10. I like to think of quality in our records as kind of a pyramid of needs. At the base are the more technical problems, like whether or not values are present and well formed. If those are taken care of, we can move on to work on content problems. Are the values in these fields meaningful? Finally, we can work on actually enhancing and improving records, based on the kinds of things Corey is analyzing: what are users really looking for and are we supporting those needs?
  11. A big question remains, though… How do we do this? At DPLA we have a few simple processes that I’d like to go over with you.
  12. The first step in our QA process is an initial review of data in a feed we want to harvest. I typically use a couple of different strategies to try and get a good look at the data, from the basic view of the data feed in a browser (this is OAI, which, at the moment, is the easiest method for me to review data), to actually harvesting the data using Python scripts and analyzing it in OpenRefine. I have a specific series of issues that I look for, from the often problematic, like geographic terms, to the required, like links to the original item.
  13. Once the data at the source appears to be in good shape, we harvest and map the records. We have set up an instance of Blacklight for doing further QA on records. This modified version of the tool allows me to see the original record side by side with the transformed one for specific record analysis, but I do have the Blacklight features of search and faceting to help with overall review. We also created a limited number of reports. The reports are of two types. The validation reports do a check for some of the things we require or recommend. The results of the "report" are really just search results. Right now it is showing more than 7,000 records without a "type", but that's actually okay. That's not a required field, it's just one I like to check on. The field value reports are downloadable CSV reports. They list the DPLA id, the isShownAt URL, and all the values for whatever the field in question is. These are all "providedLabel" reports, so they show you your original values, not a value that might be the result of our enrichment. And that is pretty much it. There is a lot more I would like to do. These methods mostly just allow me to verify some of those base-level concerns of completeness and normalization, but the more difficult technical issues, like standards adherence, and the content issues are things that I really can’t evaluate. I think a lot of you are in much the same boat. And the reason I wanted us to get together today to talk about this is because…
  14. We need more tools and standards for data quality and we need better tools and standards for data quality. …but as with the question I started with, most of us are using standards…the question becomes more of what standard can we agree on for this particular context of aggregation?
  15. The Europeana community is beginning to work on defining more and better data standards for their aggregation through a relatively newly formed Data Quality Committee, which I have had the privilege to be a part of. Their remit is to review the following. Mandatory metadata elements for ingestion of data adhering to the Europeana Data Model, or EDM: the Committee is investigating whether the current mandatory elements for EDM are relevant and sufficient. It is also proposing methods to make legacy data compliant with the agreed list of mandatory elements. The work will include recommendations on measures of completeness for descriptive metadata (i.e., not content) based on presence/absence of fields, not their values (which is the topic of ‘meaningful metadata values’ below). Data checking and normalization: the Committee is also looking into ways and rules to normalise metadata. This includes the use of vocabulary-based values or normalised values. They are also making recommendations for tools and services to validate or detect anomalies in EDM. Meaningful metadata values (in the context of use): the Committee is looking into ways of recommending meaningful metadata values (where 'meaningful' needs to be defined in the context of use) and indicators to measure improvements. This work includes measures for information value of statements (informativeness, degree of multilinguality…). The last two issues are less involved in metadata: quality of the content (digital media) itself, and coordination with other quality-related initiatives.
  16. At DPLA we have had a lot of internal conversations about quality. I’ve had conversations with a lot of you about the quality of your own data. But it isn’t enough to have these ad hoc conversations. We need more data quality tools and standards in our network. We need better definitions of what quality really means and how to achieve it. So this is why we put this session together today: Let’s talk about it. I want to hear from you about your challenges, and your needs related to data quality and I want us to start working on solutions together.
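
The noise and presence/absence checks described in notes 8, 13, and 15 might be sketched roughly as follows. This is only an illustration under stated assumptions, not DPLA's or Europeana's tooling: the record layout (plain dicts mapping a field name to a string or list of strings), the `NOISE_PHRASES` list, and the function names `is_noise` and `field_presence` are all hypothetical.

```python
import re

# Hypothetical placeholder phrases; a real list would be tuned per provider.
NOISE_PHRASES = {"unknown", "n/a", "none", "not available"}


def is_noise(value):
    """Return True if a value is blank, punctuation-only, or a placeholder phrase."""
    if value is None:
        return True
    text = value.strip()
    if not text:
        return True
    # Values such as "--" contain no word characters at all.
    if not re.search(r"\w", text):
        return True
    return text.lower() in NOISE_PHRASES


def field_presence(records, fields):
    """For each field, return the share of records with at least one non-noise value."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            values = record.get(field) or []
            if isinstance(values, str):
                values = [values]
            if any(not is_noise(v) for v in values):
                counts[field] += 1
    total = len(records)
    return {field: (counts[field] / total if total else 0.0) for field in fields}


# Made-up sample records, for illustration only.
records = [
    {"title": "Photograph", "date": "--", "subject": []},
    {"title": "Letter to the editor", "date": "1923", "subject": ["Newspapers"]},
    {"title": "Unknown", "description": "n/a"},
]
report = field_presence(records, ["title", "date", "subject"])
for field, share in report.items():
    print(f"{field}: {share:.0%} of records have a usable value")
```

A check like this only covers the base of the pyramid (presence and noise). Whether a value that passes is actually meaningful in the aggregation context still takes human review.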