SlideShare a Scribd company logo
Improving data quality at Europeana
New requirements and methods for
better measuring metadata quality
Péter Király1, Hugo Manguinhas2, Valentine Charles2, Antoine Isaac2, Timothy
Hill2
1Gesellschaft für wissenschaftliche
Datenverarbeitung mbH Göttingen
2Europeana Foundation,
The Netherlands
Improving data quality at Europeana. The data workflow
2
data transformations Europeana Data Model (EDM)
Dublin Core,
LIDO, EAD,
MARC, EDM
custom, ...
Improving data quality at Europeana. The problem
3
there are “good” and “bad” metadata records
but we don’t have clear metrics like this:
functional requirements
goodacceptablebad
Improving data quality at Europeana. Non-informative values
4
non informative dc:title:
“photograph, framed”,
“group photograph”
“photograph”
informative dc:title:
“Photograph of Sir Dugald Clerk”,
“Photograph of "Puffing Billy"”
Improving data quality at Europeana. Copy & paste cataloging
5
from a template?
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Improving data quality at Europeana. Why data quality is important?
6
“Fitness for purpose” (QA principle)
no metadata no access to data no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
Improving data quality at Europeana. Data Quality Committee
7
Improving data quality at Europeana. Hypothesis
8
by measuring structural elements we
can predict metadata record quality
≃ metadata smell
Improving data quality at Europeana. Purposes
9
▪ improve the metadata
▪ services: good data → reliable functions
▪ better metadata schema & documentation
▪ propagate “good practice”
Improving data quality at Europeana. What to measure?
10
▪ Structural and semantic features
Cardinality, uniqueness, length, dictionary entry, data type conformance,
multilinguality (schema-independent measurements)
▪ Discovery scenarios
Requirements of the most important functions
▪ Problem catalog
Known metadata problems
Improving data quality at Europeana. Discovery scenarios
11
▪ Basic retrieval with high precision and recall
▪ Cross-language recall
▪ Entity-based facets
▪ Date-based facets
▪ Improved language facets
▪ Browse by subjects and resource types
▪ Browse by agents
▪ Hierarchical search and facets
▪ ...
themostimportantfunctions
Improving data quality at Europeana. Metadata requirements
12
As a user I want to be able to filter by whether a person is the subject
of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM
fields for objects be populated with URIs rather than free text. These
URIs need to be related, at a minimum, to a label for each of the
supported languages.
Measurement rules
▪ the relevant field values should be resolvable URI
▪ each URI should be associated with labels in multiple languages
Improving data quality at Europeana. Problem catalog
13
▪ Title contents same as description contents
▪ Systematic use of the same title
▪ Bad string: “empty” (and variants)
▪ Shelfmarks and other identifiers in fields
▪ Creator not an agent name
▪ Absurd geographical location
▪ Subject field used as description field
▪ Unicode U+FFFD (�)
▪ Very short description field
▪ ...
“metadataanti-patterns”
Improving data quality at Europeana. Problem definition
14
Description Title contents same as description
contents
Example
/2023702/35D943DF60D779EC9EF31F5DF...
Motivation Distorts search weightings
Checking Method Field comparison
Notes Record display: creator
concatenated onto title
Metadata ScenarioBasic Retrieval
Improving data quality at Europeana. Measurement
15
overall view collection view record view
Completeness – 40 measurements
Field cardinality – 127 measurements
Uniqueness – 6 measurements
Multilinguality – 300+ measurements
Language specification – 127 measurements
Problem catalog – 3 measurements
etc.
links
measurementsaggregated numbers
Improving data quality at Europeana. Field frequency per collections
16
no record has alternative title
every record has alternative title
filters
Improving data quality at Europeana. Details of field cardinality
17
128 subjects in one record
median is 0, mean is close to 1
link to interesting records
Improving data quality at Europeana. Multilinguality
18
@resource is a URI
@ = language notation in RDF
no language specification
Improving data quality at Europeana. Language frequency
19
has language
specification
has no language
specification
Improving data quality at Europeana. Encoding problems
20
same language,
different encodings
Improving data quality at Europeana. Multilingual saturation
21
Levels of Multilinguality per field Expressed in numbers
Missing field NA
Text string without language tag 0
Text string with language tag 1
Text string with 2-3 different language tags 2
Text string with 4-9 different language tags 2.3
Text string with 10+ different language tags 2.6
Link to controlled vocabulary 3
Penalty for strings mixed with translations with no language tag -0.2
Improving data quality at Europeana. Multilingual saturation
22
Improving data quality at Europeana. Information content
23
1 means a unique term
0.0000x means a very frequent term
These are cumulative numbers
entropycumulative = term1 + ... + termn
Improving data quality at Europeana. Outliers
24
bulk of records are close to zero
although 25% are between 0.05 and 1.25
Improving data quality at Europeana. Architecture
25
Apache Spark
OAI-PMH client (PHP)
Analysis with
Spark (Scala) Analysis with R
Web interface
(php, d3.js)
Hadoop File
System
JSON files
Apache Solr
NoSQL
datastore
JSON files
JSON files image files
CSV files
CSV files
recent workflow
planned workflow
Improving data quality at Europeana. Further steps
26
▪ Translate the results into
documentation,
recommendations
▪ Communication with data
providers
▪ Human evaluation of metadata
quality
▪ Cooperation with other projects
▪ Incorporating into Europeana’s
new ingestion tool
▪ Shape Constraint Language
(SHACL) for defining patterns
▪ Process usage statistics
▪ Measuring changes of scores
▪ Machine learning based
classification & clustering
human analysis technical
Improving data quality at Europeana. Links
27
▪ Europeana Data Quality Committee:
http://pro.europeana.eu/europeana-tech/data-quality-
committee
▪ site: http://144.76.218.178/europeana-qa/
▪ codes: http://pkiraly.github.io/about/#source-codes

More Related Content

What's hot

Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and Systems
Adrian Paschke
 
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
Nandana Mihindukulasooriya
 
Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
Saeedeh Shekarpour
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
WU (Vienna University of Economics and Business)
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
GESIS
 
Efficient RDF Interchange (ERI) Format for RDF Data Streams
Efficient RDF Interchange (ERI) Format for RDF Data StreamsEfficient RDF Interchange (ERI) Format for RDF Data Streams
Efficient RDF Interchange (ERI) Format for RDF Data Streams
WU (Vienna University of Economics and Business)
 
Introduction to linked data and the semantic web
Introduction to linked data and the semantic webIntroduction to linked data and the semantic web
Introduction to linked data and the semantic web
Dave Reynolds
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
Andre Freitas
 
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Daniel Valcarce
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Saeedeh Shekarpour
 
Metadata mapping
Metadata mappingMetadata mapping
Metadata mapping
Vlad Vega
 
Metadata crosswalks
Metadata crosswalksMetadata crosswalks
Metadata crosswalks
Richard.Sapon-White
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge Graphs
Peter Haase
 
2010 09 opm_tutorial_01-jun-usecase-datagovuk
2010 09 opm_tutorial_01-jun-usecase-datagovuk2010 09 opm_tutorial_01-jun-usecase-datagovuk
2010 09 opm_tutorial_01-jun-usecase-datagovuk
Jun Zhao
 
All aboard the Semantic Bandwagon
All aboard the Semantic BandwagonAll aboard the Semantic Bandwagon
All aboard the Semantic Bandwagon
Royal Society of Chemistry
 
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
semanticsconference
 
ICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and mediaICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and media
gjhouben
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender Systems
Vito Ostuni
 
Mods0210
Mods0210Mods0210
Mods0210
Song,Yoo-hwa
 
Data analysis in dataverse & visualization of datasets on historical maps
Data analysis in dataverse & visualization of datasets on historical mapsData analysis in dataverse & visualization of datasets on historical maps
Data analysis in dataverse & visualization of datasets on historical maps
vty
 

What's hot (20)

Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and Systems
 
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
 
Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
 
Efficient RDF Interchange (ERI) Format for RDF Data Streams
Efficient RDF Interchange (ERI) Format for RDF Data StreamsEfficient RDF Interchange (ERI) Format for RDF Data Streams
Efficient RDF Interchange (ERI) Format for RDF Data Streams
 
Introduction to linked data and the semantic web
Introduction to linked data and the semantic webIntroduction to linked data and the semantic web
Introduction to linked data and the semantic web
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
Metadata mapping
Metadata mappingMetadata mapping
Metadata mapping
 
Metadata crosswalks
Metadata crosswalksMetadata crosswalks
Metadata crosswalks
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge Graphs
 
2010 09 opm_tutorial_01-jun-usecase-datagovuk
2010 09 opm_tutorial_01-jun-usecase-datagovuk2010 09 opm_tutorial_01-jun-usecase-datagovuk
2010 09 opm_tutorial_01-jun-usecase-datagovuk
 
All aboard the Semantic Bandwagon
All aboard the Semantic BandwagonAll aboard the Semantic Bandwagon
All aboard the Semantic Bandwagon
 
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
 
ICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and mediaICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and media
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender Systems
 
Mods0210
Mods0210Mods0210
Mods0210
 
Data analysis in dataverse & visualization of datasets on historical maps
Data analysis in dataverse & visualization of datasets on historical mapsData analysis in dataverse & visualization of datasets on historical maps
Data analysis in dataverse & visualization of datasets on historical maps
 

Similar to Improving data quality at Europeana (SWIB 2016)

Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Péter Király
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Péter Király
 
DataGraft: Data-as-a-Service for Open Data
DataGraft: Data-as-a-Service for Open DataDataGraft: Data-as-a-Service for Open Data
DataGraft: Data-as-a-Service for Open Data
dapaasproject
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)
Péter Király
 
Data Quality Assessment in Europeana: Metrics for Multilinguality
Data Quality Assessment in Europeana:  Metrics for MultilingualityData Quality Assessment in Europeana:  Metrics for Multilinguality
Data Quality Assessment in Europeana: Metrics for Multilinguality
Juliane Stiller
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
andrea huang
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)
Péter Király
 
Data quality in cultural heritage (meta)data
Data quality in cultural heritage (meta)dataData quality in cultural heritage (meta)data
Data quality in cultural heritage (meta)data
Valentine Charles
 
LDBC 19 November 2013
LDBC 19 November 2013  LDBC 19 November 2013
LDBC 19 November 2013
Europeana
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Andre Freitas
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
Jiaheng Lu
 
Fondly Collisions: Archival hierarchy and the Europeana Data Model
Fondly Collisions: Archival hierarchy and the Europeana Data Model   Fondly Collisions: Archival hierarchy and the Europeana Data Model
Fondly Collisions: Archival hierarchy and the Europeana Data Model
Valentine Charles
 
Europeana and open data
Europeana and open dataEuropeana and open data
Europeana and open data
RobinaClayphan
 
Open Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODOpen Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LOD
Antoine Isaac
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
Ivan Herman
 
20140521 sem-tech-biz-guest-lecture
20140521 sem-tech-biz-guest-lecture20140521 sem-tech-biz-guest-lecture
20140521 sem-tech-biz-guest-lecture
Vladimir Alexiev, PhD, PMP
 
Building better knowledge graphs through social computing
Building better knowledge graphs through social computingBuilding better knowledge graphs through social computing
Building better knowledge graphs through social computing
Elena Simperl
 
Data curation and data archiving at different stages of the research process
Data curation and data archiving at different stages of the research processData curation and data archiving at different stages of the research process
Data curation and data archiving at different stages of the research process
Andrea Scharnhorst
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
Tope Omitola
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
ErhardRahm
 

Similar to Improving data quality at Europeana (SWIB 2016) (20)

Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
 
DataGraft: Data-as-a-Service for Open Data
DataGraft: Data-as-a-Service for Open DataDataGraft: Data-as-a-Service for Open Data
DataGraft: Data-as-a-Service for Open Data
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)
 
Data Quality Assessment in Europeana: Metrics for Multilinguality
Data Quality Assessment in Europeana:  Metrics for MultilingualityData Quality Assessment in Europeana:  Metrics for Multilinguality
Data Quality Assessment in Europeana: Metrics for Multilinguality
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)
 
Data quality in cultural heritage (meta)data
Data quality in cultural heritage (meta)dataData quality in cultural heritage (meta)data
Data quality in cultural heritage (meta)data
 
LDBC 19 November 2013
LDBC 19 November 2013  LDBC 19 November 2013
LDBC 19 November 2013
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
 
Fondly Collisions: Archival hierarchy and the Europeana Data Model
Fondly Collisions: Archival hierarchy and the Europeana Data Model   Fondly Collisions: Archival hierarchy and the Europeana Data Model
Fondly Collisions: Archival hierarchy and the Europeana Data Model
 
Europeana and open data
Europeana and open dataEuropeana and open data
Europeana and open data
 
Open Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LODOpen Data Masterclass - Europeana and LOD
Open Data Masterclass - Europeana and LOD
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
 
20140521 sem-tech-biz-guest-lecture
20140521 sem-tech-biz-guest-lecture20140521 sem-tech-biz-guest-lecture
20140521 sem-tech-biz-guest-lecture
 
Building better knowledge graphs through social computing
Building better knowledge graphs through social computingBuilding better knowledge graphs through social computing
Building better knowledge graphs through social computing
 
Data curation and data archiving at different stages of the research process
Data curation and data archiving at different stages of the research processData curation and data archiving at different stages of the research process
Data curation and data archiving at different stages of the research process
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 

More from Péter Király

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Péter Király
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)
Péter Király
 
Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)
Péter Király
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)
Péter Király
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
Péter Király
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)
Péter Király
 
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Péter Király
 
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Péter Király
 
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Péter Király
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)
Péter Király
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Péter Király
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)
Péter Király
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
Péter Király
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
Péter Király
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
Péter Király
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
Péter Király
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Péter Király
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)
Péter Király
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Péter Király
 
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Péter Király
 

More from Péter Király (20)

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)
 
Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)
 
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
 
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
 
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
 
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
 

Recently uploaded

Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
Severalnines
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
Massimo Artizzu
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
kalichargn70th171
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
anfaltahir1010
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Peter Caitens
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
ISH Technologies
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
Benefits of Artificial Intelligence in Healthcare!
Benefits of  Artificial Intelligence in Healthcare!Benefits of  Artificial Intelligence in Healthcare!
Benefits of Artificial Intelligence in Healthcare!
Prestware
 
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
The Third Creative Media
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
safelyiotech
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
OnePlan Solutions
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 

Recently uploaded (20)

Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
Benefits of Artificial Intelligence in Healthcare!
Benefits of  Artificial Intelligence in Healthcare!Benefits of  Artificial Intelligence in Healthcare!
Benefits of Artificial Intelligence in Healthcare!
 
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 

Improving data quality at Europeana (SWIB 2016)

  • 1. Improving data quality at Europeana New requirements and methods for better measuring metadata quality Péter Király1, Hugo Manguinhas2, Valentine Charles2, Antoine Isaac2, Timothy Hill2 1Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen 2Europeana Foundation, The Netherlands
  • 2. Improving data quality at Europeana. The data workflow 2 data transformations Europeana Data Model (EDM) Dublin Core, LIDO, EAD, MARC, EDM custom, ...
  • 3. Improving data quality at Europeana. The problem 3 there are “good” and “bad” metadata records but we don’t have clear metrics like this: functional requirements goodacceptablebad
  • 4. Improving data quality at Europeana. Non-informative values 4 non informative dc:title: “photograph, framed”, “group photograph” “photograph” informative dc:title: “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”
  • 5. Improving data quality at Europeana. Copy & paste cataloging 5 from a template? more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  • 6. Improving data quality at Europeana. Why data quality is important? 6 “Fitness for purpose” (QA principle) no metadata no access to data no data usage more explanation: Data on the Web Best Practices W3C Working Draft, https://www.w3.org/TR/dwbp/
  • 7. Improving data quality at Europeana. Data Quality Committee 7
  • 8. Improving data quality at Europeana. Hypothesis 8 by measuring structural elements we can predict metadata record quality ≃ metadata smell
  • 9. Improving data quality at Europeana. Purposes 9 ▪ improve the metadata ▪ services: good data → reliable functions ▪ better metadata schema & documentation ▪ propagate “good practice”
  • 10. Improving data quality at Europeana. What to measure? 10 ▪ Structural and semantic features Cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements) ▪ Discovery scenarios Requirements of the most important functions ▪ Problem catalog Known metadata problems
  • 11. Improving data quality at Europeana. Discovery scenarios 11 ▪ Basic retrieval with high precision and recall ▪ Cross-language recall ▪ Entity-based facets ▪ Date-based facets ▪ Improved language facets ▪ Browse by subjects and resource types ▪ Browse by agents ▪ Hierarchical search and facets ▪ ... themostimportantfunctions
  • 12. Improving data quality at Europeana. Metadata requirements 12 As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc. Metadata analysis In each case the underlying requirement is that the relevant EDM fields for objects be populated with URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages. Measurement rules ▪ the relevant field values should be resolvable URI ▪ each URI should be associated with labels in multiple languages
  • 13. Improving data quality at Europeana. Problem catalog 13 ▪ Title contents same as description contents ▪ Systematic use of the same title ▪ Bad string: “empty” (and variants) ▪ Shelfmarks and other identifiers in fields ▪ Creator not an agent name ▪ Absurd geographical location ▪ Subject field used as description field ▪ Unicode U+FFFD (�) ▪ Very short description field ▪ ... “metadataanti-patterns”
  • 14. Improving data quality at Europeana. Problem definition 14 Description Title contents same as description contents Example /2023702/35D943DF60D779EC9EF31F5DF... Motivation Distorts search weightings Checking Method Field comparison Notes Record display: creator concatenated onto title Metadata ScenarioBasic Retrieval
  • 15. Improving data quality at Europeana. Measurement 15 overall view collection view record view Completeness – 40 measurements Field cardinality – 127 measurements Uniqueness – 6 measurements Multilinguality – 300+ measurements Language specification – 127 measurements Problem catalog – 3 measurements etc. links measurementsaggregated numbers
  • 16. Improving data quality at Europeana. Field frequency per collections 16 no record has alternative title every record has alternative title filters
  • 17. Improving data quality at Europeana. Details of field cardinality 17 128 subjects in one record median is 0, mean is close to 1 link to interesting records
  • 18. Improving data quality at Europeana. Multilinguality 18 @resource is a URI @ = language notation in RDF no language specification
  • 19. Improving data quality at Europeana. Language frequency 19 has language specification has no language specification
  • 20. Improving data quality at Europeana. Encoding problems 20 same language, different encodings
  • 21. Improving data quality at Europeana. Multilingual saturation 21 Levels of Multilinguality per field Expressed in numbers Missing field NA Text string without language tag 0 Text string with language tag 1 Text string with 2-3 different language tags 2 Text string with 4-9 different language tags 2.3 Text string with 10+ different language tags 2.6 Link to controlled vocabulary 3 Penalty for strings mixed with translations with no language tag -0.2
  • 22. Improving data quality at Europeana. Multilingual saturation 22
  • 23. Improving data quality at Europeana. Information content 23 1 means a unique term 0.0000x means a very frequent term These are cumulative numbers entropycumulative = term1 + ... + termn
  • 24. Improving data quality at Europeana. Outliers 24 bulk of records are close to zero although 25% are between 0.05 and 1.25
  • 25. Improving data quality at Europeana. Architecture 25 Apache Spark OAI-PMH client (PHP) Analysis with Spark (Scala) Analysis with R Web interface (php, d3.js) Hadoop File System JSON files Apache Solr NoSQL datastore JSON files JSON files image files CSV files CSV files recent workflow planned workflow
  • 26. Improving data quality at Europeana. Further steps 26 ▪ Translate the results into documentation, recommendations ▪ Communication with data providers ▪ Human evaluation of metadata quality ▪ Cooperation with other projects ▪ Incorporating into Europeana’s new ingestion tool ▪ Shape Constraint Language (SHACL) for defining patterns ▪ Process usage statistics ▪ Measuring changes of scores ▪ Machine learning based classification & clustering human analysis technical
  • 27. Improving data quality at Europeana. Links 27 ▪ Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality- committee ▪ site: http://144.76.218.178/europeana-qa/ ▪ codes: http://pkiraly.github.io/about/#source-codes