Metadata Quality Assurance Framework at QQML2016 conference - full version
1. Metadata Quality Assurance Framework
Péter Király <peter.kiraly@gwdg.de>
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
QQML2016
8th International Conference on Qualitative and Quantitative Methods in Libraries
2016-05-24, London
3. Metadata Quality Assurance Framework
Typical issues – non-informative field
Title is not informative
Non-informative:
„photograph, framed”
„group photograph”
„photograph”
vs.
Informative:
„Photograph of Sir Dugald Clerk”
„Photograph of "Puffing Billy"”
5. Metadata Quality Assurance Framework
Typical issues – Field overuse
What is the meaning of the field? (overuse)
TextGrid OAI-PMH response
6. Metadata Quality Assurance Framework
Why is data quality important?
„Fitness for purpose” (QA principle)
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft 19 May 2016
https://www.w3.org/TR/dwbp/
7. Metadata Quality Assurance Framework
Europeana Data Quality Committee
Online collaboration
Use case documents
Problem catalog
Tickets
Discussion forum
#EuropeanaDataQuality
Bi-weekly teleconference
Bi-yearly face-to-face meeting
Topics
Usage scenarios
Metadata profiles
Schema modification
Measuring
Event model
Proposals for data providers
8. Metadata Quality Assurance Framework
Research hypothesis
Hypothesis: by measuring structural elements we can predict metadata record quality
9. Metadata Quality Assurance Framework
What is it good for?
improve the metadata
improve services: good data → functions
improve metadata schema & documentation
propagate „good practice”
Domains:
cultural heritage sector
research data management and archiving
12. Metadata Quality Assurance Framework
Measurements
Schema-independent structural features
existence, cardinality, uniqueness, length,
dictionary entry, data type conformance
Use case scenarios („fit for purpose”)
Requirements of the most important functions
Problem catalog
Known metadata problems
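A minimal sketch of how a few of the schema-independent structural features (existence, cardinality, length) could be measured for one field. The record layout (field name mapped to a list of string values) and the field names are assumptions made for illustration, not the framework's actual data model.

```python
from typing import Dict, List

# Assumed record layout: field name -> list of string values.
Record = Dict[str, List[str]]


def structural_features(record: Record, field: str) -> dict:
    """Measure simple structural features of one field in one record."""
    values = record.get(field, [])
    return {
        "exists": len(values) > 0,                 # existence
        "cardinality": len(values),                # number of instances
        "distinct_values": len(set(values)),       # repeated values collapse
        "max_length": max((len(v) for v in values), default=0),
    }


# Hypothetical sample record.
sample = {
    "dc:title": ["Photograph of Sir Dugald Clerk"],
    "dc:subject": ["photograph", "photograph"],
}
```

Calling `structural_features(sample, "dc:subject")` reports two instances but only one distinct value, which is the kind of signal a cardinality or uniqueness measurement would surface.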
13. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements
Europeana’s most important functions
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
14. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – Entity-based facets
Scenario
As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM
fields for objects be populated by identifying URIs rather than free
text. These URIs need to be related, at a minimum, to a label for
each of the supported languages.
Measurement rules
The relevant field values should be resolvable URIs
Each URI should have labels in multiple languages
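The two measurement rules could be approximated as follows. This is a sketch under stated assumptions: the URI check is purely syntactic (it does not try to resolve anything over the network), the label structure (language code mapped to label text) is hypothetical, and the threshold of two languages is an arbitrary illustrative default.

```python
from typing import Dict
from urllib.parse import urlparse


def looks_like_uri(value: str) -> bool:
    """Syntactic check only: an http(s) URI with a host part."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def has_multilingual_labels(labels: Dict[str, str], minimum: int = 2) -> bool:
    """True if at least `minimum` languages have a non-empty label."""
    return sum(1 for text in labels.values() if text.strip()) >= minimum


# Hypothetical values: a free-text creator versus an entity URI.
free_text = "Rembrandt"
entity_uri = "http://example.org/agent/1"
entity_labels = {"en": "Rembrandt", "nl": "Rembrandt van Rijn", "fr": ""}
```

A real check for resolvability would additionally dereference the URI, which is deliberately left out here.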
15. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – Date-based facets
Scenario
I want to be able to filter my results by a variety of timespans, e.g.:
Date of creation
Date of publication
Date as subject
Metadata analysis
Dates should be fully and consistently normalised to follow the XSD
date-time data types. Dates expressed in styles like “490 avant J.C”
that are inherently language dependent should be avoided as they’re
very difficult to normalise (e.g. this should be represented as
“-0490”^^xsd:gYear).
Measurement rules
Field values should conform to the XSD date-time data types
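A rough sketch of how the date rule could be checked. The two regular expressions are simplified approximations of the `xsd:gYear` and `xsd:date` lexical forms (no time zone suffix, no check that the day exists in the given month); they are an illustration, not a complete XSD validator.

```python
import re
from typing import Optional

# Simplified lexical forms; time zones and calendar validity are ignored.
G_YEAR = re.compile(r"-?\d{4,}")
DATE = re.compile(r"-?\d{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def xsd_date_type(value: str) -> Optional[str]:
    """Return the matching XSD type name, or None for free-text dates."""
    if G_YEAR.fullmatch(value):
        return "xsd:gYear"
    if DATE.fullmatch(value):
        return "xsd:date"
    return None
```

With this sketch, `"-0490"` is recognised as `xsd:gYear`, while the language-dependent form `"490 avant J.C"` is rejected.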
16. Metadata Quality Assurance Framework
Problem catalog
Catalog of known metadata problems in Europeana
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
...
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
17. Metadata Quality Assurance Framework
Problem catalog
Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata scenario: Basic Retrieval
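The "field comparison" checking method for this catalog entry might look like the sketch below. The field names and the normalisation step (case folding and whitespace collapsing) are assumptions; the actual comparison may be stricter or looser.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.casefold().split())


def title_equals_description(record: Dict[str, List[str]]) -> bool:
    """True if any title value is identical to any description value."""
    titles = {normalize(t) for t in record.get("dc:title", [])}
    descriptions = {normalize(d) for d in record.get("dc:description", [])}
    return bool(titles & descriptions)


# Hypothetical records.
duplicated = {"dc:title": ["Group photograph"], "dc:description": ["group  photograph"]}
distinct = {"dc:title": ["Group photograph"], "dc:description": ["Staff of the works, 1901"]}
```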
19. Metadata Quality Assurance Framework
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/
A language for describing and constraining the contents of RDF
graphs. It provides a high-level vocabulary to identify predicates and
their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
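To make the intent of some of these constraints concrete, here is a small Python emulation of `sh:minCount`, `sh:maxCount`, `sh:minLength` and `sh:pattern` applied to the values of a single field. It is not SHACL and does not use an RDF library; the `FieldConstraint` class and `violations` function are hypothetical names introduced only for this sketch.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldConstraint:
    """Hypothetical stand-in for a few SHACL property constraints."""
    min_count: int = 0
    max_count: Optional[int] = None
    min_length: int = 0
    pattern: Optional[str] = None


def violations(values: List[str], constraint: FieldConstraint) -> List[str]:
    """List the names of the constraints that the values violate."""
    found = []
    if len(values) < constraint.min_count:
        found.append("minCount")
    if constraint.max_count is not None and len(values) > constraint.max_count:
        found.append("maxCount")
    for value in values:
        if len(value) < constraint.min_length:
            found.append("minLength")
        # Like sh:pattern, the regex may match anywhere unless anchored.
        if constraint.pattern is not None and re.search(constraint.pattern, value) is None:
            found.append("pattern")
    return found
```

For example, `violations(["photograph"], FieldConstraint(min_count=1, min_length=15))` flags the value as too short, which is one way a "very short description" rule from the problem catalog could be expressed.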
31. Metadata Quality Assurance Framework
Field cardinality – histogram
128 subjects in one record
median is 0, mean is close to 1
link to interesting records
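The summary figures behind such a histogram can be computed along these lines. The input numbers are made up to show a skewed distribution (most records have no subject, a few have several); they are not the Europeana data.

```python
from collections import Counter
from statistics import mean, median
from typing import List


def cardinality_summary(counts: List[int]) -> dict:
    """Histogram plus median, mean and max of per-record field counts."""
    return {
        "histogram": dict(sorted(Counter(counts).items())),
        "median": median(counts),
        "mean": mean(counts),
        "max": max(counts),
    }


# Illustrative per-record subject counts (not real data).
subject_counts = [0, 0, 0, 0, 1, 2, 4]
summary = cardinality_summary(subject_counts)
```

A median of 0 combined with a mean near 1 indicates that a small number of records carry most of the field instances, which is why the outliers are worth linking to.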
36. Metadata Quality Assurance Framework
Language frequency / barchart
same language,
different encodings
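One way to merge different encodings of the same language before counting is sketched below. The slide does not list the encodings that were found, so the alias table is an assumption: it maps a few ISO 639-2 three-letter codes to their ISO 639-1 two-letter equivalents and drops region subtags.

```python
from collections import Counter
from typing import Iterable

# Assumed alias table (ISO 639-2 codes mapped to ISO 639-1 codes).
ALIASES = {"eng": "en", "deu": "de", "ger": "de", "fra": "fr", "fre": "fr"}


def normalize_language(tag: str) -> str:
    """Lower-case the tag, drop any region subtag, then apply the aliases."""
    primary = tag.strip().lower().split("-")[0]
    return ALIASES.get(primary, primary)


def language_frequency(tags: Iterable[str]) -> Counter:
    """Count language tags after normalisation."""
    return Counter(normalize_language(tag) for tag in tags)
```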
37. Metadata Quality Assurance Framework
Language frequency / Treemap
has language
specification
has no language
specification
38. Metadata Quality Assurance Framework
Language frequency / Treemap with resources
has no language
specification
has language
specification
Is a URI
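The three categories shown in the treemap could be assigned per field instance roughly as follows. The function name, the category labels and the simple prefix test for URIs are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_value(value: str, language: Optional[str]) -> str:
    """Assign a field instance to one of the three treemap categories."""
    if value.startswith(("http://", "https://")):
        return "is_uri"
    if language:
        return "has_language"
    return "no_language"


def category_counts(instances: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Count (value, language tag) pairs per category."""
    return Counter(classify_value(value, lang) for value, lang in instances)


# Hypothetical field instances: (value, language tag or None).
instances = [
    ("Photograph", "en"),
    ("Fotografie", None),
    ("http://example.org/concept/1", None),
]
```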
39. Metadata Quality Assurance Framework
Language frequency / Treemap + interaction + table
hide/display categories
table-like format
41. Metadata Quality Assurance Framework
Entropy – term uniqueness / main
1 means a unique term
0.0000x means a very frequent term
These are cumulative numbers
entropy_cumulative = term_1 + ... + term_n
42. Metadata Quality Assurance Framework
Entropy – term uniqueness / collection
max is exceptional (=1425 * mean)
unique records
less unique or non-unique records
43. Metadata Quality Assurance Framework
Entropy – term uniqueness / refining the picture
bulk of records are close to zero
although 25% are between 0.05 and 1.25
44. Metadata Quality Assurance Framework
Entropy – term uniqueness / field value
Russian text transliterated into the Latin
writing system, not in Cyrillic
45. Metadata Quality Assurance Framework
Entropy – term uniqueness / terms
explanation of uniqueness score
TF-IDF values come from Apache Solr
term frequency: 1
document freq.: 2
uniqueness score: 0.5
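The figures on this slide are consistent with a score of term frequency divided by document frequency (1 / 2 = 0.5). The sketch below assumes that ratio; the slides only state that the TF-IDF values come from Apache Solr, so the exact formula is an assumption. The cumulative score sums the per-term scores of a field value.

```python
from typing import Iterable, Tuple


def uniqueness_score(term_frequency: int, document_frequency: int) -> float:
    """Assumed formula: tf / df. A term found in one document scores 1.0."""
    if document_frequency <= 0:
        raise ValueError("document frequency must be positive")
    return term_frequency / document_frequency


def cumulative_score(terms: Iterable[Tuple[int, int]]) -> float:
    """Sum the scores of all (tf, df) pairs belonging to one field value."""
    return sum(uniqueness_score(tf, df) for tf, df in terms)
```

Under this assumption a term occurring in a very large number of documents scores close to zero, matching the "0.0000x means a very frequent term" reading.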
47. Metadata Quality Assurance Framework
Problem catalog – Long subject
a record with 265 „long” subject headings
48. Metadata Quality Assurance Framework
Problem catalog – Long subject – example (not so long...)
Conclusion: we have to refine
the definition of „long”
49. Metadata Quality Assurance Framework
Problem catalog – same title and description
the record has one title and one description, and they are identical
... and there are 9 such records
51. Metadata Quality Assurance Framework
completeness sub-dimensions
Are the sub-dimensions (field groups
supporting specific functionalities) complete?
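Completeness per sub-dimension can be expressed as the share of a field group that is populated in a record. The field groups and field names below are hypothetical examples, not the groups actually used in the framework.

```python
from typing import Dict, List

# Hypothetical field groups, each supporting one functionality.
FIELD_GROUPS = {
    "basic_retrieval": ["dc:title", "dc:description"],
    "date_facets": ["dcterms:created", "dcterms:issued"],
}


def completeness(record: Dict[str, List[str]], fields: List[str]) -> float:
    """Share of the given fields that have at least one non-empty value."""
    if not fields:
        return 0.0
    present = sum(
        1 for field in fields
        if any(value.strip() for value in record.get(field, []))
    )
    return present / len(fields)


# Hypothetical record: a title and a creation year, nothing else.
record = {"dc:title": ["Photograph of Sir Dugald Clerk"], "dcterms:created": ["1900"]}
scores = {name: completeness(record, fields) for name, fields in FIELD_GROUPS.items()}
```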
55. Metadata Quality Assurance Framework
Further steps
Incorporating into Europeana’s ingestion tool
Process usage statistics (logs, Google Analytics)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning based classification & clustering
Incorporating into research data management tool
Cooperation with other projects
56. Metadata Quality Assurance Framework
Project principles
Scalable, ready for big data
Loose coupling to metadata schemas
Transparency: open source, open data (CC0)
Release early, release often
Getting real [1]
Collaboration and communication
[1] https://gettingreal.37signals.com/
57. Metadata Quality Assurance Framework
Architectural overview
[Diagram: components and data flow]
Components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js)
Storage: Hadoop File System; Apache Solr; Apache Cassandra
Intermediate files: JSON files; CSV files; image files
Legend: recent workflow; planned workflow
58. Metadata Quality Assurance Framework
Follow me
Europeana Data Quality Committee
http://pro.europeana.eu/europeana-tech/data-quality-committee
research plan and blog http://pkiraly.github.io
site http://144.76.218.178/europeana-qa/
source code
https://github.com/pkiraly/europeana-qa-spark
https://github.com/pkiraly/europeana-qa-r
@kiru, https://www.linkedin.com/in/peterkiraly