This document outlines a metadata quality assurance framework. It discusses why data quality is important, what the framework can be used for, and its key principles. It then describes how metadata quality will be measured, including examining schema-independent structural features, use case scenarios, and cataloging known metadata problems. Specific discovery scenarios and their metadata requirements are provided as examples. The document concludes by outlining further steps to develop and implement the framework.
Dataset Catalogs as a Foundation for FAIR* Data – Tom Plasterer
BioPharma and the broader research community is faced with the challenge of simply finding the appropriate internal and external datasets for downstream analytics, knowledge-generation and collaboration. With datasets as the core asset, we wanted to promote both human and machine exploitability, using web-centric data cataloguing principles as described in the W3C Data on the Web Best Practices. To do so, we adopted DCAT (Data CATalog Vocabulary) and VoID (Vocabulary of Interlinked Datasets) for both RDF and non-RDF datasets at summary, version and distribution levels. Further, we’ve described datasets using a limited set of well-vetted public vocabularies, focused on cross-omics analytes and clinical features of the catalogued datasets.
BioPharma and FAIR Data, a Collaborative Advantage – Tom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators the ability to link data is critical for dynamic interoperability. Adoption of linked data paradigm allows BioPharma to focus on core business: delivering valuable therapeutics in a timely manner.
FAIR Data Knowledge Graphs–from Theory to Practice – Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
Making Data FAIR (Findable, Accessible, Interoperable, Reusable) – Tom Plasterer
What to do About FAIR…
In the experience of most pharma professionals, FAIR remains fairly abstract, bordering on inconclusive. This session will outline specific case studies – real problems with real data, and address opportunities and real concerns.
· Why making data Findable, Accessible, Interoperable and Reusable is important.
Talk presented at the Data Driven Drug Development (D4) conference on March 20th, 2019.
OpenTox - an open community and framework supporting predictive toxicology an... – Barry Hardy
Presented at ACS Boston 2015 at a Session on the growing impact of Open Science chaired by Andy Lang and Tony Williams dedicated to the work, memory and legacy of JC Bradley and the work we carry forward!
One important goal of OpenTox is to support the development of an Open Standards-based predictive toxicology framework that provides a unified access to toxicological data and models. OpenTox supports the development of tools for the integration of data, for the generation and validation of in silico models for toxic effects, libraries for the development and integration of modelling algorithms, and scientifically sound validation and reporting routines.
The OpenTox Application Programming Interface (API) is an important open standards development for software development purposes. It provides a specification against which development of global interoperable toxicology resources by the broader community can be carried out. The use of OpenTox API-compliant web services to communicate instructions between linked resources with URI addresses supports the use of a wide variety of commands to carry out operations such as data integration, algorithm use, model building and validation. The OpenTox Framework currently includes, with its APIs, services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, reporting, investigations, studies, assays, and authentication and authorisation, which may be combined into multiple applications satisfying a variety of different user needs. As OpenTox creates a semantic web for toxicology, it should be an ideal framework for incorporating toxicology data, ontology and modelling developments, thus supporting both a mechanistic framework for toxicology and best practices in statistical analysis and computational modelling.
In this presentation I will review the recent OpenTox-based development of applications including the ToxBank data infrastructure supporting integrated analysis across biochemical, functional and omics datasets supporting the safety assessment goals of the SEURAT-1 program which aims to develop alternatives to animal testing.
Finally, I will provide an overview of the working group activities of the newly formed OpenTox Association which aim to progress the development of open source, data, standards and tools in this area.
http://wiki.knoesis.org/index.php/MaterialWays
http://www.knoesis.org/?q=research/semMat
Abstract
The sharing, discovery, and application of materials science and engineering data and documents are possible only if domain scientists are able and willing to do so. We need to overcome technological challenges such as the development of convenient computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data, and cultural challenges such as proper protection, control, and credit for sharing data. Our thesis and value proposition is that associating machine-processable semantics with materials science and engineering data and documents can provide a solid foundation for overcoming challenges associated with data discovery, integration, and interoperability caused by data heterogeneity. Specifically, easy to use and low upfront cost lightweight semantics in the form of file-level annotation can enable document discovery and sharing, while deeper data-level annotation using standardized ontologies can benefit semantic search and summarization. Machine processability achieved through fine-grained semantic annotation, extraction, and translation can enable data integration, interoperability and reasoning, ultimately leading to Linked Open Materials Science Data. Thus, a different granularity of semantics provides a continuum of cost/ease of use and expressiveness trade-off. In this presentation, we also show the application of semantic techniques for content extraction from materials and process specifications which are semi-structured and table-rich, and the application of semantic web techniques and technologies for materials vocabulary integration and curation (via semantic media wiki), semantic web visualization, efficient representation of provenance metadata and access control (via singleton property), and biomaterials information extraction
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ... – Tom Plasterer
Edge Informatics is an approach to accelerate collaboration in the BioPharma pipeline. By combining technical and social solutions knowledge can be shared and leveraged across the multiple internal and external silos participating in the drug development process. This is accomplished by making data assets findable, accessible, interoperable and reusable (FAIR). Public consortia and internal efforts embracing FAIR data and Edge Informatics are highlighted, in both preclinical and clinical domains.
This talk was presented at the Molecular Medicine Tri-Conference in San Francisco, CA on February 20, 2017
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ... – Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication or a validated target or a drug submission. Our failure rates are exceedingly high especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. Using both technical and social solutions together knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions to enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. When taken together more accurate, timely and inclusive decision-making is fostered.
OSFair2017 Training | FAIR metrics - Starring your data sets – Open Science Fair
Peter Doorn, Marjan Grootveld & Elly Dijk talk about FAIR data principles and present the assessment tool that DANS is developing for data repositories | OSFair2017 Workshop
Workshop title: FAIR metrics - Starring your data sets
Workshop overview:
Do you want to join our effort to put the FAIR data principles into practice? Come and explore the assessment tool that DANS, Data Archiving and Networked Services in the Netherlands, is developing for data repositories.
The aim of our work is to implement the FAIR principles into a data assessment tool so that every dataset which is deposited or reused from any digital repository can be assessed in terms of a score on the principles Findable, Accessible, Interoperable, and Reusable, using a ‘FAIRness’ scale from 1 to 5 stars. In this interactive session participants can explore the pilot version of FAIRdat: the FAIR data assessment tool. The organisers would like to inform you about the project, and look forward to all feedback to improve the tool, or to improve the metrics that are used.
DAY 3 - PARALLEL SESSION 7
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind... – Nandana Mihindukulasooriya
Thesis PDF version: https://oa.upm.es/62935/
In the era of digital transformation, where most decision-making and artificial intelligence (AI) applications are becoming data-driven, data is becoming an essential asset. Linked Data, published in structured, machine-readable formats, with explicit semantics using Semantic Web standards, and with links to other data, is even more useful. The Linked (Open) Data cloud is growing with millions of new triples each year. Nevertheless, as we discuss in this thesis, such vast amounts of data bring several new challenges in ensuring the quality of Linked Data. The main goal of this thesis is to propose novel and scalable methods for automatic quality assessment and repair of Linked Data. The motivation for it is to significantly reduce the manual effort required by current quality assessment and repair, and to propose novel methods suitable for large-scale Linked Data sources such as DBpedia or Wikidata. The main hypothesis of this work is that data profiling metrics and automatic RDF Shape induction can be used to develop scalable and automatic quality assessment and repair methods. In this context, the following main contributions are delivered in this thesis: • LDQM, a Linked Data Quality Model for representing Linked Data quality in a standard manner and LD Sniffer, a tool based on LDQM for validating accessibility of Linked Data. LDQM contains 15 quality characteristics, 89 base measures, 23 derived measures, and 124 quality indicators. • Loupe, a framework for Linked Data profiling that includes the Loupe Extended Dataset Description Model and a suite of Linked Data profiling tools. The model consists of 84 Linked Data profiling metrics useful for quality assessment and repair tasks. Loupe tools have been used to evaluate 26 thousand datasets containing 34 billions of triples and Loupe contributed to the winning system of ISWC Semantic Web Challenge 2017. The Loupe Web portal has been visited more than 40,000 times by ~3000 unique visitors from 87 countries. • An automatic RDF Shape induction method that follows a data-driven approach to induce integrity constraints using data profiling metrics as features. The proposed method achieved an F1 of 98.81% in deriving maximum cardinality constraints, an F1 of 97.30% in deriving minimum cardinality constraints, and an F1 of 95.94% in deriving range constraints. • Four methods for automatic quality assessment and repair using RDF Shapes and data profiling metrics. They are motivated by several practical use cases that cover both Linked Data generation process and output and also cover both public and enterprise data. The four methods include (a) a method for detecting inconsistent mappings, (b) a method for detecting and eliminating noisy triples produced by open information extraction tools, (c) a method to repair links in RDF data, and (d) a method to complete type information in Linked Data ...
Engaging Information Professionals in the Process of Authoritative Interlinki... – Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan – andrea huang
The linked data paradigm provides the potential for any data to link, or to be linked, with structural information, internally and externally. To improve the current cultural service of the Union Catalog of Digital Archives Taiwan (catalog.digitalarchives.tw), a linked data prototype has been developed that benefits from extending the Art & Architecture Thesaurus (AAT) for a machine-understandable catalog service.
However, knowledge engineering is time- and labour-consuming, especially for an archive that is non-western in culture and multidisciplinary in nature. This makes mapping the data semantics of the UCdaT to international standards and vocabularies extremely challenging.
At this stage, the triple store is an experimental addition to the existing Union Catalog of Digital Archives Taiwan architecture, and provides semantic links to target collections for related suggestions. This will guide us in creating a future technical architecture that is scalable to the whole archive level, compliant with learning-by-doing guidelines, and preserves data even when it cannot be fully understood at present, so that it can at least be linked by others who may contribute their own third-party understandings for reuse.
From local to global: Romanian cultural values in Europeana through Locloud – locloud
Presentation given by Sorina Stanca
Cluj County Library, Romania
LoCloud Conference
Sharing local cultural heritage online with LoCloud services
Amersfoort, Netherlands
5 February 2016
KB domain aggregator for publications to DigitaleCollectie.nl – Elco van Staveren
The KB is the domain aggregator for metadata of digitised publications (books, newspapers and journals) in the Netherlands. I gave this presentation at the study day 'De grote gemene deler' of DigitaleCollectie.nl, on 5 June 2013 at the RCE in Amersfoort. It is a call to run digitisation through Metamorfoze and thereby to choose standardisation.
BEST PRACTICE: Value is the key to opening more doors – Dramatically enhance ... – B2B Marketing
BEST PRACTICE: Value is the key to opening more doors – Dramatically enhance your value prop through data, love, and relevance
Cory Polonetsky, senior director, value proposition initiative, Elsevier
Small, smaller and smallest: working with small archaeological content provid... – locloud
Presentation given by Holly Wright
Archaeology Data Service University of York, UK
LoCloud Conference
Sharing local cultural heritage online with LoCloud services
Amersfoort, Netherlands
5 February 2016
In this presentation I will give a brief overview of The Lord of the Rings, so please see all the slides of this presentation and leave comments for my improvement. Thanks
Part 4 of tutorials at DC2008, Berlin. (International Conference on Dublin Core and Metadata Applications). See also part 1-3 by Jane Greenberg, Pete Johnston, and Mikael Nilsson on DC history, concepts, and other schemas. This part focuses on practical issues.
Knowledge graphs are an emerging paradigm to represent information, yet their discovery and reuse is hampered by insufficient or inadequate metadata. Here, the COST Action Distributed Knowledge Graphs held a first workshop to develop a KG metadata schema. In this presentation, the progress and plans are discussed with the W3C Community Group on Knowledge Graph Construction.
Meemoo manages a large quantity of mainly audiovisual material from more than 170 partners in cultural heritage and media. More than 6 million objects are currently stored, ranging from digitised newspapers, photos, videos, and audio. In addition, a number of access platforms make the digitised content available to specific target groups, including teachers, students, professional re-users, or the public.
Metadata is a key element in all of meemoo’s processes. An important part of our activities is to collect, integrate, manage, and search a large variety of heterogeneous metadata across the archived content. The scale of this has increased enormously, so a good and integrated approach is needed to deal with the amount of metadata, its need for flexibility, and how easy it is to find. One of the specific challenges is modelling and storing data from machine learning algorithms (speech recognition, face detection and entity recognition) for reuse.
In this talk, we will discuss the key points and lessons learned from implementing the new metadata roadmap at Meemoo, which is focused on a Knowledge Graph-based infrastructure. The goal of the roadmap is to establish a better data practice within the organization and offer application-independent, uniform access to (meta)data that is spread across various systems and formats.
RO-Crate: A framework for packaging research products into FAIR Research Objects – Carole Goble
RO-Crate: A framework for packaging research products into FAIR Research Objects presented to Research Data Alliance RDA Data Fabric/GEDE FAIR Digital Object meeting. 2021-02-25
This presentation was provided by Vinod Chachra of VTLS Inc. during the NISO event "Next Generation Discovery Tools: New Tools, Aging Standards," held March 27 - March 28, 2008.
DITA, Semantics, Content Management, Dynamic Documents, and Linked Data – A M... – Paul Wlodarczyk
DITA was conceived as a model for improving reuse through topic-oriented modularization of content. Instead of creating new content or copying and pasting information which may or may not be current and authoritative, organizations manage a repository of content assets – or DITA topics – that can be centrally managed, maintained and reused across the enterprise. This helps to accelerate the creation and maintenance of documents and other deliverables and to ensure the quality and consistency of the content organizations publish. But the next frontier of DITA adoption is leveraging semantic technologies—taxonomies, ontologies and text analytics—to automate the delivery of targeted content. For example, a service incident from a customer is automatically matched with the appropriate response, which is authored and managed as a DITA topic. Learn how organizations can leverage DITA, semantics, content management, dynamic documents, and linked data to fully utilize the value of their information.
This presentation was provided by David Kuliman of Elsevier, during the NISO event "Content Presentation: Diversity of Formats." The webinar was held on February 10, 2021.
FAIRy stories: the FAIR Data principles in theory and in practice – Carole Goble
https://ucsb.zoom.us/meeting/register/tZYod-ippz4pHtaJ0d3ERPIFy2QIvKqjwpXR
FAIRy stories: the FAIR Data principles in theory and in practice
The ‘FAIR Guiding Principles for scientific data management and stewardship’ [1] launched a global dialogue within research and policy communities and started a journey to wider accessibility and reusability of data and preparedness for automation-readiness (I am one of the army of authors). Over the past 5 years FAIR has become a movement, a mantra and a methodology for scientific research and increasingly in the commercial and public sector. FAIR is now part of NIH, European Commission and OECD policy. But just figuring out what the FAIR principles really mean and how we implement them has proved more challenging than one might have guessed. To quote the novelist Rick Riordan “Fairness does not mean everyone gets the same. Fairness means everyone gets what they need”.
As a data infrastructure wrangler I lead and participate in projects implementing forms of FAIR in pan-national European biomedical Research Infrastructures. We apply web-based, industry-led approaches like Schema.org; work with big pharma on specialised FAIRification pipelines for legacy data; promote FAIR by Design methodologies and platforms into the researcher lab; and expand the principles of FAIR beyond data to computational workflows and digital objects. Many use Linked Data approaches.
In this talk I’ll use some of these projects to shine some light on the FAIR movement. Spoiler alert: although there are technical issues, the greatest challenges are social. FAIR is a team sport. Knowledge Graphs play a role – not just as consumers of FAIR data but as active contributors. To paraphrase another novelist, “It is a truth universally acknowledged that a Knowledge Graph must be in want of FAIR data.”
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
Enabling Secure Data Discoverability (SC21 Tutorial) – Globus
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks, and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple, but functional, data portal that facilitates flexible data description, faceted data search and secure data access.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Opendatabay - Open Data Marketplace.pptx – Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Metadata Quality Assurance Part II. The implementation begins
1. Metadata Quality Assurance Framework
Part II. – The implementation begins
Péter Király
peter.kiraly@gwdg.de
Göttingen, Geiststraße 10, GWDG meeting room 20/05/2016
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
2. Metadata Quality Assurance Framework
Why is data quality important?
„Fitness for purpose”
no metadata → no access to data → no data usage
More explanation:
Data on the Web Best Practices
W3C Working Draft 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/
3. Metadata Quality Assurance Framework
What is it good for?
Improve the metadata
Improve the metadata schema and its documentation
Propagate „good practice”
Improve services: „good” data is ranked higher in the search result list
Specifically for GWDG:
Could be built into current and planned data management / data archiving tools
4. Metadata Quality Assurance Framework
Project principles
Full transparency
Open source, open data (CC0)
Minimal viable product
„Release early. Release often. And listen to your customers” (Eric S. Raymond)
„Eat your own dog food”
Getting real https://gettingreal.37signals.com/
5. Metadata Quality Assurance Framework
Measurements
Schema-independent structural features
Existence, cardinality, uniqueness
Use case scenarios („fit for purpose”)
Requirements of the most important functions
Problem catalog
Known metadata problems
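As an illustration of the schema-independent structural measurements above, here is a minimal sketch in Java; the record shape (a simple map from field names to value lists) is an assumption made for this example, not the framework's actual API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of schema-independent structural measurements:
// existence, cardinality and uniqueness of field values.
public class StructuralMeasurements {

    // Occurrence counts of field values across the whole collection,
    // needed to decide uniqueness after all records have been seen.
    private final Map<String, Map<String, Integer>> valueCounts = new HashMap<>();

    /** Existence: does the record contain at least one instance of the field? */
    public boolean exists(Map<String, List<String>> record, String field) {
        List<String> values = record.get(field);
        return values != null && !values.isEmpty();
    }

    /** Cardinality: how many instances of the field does the record contain? */
    public int cardinality(Map<String, List<String>> record, String field) {
        List<String> values = record.get(field);
        return values == null ? 0 : values.size();
    }

    /** Register the field's values, so uniqueness can be measured over the collection. */
    public void collect(Map<String, List<String>> record, String field) {
        for (String value : record.getOrDefault(field, Collections.emptyList())) {
            valueCounts
                .computeIfAbsent(field, f -> new HashMap<>())
                .merge(value, 1, Integer::sum);
        }
    }

    /** Uniqueness: a value is unique if it occurs exactly once in the collection. */
    public boolean isUnique(String field, String value) {
        return valueCounts
            .getOrDefault(field, Collections.emptyMap())
            .getOrDefault(value, 0) == 1;
    }
}
```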
6. Metadata Quality Assurance Framework
Europeana Data Quality Committee
Online collaboration
Use case documents
Problem catalog
Tickets
Discussion forum
#EuropeanaDataQuality
Bi-weekly teleconf
Bi-yearly face-to-face meeting
Topics
Usage scenarios
Metadata profiles
Schema modification
Measuring
Event model
7. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initialized by Tim Hill, Europeana’s search engineer
8. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – 3. Entity-based facets
Scenario
As a user, ... I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.
Measurement rules
The relevant field values should be resolvable URIs (a minimal check is sketched below)
Each URI should have labels in multiple languages
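A hedged sketch of the first rule, checking that a field value is a syntactically valid and resolvable URI; the second rule (labels in multiple languages) would additionally require dereferencing the URI and inspecting the returned data, which is omitted here.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Sketch of the "resolvable URI" measurement rule (illustrative only).
public class UriValueCheck {

    /** Syntactic check: is the value an absolute http(s) URI rather than free text? */
    public static boolean looksLikeUri(String value) {
        try {
            URI uri = new URI(value.trim());
            return uri.isAbsolute()
                && ("http".equals(uri.getScheme()) || "https".equals(uri.getScheme()));
        } catch (Exception e) {
            return false;
        }
    }

    /** Resolvability check: does an HTTP HEAD request return a 2xx or 3xx status? */
    public static boolean resolves(String value) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) new URL(value.trim()).openConnection();
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            int status = connection.getResponseCode();
            return status >= 200 && status < 400;
        } catch (Exception e) {
            return false;
        }
    }
}
```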
9. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – 4. Date-based facets
Scenario
I want to be able to filter my results by a variety of timespans, e.g.:
Date of creation
Date of publication
Date as subject
Metadata analysis
Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).
Measurement rules
Field values should use the XSD date-time data types (a minimal check is sketched below)
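A minimal, regex-based sketch of this check for a few common XSD lexical forms; a complete implementation would also handle time zones, xsd:gYearMonth and the remaining date/time types.

```java
import java.util.regex.Pattern;

// Sketch: does a field value follow an XSD date/time lexical form?
public class XsdDateCheck {

    private static final Pattern G_YEAR   = Pattern.compile("-?\\d{4,}");                 // xsd:gYear, e.g. "-0490"
    private static final Pattern DATE     = Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}");   // xsd:date
    private static final Pattern DATETIME =
        Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?");        // xsd:dateTime

    public static boolean isXsdDate(String value) {
        String v = value.trim();
        return G_YEAR.matcher(v).matches()
            || DATE.matcher(v).matches()
            || DATETIME.matcher(v).matches();
    }

    public static void main(String[] args) {
        System.out.println(isXsdDate("-0490"));         // true  (xsd:gYear)
        System.out.println(isXsdDate("1789-07-14"));    // true  (xsd:date)
        System.out.println(isXsdDate("490 avant J.C")); // false (language-dependent style)
    }
}
```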
10. Metadata Quality Assurance Framework
Problem catalog
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
Credit: the document was initialized by Tim Hill, Europeana’s search engineer
11. Metadata Quality Assurance Framework
Problem catalog
Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking method: Field comparison (sketched below)
Notes: Record display: creator concatenated onto title
Metadata scenario: Basic Retrieval
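A minimal sketch of the field-comparison checking method for this problem; the field names (dc:title, dc:description) and the map-based record shape are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

// Sketch: flag records whose title is repeated verbatim (after normalisation)
// in the description, which distorts search weightings.
public class TitleEqualsDescriptionCheck {

    // Lower-case and collapse whitespace so trivial differences do not hide the problem.
    private static String normalise(String s) {
        return s == null ? "" : s.toLowerCase().replaceAll("\\s+", " ").trim();
    }

    public static boolean titleEqualsDescription(Map<String, List<String>> record) {
        for (String title : record.getOrDefault("dc:title", List.of())) {
            for (String description : record.getOrDefault("dc:description", List.of())) {
                if (!title.isEmpty() && normalise(title).equals(normalise(description))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```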
12. Metadata Quality Assurance Framework
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/
SHACL (Shapes Constraint Language) is a language for describing and constraining the contents of RDF graphs. SHACL groups these descriptions and constraints into "shapes", which specify conditions that apply at a given RDF node. Shapes provide a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
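To illustrate the constraint types listed above, here is a plain-Java sketch that mirrors sh:minCount/sh:maxCount, sh:minLength/sh:maxLength and sh:pattern for a single field; an actual implementation would express the shapes in Turtle and evaluate them with a SHACL engine rather than hand-coding them.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: SHACL-style constraints on one metadata field, expressed in plain Java.
public class FieldShape {

    private final int minCount;
    private final int maxCount;
    private final int minLength;
    private final int maxLength;
    private final Pattern pattern; // null if no sh:pattern-like constraint is defined

    public FieldShape(int minCount, int maxCount, int minLength, int maxLength, String regex) {
        this.minCount = minCount;
        this.maxCount = maxCount;
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.pattern = regex == null ? null : Pattern.compile(regex);
    }

    /** Validates all instances of one field against the shape. */
    public boolean conforms(List<String> values) {
        if (values.size() < minCount || values.size() > maxCount) {
            return false;                                  // sh:minCount / sh:maxCount
        }
        for (String value : values) {
            if (value.length() < minLength || value.length() > maxLength) {
                return false;                              // sh:minLength / sh:maxLength
            }
            if (pattern != null && !pattern.matcher(value).matches()) {
                return false;                              // sh:pattern
            }
        }
        return true;
    }
}
```

For example, a shape requiring exactly one title of at most 200 characters could be written as new FieldShape(1, 1, 1, 200, null).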
31. Metadata Quality Assurance Framework
Problem catalog – Long subject – example (not so long...)
Conclusion: we have to refine the definition of „long”
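One possible, purely illustrative way to refine the definition of „long” is to derive the threshold from the length distribution of the field across the whole collection (for example the 95th percentile) instead of fixing an absolute number.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: a data-driven threshold for "long" field values.
public class LongValueThreshold {

    /** Returns the character length at the given percentile of the value-length distribution. */
    public static int percentileLength(List<String> values, double percentile) {
        if (values.isEmpty()) {
            return 0;
        }
        List<Integer> lengths = new ArrayList<>();
        for (String value : values) {
            lengths.add(value.length());
        }
        Collections.sort(lengths);
        int index = (int) Math.ceil(percentile / 100.0 * lengths.size()) - 1;
        return lengths.get(Math.max(index, 0));
    }

    /** A value counts as "long" if it exceeds the derived threshold. */
    public static boolean isLong(String value, int threshold) {
        return value.length() > threshold;
    }
}
```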
36. Metadata Quality Assurance Framework
Further steps
Building completeness measurements into Europeana’s ingestion tool
Including usage statistics (log files, Google Analytics API)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning:
Classification/Clustering of records
Statistical relevancy of measurements
Göttingen use case: proposed SUB project „Shared Print Study”
Göttingen use case: incorporating into research data management tool
Cooperation with other projects
37. Metadata Quality Assurance Framework
Architectural overview (diagram)
Components: OAI-PMH client (PHP), Hadoop File System, Apache Spark (Java), analysis with Spark (Scala), analysis with R, Apache Solr, Apache Cassandra, web interface (PHP, d3.js); intermediate artefacts: JSON, CSV and image files.
The diagram distinguishes the recent workflow from the planned workflow.
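A minimal sketch of what the Spark-based analysis step in this workflow might look like; all paths, field names and the chosen measurement are placeholder assumptions. It reads harvested JSON records from the Hadoop File System, computes a simple completeness figure, and writes a CSV report for the downstream R analysis and the web interface.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of a completeness measurement job in the Spark part of the workflow.
public class CompletenessJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("metadata-completeness")
            .getOrCreate();

        // Harvested records, stored as JSON on the Hadoop File System (placeholder path).
        Dataset<Row> records = spark.read().json("hdfs:///metadata/records/*.json");

        // Completeness of one field: share of records where the field is populated.
        long total = records.count();
        long withTitle = records.filter(records.col("title").isNotNull()).count();
        System.out.printf("title completeness: %.4f%n", (double) withTitle / total);

        // Per-provider record counts, written as CSV for the R analysis / web interface.
        records.groupBy("provider").count()
            .write().option("header", "true").csv("hdfs:///metadata/reports/provider-counts");

        spark.stop();
    }
}
```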