Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)

Metadata quality in cultural heritage institutions
Péter Király {pkiraly@gwdg.de, @kiru, pkiraly.github.io}
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
ReIReS (Research Infrastructure on Religious Studies)
Workshop on FAIR Principle for Digital Research Data Management
Leibniz-Institute of European History, Mainz, 2018-11-28
these slides: http://bit.ly/qa-relres-fair

the problem
https://twitter.com/fxru/status/1052838758066868224

top 20 patterns, ‘date’ field, MoMA collection
Harald Klinke (LMU München) https://twitter.com/HxxxKxxx/status/1066805548866289664

Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)

Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Problems with title
Same title and description:
title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1",
description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
Machine-readable ID in title:
title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor...",
Leftover:
title: "+++EMPTY+++"
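
The three title problems on this slide can be detected with simple rules. The following is a minimal sketch, not the method used in the talk: the placeholder list, the regular expression for the ID prefix, and the function name `title_problems` are all assumptions made for illustration.

```python
import re

# Placeholder strings treated as leftover values; this list is an assumption.
PLACEHOLDERS = {"+++EMPTY+++"}

# Hypothetical pattern for a machine-readable prefix such as "NLD-820630-AMSTERDAM:".
ID_PREFIX = re.compile(r"^[A-Z]{3}-\d{6}-[A-Z]+:")


def title_problems(record):
    """Return the list of title problems found in a dict-like record."""
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    if title in PLACEHOLDERS:
        problems.append("leftover placeholder")
    if title and title == description:
        problems.append("same title and description")
    if ID_PREFIX.match(title):
        problems.append("machine-readable ID in title")
    return problems


football = "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
print(title_problems({"title": football, "description": football}))
print(title_problems({"title": "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor..."}))
print(title_problems({"title": "+++EMPTY+++"}))
```

Rules like these only catch known patterns, so in practice they would be extended as new problems are found.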

Measuring metadata quality. Non-informative values
non-informative dc:title (bad):
“photograph, framed”,
“group photograph”,
“photograph”
informative dc:title (good):
“Photograph of Sir Dugald Clerk”,
“Photograph of "Puffing Billy"”

Copy & paste cataloging
from a template?

metadata
structured information that describes, explains, locates, or otherwise
represents something else.
NISO (2004)

quality and ‘fitness for purpose’
★ fulfilment of a specification or stated outcomes
★ measured against what is seen to be the goal of the unit
★ achieving institutional mission and objectives
’We know it when we see it, but conveying the full bundle of assumptions and
experience that allow us to identify it is a different matter.’

metadata quality
purpose: to access content
no metadata or bad metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/

the problem statement – improved
there are “good” and “bad” metadata records
we would like to achieve metrics like this:
functional requirements
good
acceptable
bad

general metrics
★ completeness: number of metadata elements filled out
★ accuracy: data correspond to the resource that is being described
★ consistency: values compliant to what is defined by the metadata scheme
★ objectiveness: values describe the resource in an unbiased way
★ appropriateness: values are facilitating the deployment of search
★ correctness: syntactically and grammatically correct language
Bruce and Hillman (2004); Ochoa and Duval (2009); Palavitsinis (2014)
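
As an illustration of the first metric, completeness can be expressed as the share of expected fields that are filled out. This is a minimal sketch under assumptions: the field list, the sample record, and the choice to report a ratio rather than a raw count are not taken from the cited papers.

```python
def completeness(record, fields):
    """Share of the given fields that have a non-empty value in the record."""
    if not fields:
        raise ValueError("fields must not be empty")
    filled = 0
    for field in fields:
        value = record.get(field)
        # Whitespace-only strings count as "not filled out".
        if isinstance(value, str):
            value = value.strip()
        # None, empty strings and empty lists are all falsy here.
        if value:
            filled += 1
    return filled / len(fields)


# Hypothetical field list and record, for illustration only.
FIELDS = ["dc:title", "dc:description", "dc:creator", "dc:date"]
record = {
    "dc:title": "Photograph of Sir Dugald Clerk",
    "dc:description": " ",
    "dc:date": ["1920"],
}
score = completeness(record, FIELDS)
print(score)
```

A weighted variant would multiply each field by an importance factor before summing.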

linked data dimensions and metrics
accessibility
★ Availability
★ Licensing
★ Interlinking
★ Security
★ Performance
intrinsic
★ Syntactic validity
★ Semantic accuracy
★ Consistency
★ Conciseness
★ Completeness
contextual
★ Relevancy
★ Trustworthiness
★ Understandability
★ Timeliness
representational
★ Representational conciseness
★ Interoperability
★ Interpretability
★ Versatility
Stvilia et al. (2007); Zaveri et al. (2015)

The good metrics are
★ clear
★ realistic
★ discriminating
★ measurable
★ universal
http://fairmetrics.org – https://github.com/FAIRMetrics/Metrics/blob/master/ALL.pdf
FAIR metrics

F1 – Identifier Uniqueness
What is being measured?
Whether there is a scheme to uniquely identify the digital resource.
How do we measure it?
An identifier scheme is valid if and only if it is described in a repository that can
register and present such identifier schemes (e.g. fairsharing.org).

F1 – Identifier persistence
Whether there is a policy that describes what the provider will do in the event an
identifier scheme becomes deprecated.
Use an HTTP GET on URL provided.

F2 – Machine-readability of metadata
The availability of machine-readable metadata that describes a digital resource.
HTTP GET on the metadata URL. A response of [a 200,202,203 or 206 HTTP
response after resolving all and any prior redirects. e.g. 301→302→200 OK]
indicates that there is indeed a document. The second URL should resolve to the
record of a registered file format (e.g. DCAT, DICOM, schema.org etc.) in a
registry like FAIRsharing. Future enhancements to FAIRsharing may include tags
that indicate whether or not a given file format is generally-agreed to be machine-readable.

F3 – Resource Identifier in Metadata
Whether the metadata document contains the globally unique and persistent
identifier for the digital resource.
Parsing the metadata for the given digital resource GUID.

F4 – Indexed in a searchable resource
The degree to which the digital resource can be found using web-based search
engines.
We perform an HTTP GET on the URLs provided and attempt to find the
persistent identifier in the page that is returned. A second step might include
following each of the top XX hits and examine the resulting documents for
presence of the identifier.

A2 - Metadata Longevity
The existence of metadata even in the absence/removal of data
Resolve the URL

RDFUnit, SHACL and ShEx
★ Linked Data is based on Open World assumption
★ No “record”, no clear boundaries
★ RDF Data Shapes: reinventing the schema
★ ShEx (Shape Expressions, https://shex.io) and
SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/)
★ Finding individual data issues

Core constraints
Cardinality: minCount, maxCount
Types of values: class, datatype, nodeKind
Shapes: node, property, in, hasValue
Range of values: minInclusive, maxInclusive, minExclusive, maxExclusive
String based: minLength, maxLength, pattern, stem, uniqueLang
Logical constraints: not, and, or, xone
Closed shapes: closed, ignoredProperties
Property pair constraints: equals, disjoint, lessThan, lessThanOrEquals
Non-validating constraints: name, value, defaultValue
Qualified shapes: qualifiedValueShape, qualifiedMinCount, qualifiedMaxCount
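
To give a feel for what the cardinality and string-based constraints check, here is a minimal sketch in plain Python. It is not a SHACL engine and does not parse shapes; the function `check_property` and its parameters are hypothetical names that only mirror the constraint names listed above.

```python
import re


def check_property(values, min_count=0, max_count=None, pattern=None,
                   min_length=None, max_length=None):
    """Check a list of string values against a few SHACL-style constraints.

    Returns the names of the violated constraints (an empty list means valid).
    """
    violations = []
    if len(values) < min_count:
        violations.append("minCount")
    if max_count is not None and len(values) > max_count:
        violations.append("maxCount")
    for value in values:
        if min_length is not None and len(value) < min_length:
            violations.append("minLength")
        if max_length is not None and len(value) > max_length:
            violations.append("maxLength")
        # sh:pattern matches anywhere in the string, hence re.search.
        if pattern is not None and not re.search(pattern, value):
            violations.append("pattern")
    return violations


# A date property that must occur exactly once and look like a four-digit year.
print(check_property(["1977"], min_count=1, max_count=1, pattern=r"^\d{4}$"))
print(check_property(["[1973]."], min_count=1, max_count=1, pattern=r"^\d{4}$"))
```

A real validation would express the same rules as a shape graph and run it with a SHACL or ShEx processor.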

SHACL with BibFRAME
Capturing Cataloger Expectations in an RDF Editor. Presentation at SWIB 2018 by S. Folsom, H. Khan, L. Rayle, J. Kovari, R. Younes, S. Warner
https://twitter.com/sf433/status/1067370567303614464

The Quartz guide to bad data (2015)
★ by Christopher Groskopf
★ a guide for data journalists on how to recognize data issues
★ practical guide, not an academic paper
★ take-away messages:
○ be sceptical about the data
○ check it with exploratory data analysis
○ check it early, check it often
★ https://github.com/Quartz/bad-data-guide, https://qz.com/572338/the-quartz-guide-to-bad-data/

Issues that your source should solve
★ Values are missing
★ Zeros replace missing values
★ Data are missing you know should be there
★ Rows or values are duplicated
★ Spelling is inconsistent
★ Name order is inconsistent
★ Date formats are inconsistent
★ Units are not specified
★ Categories are badly chosen
★ Field names are ambiguous
★ Provenance is not documented
★ Suspicious numbers are present
★ Data are too coarse
★ Totals differ from published aggregates
★ Spreadsheet has 65536 rows
★ Spreadsheet has dates in 1900 or 1904
★ Text has been converted to numbers

Issues that you should solve
★ Text is garbled
★ Data are in a PDF
★ Data are too granular
★ Data was entered by humans
★ Aggregations were computed on missing values
★ Sample is not random
★ Margin-of-error is too large
★ Margin-of-error is unknown
★ Sample is biased
★ Data has been manually edited
★ Inflation skews the data
★ Natural/seasonal variation skews the data
★ Timeframe has been manipulated
★ Frame of reference has been manipulated

Issues a third-party expert should help you solve
★ Author is untrustworthy
★ Collection process is opaque
★ Data asserts unrealistic precision
★ There are inexplicable outliers
★ An index masks underlying variation
★ Results have been p-hacked
★ Benford’s Law fails
★ It’s too good to be true

Issues a programmer should help you solve
★ Data are aggregated to the wrong categories or geographies
★ Data are in scanned documents

https://www.zotero.org/groups/488224/metadata_assessment

in practice
part II

hypothesis
by measuring structural elements we can approximate metadata record quality
≃ metadata smell

purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”

Measuring Europeana

data aggregation workflow (organizational)
LAM inst. 1, LAM inst. 2, LAM inst. ... → aggregator 1, aggregator ... → Europeana

data aggregation workflow (technical)
Dublin Core, LIDO, EAD, MARC, EDM custom, ... → data transformations → Europeana Data Model (EDM)

organisational proposal
Europeana Data Quality Committee
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality

technical proposal
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source

measuring workflow
★ ingest: OAI-PMH, Europeana API, Hadoop, NoSQL → json
★ measure: Spark, Hadoop, Java, Apache Solr → csv
★ statistical analysis: Spark, R → json, png
★ web interface: PHP, D3.js, highchart.js, NoSQL → html, svg

What to measure?
★ Structural and semantic features
Completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics)
★ Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
★ Problem catalog
Known metadata problems

metadata requirements / user scenarios
“As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.”
Metadata analysis
Description of relevant metadata elements and their rules
Measurement rules
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages

measurement
overall view collection view record view
Completeness
Field cardinality
Uniqueness
Multilinguality
Language specification
Problem catalog
etc.
links
measurements
aggregated statistics
metrics

multilinguality
0: Text w/o language annotation (dc.subject: Germany)
1: Text w language annotation (dc.subject: Germany@en)
2: Text w several language annotations (dc.subject: Germany@en, Deutschland@de)
n: Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)

multilinguality – details
<#record> a ore:Proxy ;
  dc:subject "Ballet", "Opera" .
Distinct languages: 0, tagged literals: 0

<#record> a ore:Proxy ; edm:europeanaProxy true ;
  dc:subject <http://data.europeana.eu/concept/base/264>
  , <http://data.europeana.eu/concept/base/247> .
<http://data.europeana.eu/concept/base/264> a skos:Concept ;
  skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru
  , "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
<http://data.europeana.eu/concept/base/247>
  skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi
  , "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
After dereferencing: distinct languages: 11, tagged literals: 19, literals per language: 1.7
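
The three numbers on this slide can be recomputed from the language tags alone. The sketch below is an illustration, not the framework's own code; the function name `multilinguality` and the dictionary keys are assumptions, while the language tags are copied from the two concepts in the example.

```python
from collections import Counter

# Language tags of the skos:prefLabel values of the two concepts in the example.
BALLET = ["no", "hi", "de", "be", "ru", "pt", "bg", "lt", "hr", "lv"]
OPERA = ["no", "hi", "de", "fi", "be", "ru", "pt", "bg", "lt"]


def multilinguality(language_tags):
    """Compute simple multilinguality figures from a list of language tags.

    Untagged literals are represented by None and are ignored.
    """
    counts = Counter(tag for tag in language_tags if tag)
    tagged = sum(counts.values())
    distinct = len(counts)
    per_language = tagged / distinct if distinct else 0.0
    return {
        "distinct_languages": distinct,
        "tagged_literals": tagged,
        "literals_per_language": per_language,
    }


# First proxy: "Ballet" and "Opera" carry no language tag.
untagged = multilinguality([None, None])
# Second proxy: the labels reached by dereferencing the two concept URIs.
dereferenced = multilinguality(BALLET + OPERA)
print(untagged)
print(dereferenced)
```

Rounded to one decimal, the last figure for the dereferenced labels matches the 1.7 literals per language shown on the slide.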

a good multilingual example
Descriptive fields
dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en
Subject headings
Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en
"Reichstag"@de
"Reichstag building"@en

canned demo

Measuring library catalogs
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG

Part I. Introduction to MARC
❏ MAchine Readable Catalog
❏ format and semantic specification
❏ comes from the age of punchcards - information compression
❏ invented in the early 1960s
❏ even the lapidary “MARC must die” article* celebrated its 16th anniversary
last month, but MARC is still living
❏ “There are only two kinds of people who believe themselves able to read a
MARC record without referring to a stack of manuals: a handful of our top
catalogers and those on serious drugs.”
* by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/

an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147

Positional fields - Leader
00928nam a2200265 c 4500
0 1 2
01234 5 6 7 8 9 0 1 2345 6 7 8 9 0 1 2 3
00928|n|a|m| |a|2|2|0026|5| |c| |4|5|0|0
❏ LDR/0-4 Record length: ‘00928’ - a number padded with zeros (max. value: 99999)
❏ LDR/5 Record status: ‘n’ - is a dictionary term, means “new”
❏ LDR/6 Type of record: ‘a’ - is a dictionary term, means “Language material”
❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item”
❏ ...
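
The positional logic of the Leader can be sketched by slicing the 24-character string. This is a minimal illustration, not the validator's implementation; the function name `parse_leader` is hypothetical, and the lookup tables contain only the codes mentioned on this slide.

```python
# Lookup tables limited to the codes shown on the slide (an incomplete sample).
RECORD_STATUS = {"n": "new"}
TYPE_OF_RECORD = {"a": "Language material"}
BIBLIOGRAPHIC_LEVEL = {"m": "Monograph/Item"}


def parse_leader(leader):
    """Decode the first positions of a MARC 21 leader into readable values."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader must be 24 characters long")
    return {
        # LDR/0-4: record length, zero-padded number
        "record_length": int(leader[0:5]),
        # LDR/5, LDR/6, LDR/7: dictionary terms; unknown codes are kept as-is
        "record_status": RECORD_STATUS.get(leader[5], leader[5]),
        "type_of_record": TYPE_OF_RECORD.get(leader[6], leader[6]),
        "bibliographic_level": BIBLIOGRAPHIC_LEVEL.get(leader[7], leader[7]),
    }


parsed = parse_leader("00928nam a2200265 c 4500")
print(parsed)
```

A validator would additionally report any code that is missing from the dictionary for its position.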

Datafields
repeatable/non-repeatable
Indicator1
Indicator2
Subfield1, ... , Subfieldn
always 1 char long dictionary term
❏ code
❏ value
❏ free text
❏ dictionary term
❏ fixed format (e.g. yymmdd)
❏ fixed format + dictionary terms (d7i2)
❏ fixed positions + dictionary terms
❏ repeatable/non-repeatable

Versions
❏ Changes of the standard
❏ No versioning
❏ New, deleted and changed elements every year
❏ Localized versions
❏ Introducing new fields
❏ Overwriting existing fields
❏ Mixing localized versions
❏ No notion about the localization
❏ 50+ localizations (international, national, consortial)

Handling versions (020, ISBN)
setSubfieldsWithCardinality(
"a", "International Standard Book Number", "NR",
"c", "Terms of availability", "NR",
"q", "Qualifying information", "R",
...
);
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
));

Addressing elements - MARCspec
XML: XPath - W3C standard
JSON: JSONPath - by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec - by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260 - field
❏ 245^2 - the second indicator of a field
❏ 700[0] - the first instance of a field
❏ 245$c - a subfield
❏ 245$b{007/0=a|007/0=t} - subfield ‘b’ of field ‘245’, if the character at position ‘0’ of field 007 equals ‘a’ OR ‘t’.
❏ 020$c{$q=paperback} - subfield ‘c’ if subfield ‘q’ equals ‘paperback’.
http://marcspec.github.io/MARCspec/marc-spec.html

record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg

validating individual records
./validator [file]
001999999 852 undefined subfield L
https://www.loc.gov/...
002000005 035 undefined subfield 9
002000008 035 undefined subfield 9
https://www.loc.gov/…

summary of errors
./validator --summary [file]
006/01-02 (tag006music01): invalid value ' ' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times)

other options
./validator --marcVersion "GENT" [file]
./validator --format "tsv" [file]
./validator --defaultRecordType "BOOKS" [file]
SEVERE: Error with record '002066968'. Leader/06
(typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm'
./validator --fileName "my-report" [file]
./validator ... [file] | catmandu … | RScript … | python … | grep ...

viewing/filtering/selecting records
Displaying record with given ID
./formatter --id "002032820" [file]
Displaying records matching a query
./formatter --search '245$c=Shakespeare' [file]
Retrieve given elements
./formatter --selector '245$c' [file]

variation to weighted completeness
Thompson and Traill (2017)

calculating Thompson-Traill completeness
./tt-completeness [options] [file]
output:
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
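
The CSV output can be post-processed with any scripting language. The following sketch reads two of the rows shown above and checks that the last column equals the sum of the element scores, which appears to hold for the sample rows; that interpretation, and the helper name `read_scores`, are assumptions rather than documented behaviour of the tool.

```python
import csv
import io

# Header and two sample rows copied from the output shown above.
SAMPLE = (
    "id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,"
    "Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,"
    "Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total\n"
    '"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4\n'
    '"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7\n'
)


def read_scores(text):
    """Yield (record id, per-element scores, reported total) for each row."""
    for row in csv.DictReader(io.StringIO(text)):
        record_id = row.pop("id")
        total = int(row.pop("total"))
        scores = {name: int(value) for name, value in row.items()}
        yield record_id, scores, total


rows = list(read_scores(SAMPLE))
# Record ids whose reported total differs from the sum of the element scores.
mismatches = [rid for rid, scores, total in rows if sum(scores.values()) != total]
print(mismatches)
```

Keeping the identifier as a string preserves leading zeros and check characters such as the trailing X.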

K-means clustering
Spark (Scala)
increasing the number of clusters decreases the distance from the centroids
after a point this gain is not so big (“elbow effect”), in theory
big number of low quality records
small clusters with ‘in between’ quality records
the acceptable average
clusters with good quality records

Indexing with Solr
How to name the fields?
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
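
The three naming conventions differ only in how the MARC tag and the human-readable label are combined. A minimal sketch of that mapping follows; the function `solr_field_name` and its parameters are hypothetical and are not part of the tool's API.

```python
def solr_field_name(tag, code, label, style="mixed", suffix="_ss"):
    """Build a Solr field name from a MARC tag, a subfield or indicator code,
    and a human-readable label, following one of the three styles above."""
    if style == "marc-tags":
        base = f"{tag}{code}"
    elif style == "human-readable":
        base = label
    elif style == "mixed":
        base = f"{tag}{code}_{label}"
    else:
        raise ValueError(f"unknown style: {style}")
    return base + suffix


for style in ("marc-tags", "human-readable", "mixed"):
    print(solr_field_name("100", "a", "MainPersonalName_personalName", style))
```

The mixed style makes the names longer but lets both catalogers (who think in tags) and other users (who think in labels) find a field.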

accessing every record element

Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
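
Some of these variants differ only in case, spacing, or the word used for "and", and can be grouped with a normalized key. The sketch below is an illustration under assumptions: the connective table and the function names are invented for this example, and misspellings such as "Vandenhoek" are deliberately left as separate groups because catching them would need fuzzy matching.

```python
import re
from collections import defaultdict

# Connectives mapped to one canonical form; the table is an assumption.
CONNECTIVES = {"&": "und", "u.": "und", "et": "und"}


def fingerprint(name):
    """Normalize case, spacing around '&', and connectives."""
    spaced = re.sub(r"\s*&\s*", " & ", name.lower())
    return " ".join(CONNECTIVES.get(token, token) for token in spaced.split())


def group_variants(names):
    """Group facet values that share the same fingerprint."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


facet_values = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
    "Vandenhoeck u. Ruprecht", "Vandenhoeck et Ruprecht",
    "Vandenhoek & Ruprecht",
    "V&R unipress", "V&R Unipress", "V & R Unipress",
]
groups = group_variants(facet_values)
for key, members in groups.items():
    print(key, len(members))
```

In a Solr-based workflow the same normalization could be applied at index time, with the original value kept for display.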

http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html
Usage in DH
Benjamin Schmidt (2017) A brief visual history of MARC cataloging at the Library of Congress.
1. extract fields from MARC
2. data cleaning
3. visualize with R

Extract data
./formatter --selector "260c;008~0-5" [file] > dates.tsv
or put into a cleaning pipeline
./formatter --selector "260c;008~0-5" [file] | sed ... | grep ... | awk ... > dates.tsv

before cleaning:
260c     008~0-5
1977.    780804
1977.    781121
[1973].  740215

after cleaning:
publication  record
1977         1978-08-04
1977         1978-11-21
1973         1974-02-15
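
The cleaning step between the raw and the cleaned table can be sketched as follows. This is not the sed/grep/awk pipeline from the slide, whose details are elided; the function names are hypothetical, and the century of the two-digit year is resolved by Python's `%y` rule (69-99 become 1969-1999, 00-68 become 2000-2068), which is an assumption that may not suit every catalog.

```python
import re
from datetime import datetime


def publication_year(value):
    """Extract the first four-digit year from a 260$c value such as '[1973].'."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


def record_date(yymmdd):
    """Convert 008/00-05 (yymmdd, date entered on file) to an ISO date string.

    The century is chosen by strptime's %y convention, see the note above.
    """
    return datetime.strptime(yymmdd, "%y%m%d").date().isoformat()


raw = [("1977.", "780804"), ("1977.", "781121"), ("[1973].", "740215")]
cleaned = [(publication_year(pub), record_date(rec)) for pub, rec in raw]
for row in cleaned:
    print(row)
```

Values without any four-digit year return None, so they can be filtered out or inspected separately.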

Filtering out extreme values
data %>%
filter(publication > 2018) %>%
arrange(desc(publication))
publication record
<int> <int>
1 5732 1990
2 4185 2013
3 2201 2012
4 2030 2015
5 2022 2016
6 2020 2011
7 2019 2015

cataloging frontline
intensive backward cataloging - maybe importing?
backward cataloging is still intensive, the tendency continues
peak is > 13K
2000-07-10, the “golden day”: 95K new records
forward cataloging

reproducibility of science
❏ accessing users (first one: Gent)
❏ making it easy to use (downloadable binaries, helper scripts, documentation)
❏ distribution via Maven Central
❏ continuous integration (Travis CI)
❏ code coverage report
❏ list of freely reusable library catalogs
❏ licencing (GPL-3.0)

available catalogs to measure
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden-Württemberg
❏ Bibliotheksverbund Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British Library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources

future work
❏ implementing more validation rules
❏ visual dashboard
❏ communication with catalogers
❏ writing articles/dissertation

authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
Kris Coremans is missing!

everything else
… at least regarding this project
https://github.com/pkiraly/metadata-qa-marc
http://pkiraly.github.io
https://twitter.com/kiru
peter.kiraly@gwdg.de
