Metadata quality in cultural heritage institutions
Péter Király {pkiraly@gwdg.de, @kiru, pkiraly.github.io}
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
ReIReS (Research Infrastructure on Religious Studies)
Workshop on FAIR Principle for Digital Research Data Management
Leibniz-Institute of European History, Mainz, 2018-11-28
these slides: http://bit.ly/qa-relres-fair
the problem
https://twitter.com/fxru/status/1052838758066868224
top 20 patterns, ‘date’ field, MoMA collection
Harald Klinke (LMU München) https://twitter.com/HxxxKxxx/status/1066805548866289664
Generic title and bad thumbnail
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Problems with title
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
Same title and description:
title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1",
description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1"
Machine-readable ID in title:
title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor...",
Leftover:
title: "+++EMPTY+++"
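The three title problems above could be detected with a few simple rules. This is a minimal sketch, not the checks used in the Europeana tooling: the function name, the record layout (a dict with `title` and `description`) and both regular expressions are assumptions made for illustration.

```python
import re

def title_problems(record):
    """Return labels for the title problems found in one record.

    `record` is assumed to be a dict with optional 'title' and
    'description' keys (a hypothetical layout, for illustration only).
    """
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    if title and title == description:
        problems.append("same title and description")
    # Assumed pattern for an identifier-like prefix such as "NLD-820630-AMSTERDAM:"
    if re.match(r"[A-Z]{3}-\d{6}-[A-Z]+:", title):
        problems.append("machine-readable ID in title")
    # Assumed pattern for a leftover placeholder such as "+++EMPTY+++"
    if re.fullmatch(r"\++\s*EMPTY\s*\++", title):
        problems.append("leftover placeholder")
    return problems
```

Real collections would need patterns derived from the data itself; the ones here only cover the examples on this slide.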
Measuring metadata quality. Non-informative values
bad: non-informative dc:title:
“photograph, framed”,
“group photograph”,
“photograph”
good: informative dc:title:
“Photograph of Sir Dugald Clerk”,
“Photograph of "Puffing Billy"”
Copy & paste cataloging
from a template?
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
metadata
structured information that describes, explains, locates, or otherwise
represents something else.
NISO (2004)
quality and ‘fitness for purpose’
★ fulfilment of a specification or stated outcomes
★ measured against what is seen to be the goal of the unit
★ achieving institutional mission and objectives
’We know it when we see it, but conveying the full bundle of assumptions and
experience that allow us to identify it is a different matter.’
metadata quality
purpose: to access content
no metadata or bad metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Recommendation, https://www.w3.org/TR/dwbp/
the problem statement – improved
there are “good” and “bad” metadata records
we would like to achieve metrics like this:
functional requirements
good
acceptable
bad
general metrics
★ completeness: number of metadata elements filled out
★ accuracy: data correspond to the resource that is being described
★ consistency: values compliant to what is defined by the metadata scheme
★ objectiveness: values describe the resource in an unbiased way
★ appropriateness: values are facilitating the deployment of search
★ correctness: syntactically and grammatically correct language
Bruce and Hillman (2004); Ochoa and Duval (2009); Palavitsinis (2014)
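Completeness is the simplest of these metrics to compute. The sketch below takes it as the share of expected fields that carry a value; the field list, the record layout and the rule for what counts as "filled" are assumptions for illustration, not a definition from the cited papers.

```python
def has_value(value):
    """A field counts as filled if it holds a non-blank string or a non-empty list."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, set)):
        return any(has_value(item) for item in value)
    return True

def completeness(record, fields):
    """Share of the expected `fields` that are filled in `record` (0.0 to 1.0)."""
    if not fields:
        raise ValueError("the list of expected fields must not be empty")
    filled = sum(1 for field in fields if has_value(record.get(field)))
    return filled / len(fields)

# Hypothetical example: four expected Dublin Core fields, one of them filled.
EXPECTED = ["dc:title", "dc:description", "dc:creator", "dc:date"]
example = {"dc:title": "Photograph of Sir Dugald Clerk", "dc:description": "", "dc:creator": []}
score = completeness(example, EXPECTED)
```

A weighted variant would multiply each field by an importance factor before summing.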
linked data dimensions and metrics
accessibility
★ Availability
★ Licensing
★ Interlinking
★ Security
★ Performance
intrinsic
★ Syntactic validity
★ Semantic accuracy
★ Consistency
★ Conciseness
★ Completeness
contextual
★ Relevancy
★ Trustworthiness
★ Understandability
★ Timeliness
representational
★ Representational conciseness
★ Interoperability
★ Interpretability
★ Versatility
Stvilia et al. (2007); Zaveri et al. (2015)
Good metrics are
★ clear
★ realistic
★ discriminating
★ measurable
★ universal
http://fairmetrics.org – https://github.com/FAIRMetrics/Metrics/blob/master/ALL.pdf
FAIR metrics
F1 – Identifier Uniqueness
What is being measured?
Whether there is a scheme to uniquely identify the digital resource.
How do we measure it?
An identifier scheme is valid if and only if it is described in a repository that can
register and present such identifier schemes (e.g. fairsharing.org).
F1 – Identifier persistence
What is being measured?
Whether there is a policy that describes what the provider will do in the event an
identifier scheme becomes deprecated.
How do we measure it?
Use an HTTP GET on URL provided.
F2 – Machine-readability of metadata
What is being measured?
The availability of machine-readable metadata that describes a digital resource.
How do we measure it?
HTTP GET on the metadata URL. A response of [a 200,202,203 or 206 HTTP
response after resolving all and any prior redirects. e.g. 301→302→200 OK]
indicates that there is indeed a document. The second URL should resolve to the
record of a registered file format (e.g. DCAT, DICOM, schema.org etc.) in a
registry like FAIRsharing. Future enhancements to FAIRsharing may include tags
that indicate whether or not a given file format is generally-agreed to be machine-
readable.
F3 – Resource Identifier in Metadata
What is being measured?
Whether the metadata document contains the globally unique and persistent
identifier for the digital resource.
How do we measure it?
Parsing the metadata for the given digital resource GUID.
F4 – Indexed in a searchable resource
What is being measured?
The degree to which the digital resource can be found using web-based search
engines.
How do we measure it?
We perform an HTTP GET on the URLs provided and attempt to find the
persistent identifier in the page that is returned. A second step might include
following each of the top XX hits and examining the resulting documents for
presence of the identifier.
A2 - Metadata Longevity
What is being measured?
The existence of metadata even in the absence/removal of data
How do we measure it?
Resolve the URL
RDFUnit, SHACL and ShEx
★ Linked Data is based on the Open World assumption
★ No “record”, no clear boundaries
★ RDF Data Shapes: reinventing the schema
★ ShEx (Shape Expressions, https://shex.io) and
SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/)
★ Finding individual data issues
Core constraints
Cardinality: minCount, maxCount
Types of values: class, datatype, nodeKind
Shapes: node, property, in, hasValue
Range of values: minInclusive, maxInclusive, minExclusive, maxExclusive
String based: minLength, maxLength, pattern, stem, uniqueLang
Logical constraints: not, and, or, xone
Closed shapes: closed, ignoredProperties
Property pair constraints: equals, disjoint, lessThan, lessThanOrEquals
Non-validating constraints: name, value, defaultValue
Qualified shapes: qualifiedValueShape, qualifiedMinCount, qualifiedMaxCount
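To show how a few of these constraints behave, here is a plain-Python sketch that checks the values of one property against `minCount`, `maxCount`, `minLength`, `maxLength` and `pattern`. It is not a SHACL engine and does not read shapes graphs; the function name and the dict-based constraint format are assumptions for illustration. `pattern` uses `re.search`, since a SHACL pattern may match anywhere in the value unless it is anchored.

```python
import re

def check_property(values, constraints):
    """Return the names of the violated constraints for one property.

    `values` is the list of string values found for the property,
    `constraints` a dict such as {"minCount": 1, "pattern": r"^\\d{4}$"}.
    """
    violations = []
    # Cardinality constraints look at the number of values.
    if "minCount" in constraints and len(values) < constraints["minCount"]:
        violations.append("minCount")
    if "maxCount" in constraints and len(values) > constraints["maxCount"]:
        violations.append("maxCount")
    # String-based constraints must hold for every single value.
    string_checks = (
        ("minLength", lambda v: len(v) >= constraints["minLength"]),
        ("maxLength", lambda v: len(v) <= constraints["maxLength"]),
        ("pattern", lambda v: re.search(constraints["pattern"], v) is not None),
    )
    for name, test in string_checks:
        if name in constraints and not all(test(v) for v in values):
            violations.append(name)
    return violations
```

For example, a date property limited to one four-digit year would use `{"minCount": 1, "maxCount": 1, "pattern": r"^\d{4}$"}`.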
SHACL with BIBFRAME
Capturing Cataloger Expectations in an RDF Editor. Presentation at SWIB 2018 by S. Folsom, H. Khan, L. Rayle, J. Kovari, R. Younes, S. Warner
https://twitter.com/sf433/status/1067370567303614464
The Quartz guide to bad data (2015)
★ by Christopher Groskopf
★ guide for data journalists on how to recognize data issues
★ practical guide, not an academic paper
★ take-away messages:
○ be sceptical about the data
○ check it with exploratory data analysis
○ check it early, check it often
★ https://github.com/Quartz/bad-data-guide, https://qz.com/572338/the-quartz-guide-to-bad-data/
Issues that your source should solve
★ Values are missing
★ Zeros replace missing values
★ Data you know should be there are missing
★ Rows or values are duplicated
★ Spelling is inconsistent
★ Name order is inconsistent
★ Date formats are inconsistent
★ Units are not specified
★ Categories are badly chosen
★ Field names are ambiguous
★ Provenance is not documented
★ Suspicious numbers are present
★ Data are too coarse
★ Totals differ from published aggregates
★ Spreadsheet has 65536 rows
★ Spreadsheet has dates in 1900 or 1904
★ Text has been converted to numbers
Issues that you should solve
★ Text is garbled
★ Data are in a PDF
★ Data are too granular
★ Data was entered by humans
★ Aggregations were computed on missing values
★ Sample is not random
★ Margin-of-error is too large
★ Margin-of-error is unknown
★ Sample is biased
★ Data has been manually edited
★ Inflation skews the data
★ Natural/seasonal variation skews the data
★ Timeframe has been manipulated
★ Frame of reference has been manipulated
Issues a third-party expert should help you solve
★ Author is untrustworthy
★ Collection process is opaque
★ Data asserts unrealistic precision
★ There are inexplicable outliers
★ An index masks underlying variation
★ Results have been p-hacked
★ Benford’s Law fails
★ It’s too good to be true
Issues a programmer should help you solve
★ Data are aggregated to the wrong categories or geographies
★ Data are in scanned documents
https://www.zotero.org/groups/488224/metadata_assessment
in practice
part II
hypothesis
by measuring structural elements we
can approximate metadata record quality
≃ metadata smell
purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
Measuring Europeana
data aggregation workflow (organizational)
LAM inst. 1, LAM inst. 2, LAM inst. ... → aggregator 1, aggregator ... → Europeana
data aggregation workflow (technical)
Dublin Core, LIDO, EAD, MARC, EDM custom, ... → data transformations → Europeana Data Model (EDM)
organisational proposal
Europeana Data Quality Committee
★ Analysing/revising metadata schema
★ Functional requirement analysis
★ Problem catalog
★ Multilinguality
technical proposal
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
measuring workflow
ingest (OAI-PMH, Europeana API, Hadoop, NoSQL) → json
measure (Spark, Hadoop, Java, Apache Solr) → csv
statistical analysis (Spark, R) → json, png
web interface (PHP, D3.js, highchart.js, NoSQL) → html, svg
What to measure?
★ Structural and semantic features
Completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics)
★ Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
★ Problem catalog
Known metadata problems
metadata requirements / user scenarios
“As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.”
Metadata analysis
Description of relevant metadata elements and their rules
Measurement rules
★ the relevant field values should be resolvable URIs
★ each URI should be associated with labels in multiple languages
measurement
overall view collection view record view
Completeness
Field cardinality
Uniqueness
Multilinguality
Language specification
Problem catalog
etc.
links
measurements
aggregated statistics
metrics
multilinguality
0: Text w/o language annotation (dc.subject: Germany)
1: Text with language annotation (dc.subject: Germany@en)
2: Text with several language annotations (dc.subject: Germany@en, Deutschland@de)
n: Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
multilinguality – details
<#record> a ore:Proxy ;
dc:subject "Ballet", "Opera" .
<#record> a ore:Proxy ; edm:europeanaProxy true ;
dc:subject <http://data.europeana.eu/concept/base/264>
, <http://data.europeana.eu/concept/base/247> .
<http://data.europeana.eu/concept/base/264> a skos:Concept ;
skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru
, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
<http://data.europeana.eu/concept/base/247>
skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi
, "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
Without dereferencing: 0 distinct languages, 0 tagged literals
With dereferencing: 11 distinct languages, 19 tagged literals, 1.7 literals per language
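The three numbers on this slide can be recomputed from the labels shown. The sketch below is an illustration only: the function name and the `(text, language)` tuple layout are assumptions, and the label lists are copied from the slide rather than fetched by dereferencing the concept URIs.

```python
def multilinguality(literals):
    """Compute the multilinguality figures for a list of (text, language) pairs.

    A language of None stands for a literal without a language tag.
    """
    tagged = [lang for _, lang in literals if lang]
    distinct = set(tagged)
    per_language = len(tagged) / len(distinct) if distinct else 0.0
    return {
        "distinct_languages": len(distinct),
        "tagged_literals": len(tagged),
        "literals_per_language": per_language,
    }

# Provider's proxy: plain literals without language tags.
provider = [("Ballet", None), ("Opera", None)]

# Labels of the two linked concepts, as listed on the slide.
ballet = [("Ballett", "no"), ("बैले", "hi"), ("Ballett", "de"), ("Балет", "be"),
          ("Балет", "ru"), ("Balé", "pt"), ("Балет", "bg"), ("Baletas", "lt"),
          ("Balet", "hr"), ("Balets", "lv")]
opera = [("Opera", "no"), ("ओपेरा (गीतिनाटक)", "hi"), ("Oper", "de"), ("Ooppera", "fi"),
         ("Опера", "be"), ("Опера", "ru"), ("Ópera", "pt"), ("Опера", "bg"),
         ("Opera", "lt")]

before = multilinguality(provider)
after = multilinguality(ballet + opera)
```

The two concepts share ten languages and `fi` occurs only for "Opera", which gives 11 distinct languages over 19 tagged literals.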
a good multilingual example
Descriptive fields
dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en
Subject headings
Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en
"Reichstag"@de
"Reichstag building"@en
canned demo
Measuring library catalogs
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
Part I. Introduction to MARC
❏ MAchine-Readable Cataloging
❏ format and semantic specification
❏ comes from the age of punchcards - information compression
❏ invented in the 1960s
❏ even the lapidary “MARC must die” article* celebrated its 16th anniversary last month, but MARC is still living
❏ “There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.”
* by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
Positional fields - Leader
00928nam a2200265 c 4500
0 1 2
01234 5 6 7 8 9 0 1 2345 6 7 8 9 0 1 2 3
00928|n|a|m| |a|2|2|0026|5| |c| |4|5|0|0
❏ LDR/0-4 Record length: ‘00928’ - a number padded with zeros (max. value: 99999)
❏ LDR/5 Record status: ‘n’ - is a dictionary term, means “new”
❏ LDR/6 Type of record: ‘a’ - is a dictionary term, means “Language material”
❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item”
❏ ...
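The positional reading of the leader can be sketched in a few lines. This is an illustration, not the validator's implementation: the function name is hypothetical and the code tables are deliberately partial, holding only a handful of MARC 21 values.

```python
# Partial code tables: only a few MARC 21 values, for illustration.
RECORD_STATUS = {"c": "Corrected or revised", "d": "Deleted", "n": "New"}
TYPE_OF_RECORD = {"a": "Language material", "c": "Notated music", "e": "Cartographic material"}
BIBLIOGRAPHIC_LEVEL = {"m": "Monograph/Item", "s": "Serial"}

def parse_leader(leader):
    """Split the first positions of a 24-character MARC 21 leader into named parts."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    return {
        # LDR/0-4: zero-padded number
        "record_length": int(leader[0:5]),
        # LDR/5, LDR/6, LDR/7: one-character dictionary terms
        "record_status": RECORD_STATUS.get(leader[5], "undefined code"),
        "type_of_record": TYPE_OF_RECORD.get(leader[6], "undefined code"),
        "bibliographic_level": BIBLIOGRAPHIC_LEVEL.get(leader[7], "undefined code"),
    }

parsed = parse_leader("00928nam a2200265 c 4500")
```

A value missing from a complete code table is exactly the kind of issue a validator would report for a positional field.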
Datafields
repeatable/non-repeatable
Indicator1
Indicator2
Subfield1, ... , Subfieldn
always 1 char long dictionary term
❏ code
❏ value
❏ free text
❏ dictionary term
❏ fixed format (e.g. yymmdd)
❏ fixed format + dictionary terms (d7i2)
❏ fixed positions + dictionary terms
❏ repeatable/non-repeatable
Versions
❏ Changes of the standard
❏ No versioning
❏ New, deleted and changed elements every year
❏ Localized versions
❏ Introducing new fields
❏ Overwriting existing fields
❏ Mixing localized versions
❏ No notion about the localization
❏ 50+ localizations (international, national, consortial)
Handling versions (020, ISBN)
setSubfieldsWithCardinality(
"a", "International Standard Book Number", "NR",
"c", "Terms of availability", "NR",
"q", "Qualifying information", "R",
...
);
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
));
Addressing elements - MARCspec
XML: XPath﹣W3C standard
JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260﹣field
❏ 245^2﹣the second indicator of a field
❏ 700[0]﹣the first instance of a field
❏ 245$c﹣a subfield
❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if character with
position ‘0’ of field 007 equals ‘a’ OR ‘t’.
❏ 020$c{$q=paperback}﹣subfield ‘c’ if subfield ‘q’ equals ‘paperback’.
http://marcspec.github.io/MARCspec/marc-spec.html
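The simpler forms listed above (field, indicator, instance, subfield) can be recognised with one regular expression. The sketch below covers only that subset, under the assumption of numeric tags: it handles neither character positions nor subspecs in curly braces, and it is not a MARCspec implementation. The function name and the returned dict are hypothetical.

```python
import re

# Subset only: three-digit tag, optional [index], then optionally
# either ^indicator (1 or 2) or $subfield (one digit or lowercase letter).
SIMPLE_SPEC = re.compile(
    r"(?P<tag>\d{3})"
    r"(?:\[(?P<index>\d+)\])?"
    r"(?:\^(?P<indicator>[12])|\$(?P<subfield>[0-9a-z]))?"
)

def parse_simple_marcspec(spec):
    """Parse a simple MARCspec string; raise ValueError for anything beyond the subset."""
    match = SIMPLE_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"not supported by this sketch: {spec!r}")
    parts = match.groupdict()
    return {
        "tag": parts["tag"],
        "index": int(parts["index"]) if parts["index"] is not None else None,
        "indicator": int(parts["indicator"]) if parts["indicator"] is not None else None,
        "subfield": parts["subfield"],
    }
```

Conditional expressions such as `245$b{007/0=a|007/0=t}` are rejected on purpose, since they need a real parser.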
record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
validating individual records
./validator [file]
001999999 852 undefined subfield L
https://www.loc.gov/...
002000005 035 undefined subfield 9
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000008 035 undefined subfield 9
https://www.loc.gov/…
summary of errors
./validator --summary [file]
006/01-02 (tag006music01): invalid value ' ' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times)
other options
./validator --marcVersion "GENT" [file]
./validator --format "tsv" [file]
./validator --defaultRecordType "BOOKS" [file]
SEVERE: Error with record '002066968'. Leader/06
(typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm'
./validator --fileName "my-report" [file]
./validator ... [file] | catmandu … | Rscript … | python … | grep ...
viewing/filtering/selecting records
Displaying record with given ID
./formatter --id "002032820" [file]
Displaying records matching a query
./formatter --search '245$c=Shakespeare' [file]
Retrieve given elements
./formatter --selector '245$c' [file]
variation to weighted completeness
Thompson and Traill (2017)
calculating Thompson-Traill completeness
./tt-completeness [options] [file]
output:
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
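The CSV output is easy to post-process. In the sample rows the `total` column equals the sum of the component scores; that is an observation about these rows, not a documented guarantee of the tool. The sketch below reads two of the rows and checks it. The function name is hypothetical.

```python
import csv
import io

# Header and two rows copied from the sample output above.
SAMPLE = (
    "id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,"
    "Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,"
    "Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total\n"
    '"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4\n'
    '"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7\n'
)

def totals_match(text):
    """Map each record id to True if 'total' equals the sum of the other scores."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        components = [int(value) for key, value in row.items()
                      if key not in ("id", "total")]
        result[row["id"]] = sum(components) == int(row["total"])
    return result

checked = totals_match(SAMPLE)
```

The same loop could feed the scores into a statistical package or a clustering step.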
K-means clustering
Spark (Scala)
increasing the number of clusters decreases the distance from the centroids
after a point this gain is not so big (“elbow effect”) -- in theory
big number of low quality records
small clusters with ‘in between’ quality records
the acceptable average
clusters with good quality records
Indexing with Solr
How to name the fields?
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
Faceted search interface
accessing every record element
Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
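Many of the variants in this facet differ only in case, spacing or the conjunction used, so a light normalisation already groups them. The sketch below is one possible approach, not what the Solr interface does. The conjunction list and function names are assumptions, and real spelling errors ("Vandenhoek", "Reprecht") still need a similarity measure such as the one from `difflib`.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Assumed list of tokens that all stand for "and" in publisher names.
CONJUNCTIONS = {"und", "u.", "et", "&"}

def normalize(name):
    """Lowercase, isolate '&' as its own token and unify the conjunctions."""
    tokens = name.lower().replace("&", " & ").split()
    return " ".join("&" if token in CONJUNCTIONS else token for token in tokens)

def group_variants(names):
    """Group the names that share the same normalised form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)

def similarity(a, b):
    """Similarity ratio (0.0 to 1.0) of two names after normalisation."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

facet_values = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht", "Vandenhoeck u. Ruprecht",
    "Vandenhoeck et Ruprecht", "V&R unipress", "V&R Unipress",
    "V & R Unipress", "V & R unipress",
]
groups = group_variants(facet_values)
```

With these eight values the result holds two groups, keyed `vandenhoeck & ruprecht` and `v & r unipress`.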
Usage in DH
Benjamin Schmidt (2017) A brief visual history of MARC cataloging at the Library of Congress.
http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html
1. extract fields from MARC
2. data cleaning
3. visualize with R
Extract data
./formatter --selector "260c;008~0-5" [file] > dates.tsv
or put into a cleaning pipeline
./formatter --selector "260c;008~0-5" [file] \
| sed ... | grep ... | awk ... \
> dates.tsv
260c     008~0-5
1977.    780804
1977.    781121
[1973].  740215
after cleaning:
publication  record
1977         1978-08-04
1977         1978-11-21
1973         1974-02-15
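The cleaning step between the two tables can be sketched as follows. The slides leave the actual `sed`/`grep`/`awk` commands elided, so this Python version is a substitute with assumed function names. It relies on `strptime`'s `%y` rule (69-99 become 1969-1999, 00-68 become 2000-2068), which is itself an assumption about how two-digit years in 008/00-05 should be read.

```python
import re
from datetime import datetime

def clean_publication_year(value):
    """Pull the first four-digit year out of a 260$c value such as '[1973].'."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None

def parse_date_entered(yymmdd):
    """Turn the yymmdd value of 008/00-05 into an ISO date string.

    Assumption: the century comes from strptime's %y pivot
    (69-99 -> 19xx, 00-68 -> 20xx).
    """
    return datetime.strptime(yymmdd, "%y%m%d").date().isoformat()

raw_rows = [("1977.", "780804"), ("1977.", "781121"), ("[1973].", "740215")]
cleaned = [(clean_publication_year(pub), parse_date_entered(rec)) for pub, rec in raw_rows]
```

Values without any four-digit year come back as `None`, and invalid dates raise `ValueError`; both cases can then be counted as data issues.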
Filtering out extreme values
data %>%
filter(publication > 2018) %>%
arrange(desc(publication))
publication record
<int> <int>
1 5732 1990
2 4185 2013
3 2201 2012
4 2030 2015
5 2022 2016
6 2020 2011
7 2019 2015
cataloging frontline
intensive backward cataloging - maybe importing?
backward cataloging is still intensive, the tendency continues
peak is > 13K
2000-07-10, the “golden day”: 95K new records
forward cataloging
reproducibility of science
❏ accessing users (first one: Gent)
❏ making usage easy (downloadable binaries, helper scripts, documentation)
❏ distribution via Maven Central
❏ continuous integration (Travis CI)
❏ code coverage report
❏ list of freely reusable library catalogs
❏ licencing (GPL-3.0)
available catalogs to measure
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden-Württemberg
❏ Bibliotheksverbund Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British Library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources
future work
❏ implementing more validation rules
❏ visual dashboard
❏ communication with catalogers
❏ writing articles/dissertation
authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
Kris Coremans is missing!
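Once the names are known, the comparison is a set difference. In the sketch below the names from the responsibility statement are listed by hand, because pulling them out of the free text is the hard part and is not attempted here. The function names are hypothetical, and the "Last, First" inversion is a simplification that would fail on more complex headings.

```python
def display_form(heading):
    """Turn an inverted heading ('Coussement, Toon') into direct order ('Toon Coussement')."""
    if "," in heading:
        last, first = (part.strip() for part in heading.split(",", 1))
        return f"{first} {last}"
    return heading.strip()

def missing_authorities(names, headings):
    """Names from the responsibility statement that have no matching authority entry."""
    known = {display_form(heading).lower() for heading in headings}
    return [name for name in names if name.lower() not in known]

# Names read off the responsibility statement by hand (not extracted automatically).
statement_names = ["Herr Seele", "Toon Coussement", "Peter Claes",
                   "Kris Coremans", "Hera Van Sande"]
authority_entries = ["Herr Seele", "Coussement, Toon", "Claes, Peter", "Van Sande, Hera"]

missing = missing_authorities(statement_names, authority_entries)
```

For this record the only name without an authority entry is Kris Coremans.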
everything else
… at least regarding this project
https://github.com/pkiraly/metadata-qa-marc
http://pkiraly.github.io
https://twitter.com/kiru
peter.kiraly@gwdg.de
Péter Király
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
Péter Király
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
Péter Király
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Péter Király
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
Péter Király
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Péter Király
 
Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)
Péter Király
 
SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)
Péter Király
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)
Péter Király
 
Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)
Péter Király
 

More from Péter Király (20)

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)
 
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
 
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)
 
SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)
 
Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)
 

Recently uploaded

DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 

Recently uploaded (20)

DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 

Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)

  • 1. Metadata quality in cultural heritage institutions Péter Király {pkiraly@gwdg.de, @kiru, pkiraly.github.io} Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) ReIReS (Research Infrastructure on Religious Studies) Workshop on FAIR Principle for Digital Research Data Management Leibniz-Institute of European History, Mainz, 2018-11-28 these slides: http://bit.ly/qa-relres-fair
  • 3. top 20 patterns, ‘date’ field, MoMA collection. Harald Klinke (LMU München), https://twitter.com/HxxxKxxx/status/1066805548866289664
  • 4. Generic title and bad thumbnail. More examples in Report and Recommendations from the Task Force on Metadata Quality (2015).
  • 5. Multilinguality problem ★ Mona Lisa → 456 results ★ La Gioconda → 365 results ★ La Joconde → 71 results. http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
  • 6. Problems with title. Same title and description: title: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1", description: "VOETBAL-EREDIVISIE-FEYENOORD - GO AHEAD 3-1". Machine-readable ID in title: title: "NLD-820630-AMSTERDAM: Straatmuzikanten proberen geld te verdienen voor...". Leftover: title: "+++EMPTY+++". More examples in Report and Recommendations from the Task Force on Metadata Quality (2015).
  • 7. Measuring metadata quality. Non-informative values. Bad (non-informative dc:title): “photograph, framed”, “group photograph”, “photograph”. Good (informative dc:title): “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”.
  • 8. Copy & paste cataloging: from a template? More examples in Report and Recommendations from the Task Force on Metadata Quality (2015).
  • 9. metadata: structured information that describes, explains, locates, or otherwise represents something else. NISO (2004)
  • 10. quality and ‘fitness for purpose’ ★ fulfilment of a specification or stated outcomes ★ measured against what is seen to be the goal of the unit ★ achieving institutional mission and objectives. ‘We know it when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter.’
  • 11. metadata quality. Purpose: to access content. No metadata or bad metadata → no access to data → no data usage. More explanation: Data on the Web Best Practices, W3C Recommendation, https://www.w3.org/TR/dwbp/
  • 12. the problem statement – improved. There are “good” and “bad” metadata records; we would like to achieve metrics like this: each functional requirement rated good, acceptable or bad.
  • 13. general metrics ★ completeness: number of metadata elements filled out ★ accuracy: data correspond to the resource that is being described ★ consistency: values comply with what is defined by the metadata scheme ★ objectiveness: values describe the resource in an unbiased way ★ appropriateness: values facilitate the deployment of search ★ correctness: syntactically and grammatically correct language. Bruce and Hillmann (2004); Ochoa and Duval (2009); Palavitsinis (2014)
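The completeness metric listed above can be illustrated with a minimal sketch. It assumes a record is a flat mapping from field names to string values and that the list of expected fields is supplied by the caller; the `completeness` helper and the sample record are hypothetical and not taken from any tool named in these slides.

```python
def completeness(record, expected_fields):
    """Share of expected fields that carry a non-empty value (0.0 to 1.0)."""
    if not expected_fields:
        raise ValueError("expected_fields must not be empty")
    filled = 0
    for field in expected_fields:
        value = record.get(field)
        # Treat None, empty strings and whitespace-only strings as missing.
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(expected_fields)


# Hypothetical record: only dc:title is filled out of four expected fields.
record = {
    "dc:title": "Photograph of Sir Dugald Clerk",
    "dc:creator": "",
    "dc:date": None,
}
score = completeness(record, ["dc:title", "dc:creator", "dc:date", "dc:subject"])
```

A real measurement would probably weight fields differently (a missing title matters more than a missing format), which this sketch deliberately ignores.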
  • 14. linked data dimensions and metrics. accessibility: ★ Availability ★ Licensing ★ Interlinking ★ Security ★ Performance. intrinsic: ★ Syntactic validity ★ Semantic accuracy ★ Consistency ★ Conciseness ★ Completeness. contextual: ★ Relevancy ★ Trustworthiness ★ Understandability ★ Timeliness. representational: ★ Representational conciseness ★ Interoperability ★ Interpretability ★ Versatility. Stvilia et al. (2007); Zaveri et al. (2015)
  • 15. FAIR metrics. The good metrics are ★ clear ★ realistic ★ discriminating ★ measurable ★ universal. http://fairmetrics.org – https://github.com/FAIRMetrics/Metrics/blob/master/ALL.pdf
  • 17. F1 – Identifier Uniqueness. What is being measured? Whether there is a scheme to uniquely identify the digital resource. How do we measure it? An identifier scheme is valid if and only if it is described in a repository that can register and present such identifier schemes (e.g. fairsharing.org).
  • 18. F1 – Identifier persistence. What is being measured? Whether there is a policy that describes what the provider will do in the event an identifier scheme becomes deprecated. How do we measure it? Use an HTTP GET on the URL provided.
  • 19. F2 – Machine-readability of metadata. What is being measured? The availability of machine-readable metadata that describes a digital resource. How do we measure it? HTTP GET on the metadata URL. A 200, 202, 203 or 206 HTTP response, after resolving any prior redirects (e.g. 301→302→200 OK), indicates that there is indeed a document. The second URL should resolve to the record of a registered file format (e.g. DCAT, DICOM, schema.org etc.) in a registry like FAIRsharing. Future enhancements to FAIRsharing may include tags that indicate whether or not a given file format is generally agreed to be machine-readable.
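The status-code rule for F2 can be sketched without any network access, assuming the chain of status codes has already been collected by an HTTP client. The success codes are the ones named on the slide; the set of redirect codes and the function name `resolves_to_document` are assumptions made for illustration.

```python
SUCCESS_CODES = {200, 202, 203, 206}        # codes named on the slide
REDIRECT_CODES = {301, 302, 303, 307, 308}  # assumption: which codes count as redirects


def resolves_to_document(status_chain):
    """True if the chain is zero or more redirects followed by an accepted success code."""
    if not status_chain:
        return False
    *redirects, final = status_chain
    return all(code in REDIRECT_CODES for code in redirects) and final in SUCCESS_CODES


ok = resolves_to_document([301, 302, 200])   # the slide's example chain
broken = resolves_to_document([301, 404])    # redirect that ends in an error
```

Keeping the decision separate from the HTTP call makes the rule easy to test; the second half of the metric (checking the format against a registry such as FAIRsharing) is not covered here.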
  • 20. F3 – Resource Identifier in Metadata. What is being measured? Whether the metadata document contains the globally unique and persistent identifier for the digital resource. How do we measure it? Parsing the metadata for the given digital resource GUID.
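A minimal sketch of the F3 check, assuming the metadata document is JSON and that an exact string match on any value counts as "containing" the identifier. Both assumptions are illustrative; the identifier and the document below are made-up placeholders, and `contains_identifier` is a hypothetical helper.

```python
import json


def contains_identifier(metadata_json, guid):
    """True if any string value in the JSON document equals the given identifier."""
    def walk(node):
        if isinstance(node, dict):
            return any(walk(value) for value in node.values())
        if isinstance(node, list):
            return any(walk(item) for item in node)
        return isinstance(node, str) and node.strip() == guid

    return walk(json.loads(metadata_json))


# Placeholder metadata document with a made-up identifier.
document = json.dumps({
    "@id": "https://doi.org/10.1234/abc",
    "name": "Example dataset",
    "distribution": [{"contentUrl": "http://example.org/file.csv"}],
})
found = contains_identifier(document, "https://doi.org/10.1234/abc")
missing = contains_identifier(document, "https://doi.org/10.9999/zzz")
```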
  • 21. F4 – Indexed in a searchable resource. What is being measured? The degree to which the digital resource can be found using web-based search engines. How do we measure it? We perform an HTTP GET on the URLs provided and attempt to find the persistent identifier in the page that is returned. A second step might include following each of the top XX hits and examining the resulting documents for the presence of the identifier.
  • 22. A2 - Metadata Longevity. What is being measured? The existence of metadata even in the absence/removal of data. How do we measure it? Resolve the URL.
  • 23. RDFUnit, SHACL and ShEx ★ Linked Data is based on the Open World assumption ★ No “record”, no clear boundaries ★ RDF Data Shapes: reinventing the schema ★ ShEx (Shape Expressions, https://shex.io) and SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/) ★ Finding individual data issues
  • 24. Core constraints
      Cardinality: minCount, maxCount
      Types of values: class, datatype, nodeKind
      Shapes: node, property, in, hasValue
      Range of values: minInclusive, maxInclusive, minExclusive, maxExclusive
      String based: minLength, maxLength, pattern, stem, uniqueLang
      Logical constraints: not, and, or, xone
      Closed shapes: closed, ignoredProperties
      Property pair constraints: equals, disjoint, lessThan, lessThanOrEquals
      Non-validating constraints: name, value, defaultValue
      Qualified shapes: qualifiedValueShape, qualifiedMinCount, qualifiedMaxCount
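To make a few of these constraints concrete, here is a sketch in plain Python rather than a real SHACL engine. It covers only minCount, maxCount, minLength, maxLength and pattern, applied to the list of string values of one property; the function `check_property` and its keyword names are assumptions made for illustration.

```python
import re


def check_property(values, min_count=None, max_count=None,
                   min_length=None, max_length=None, pattern=None):
    """Return the names of the violated constraints for one property's values."""
    violations = []
    if min_count is not None and len(values) < min_count:
        violations.append("minCount")
    if max_count is not None and len(values) > max_count:
        violations.append("maxCount")
    if min_length is not None and any(len(v) < min_length for v in values):
        violations.append("minLength")
    if max_length is not None and any(len(v) > max_length for v in values):
        violations.append("maxLength")
    # Unanchored regular-expression search; anchor with ^ and $ for a full match.
    if pattern is not None and any(re.search(pattern, v) is None for v in values):
        violations.append("pattern")
    return violations


missing_title = check_property([], min_count=1)
valid_dates = check_property(["1836-1902"], max_count=1, pattern=r"^\d{4}-\d{4}$")
```

A real validator would also report which value failed and on which node; this sketch only shows how each constraint reduces to a simple test on the value list.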
  • 26. The Quartz guide to bad data (2015) ★ by Christopher Groskopf ★ guide for data journalists about how to recognize data issues ★ practical guide, not an academic paper ★ take-away messages: ○ be sceptical about the data ○ check it with exploratory data analysis ○ check it early, check it often ★ https://github.com/Quartz/bad-data-guide, https://qz.com/572338/the-quartz-guide-to-bad-data/
  • 27. Issues that your source should solve ★ Values are missing ★ Zeros replace missing values ★ Data are missing you know should be there ★ Rows or values are duplicated ★ Spelling is inconsistent ★ Name order is inconsistent ★ Date formats are inconsistent ★ Units are not specified ★ Categories are badly chosen ★ Field names are ambiguous ★ Provenance is not documented ★ Suspicious numbers are present ★ Data are too coarse ★ Totals differ from published aggregates ★ Spreadsheet has 65536 rows ★ Spreadsheet has dates in 1900 or 1904 ★ Text has been converted to numbers
  • 28. Issues that you should solve ★ Text is garbled ★ Data are in a PDF ★ Data are too granular ★ Data was entered by humans ★ Aggregations were computed on missing values ★ Sample is not random ★ Margin-of-error is too large ★ Margin-of-error is unknown ★ Sample is biased ★ Data has been manually edited ★ Inflation skews the data ★ Natural/seasonal variation skews the data ★ Timeframe has been manipulated ★ Frame of reference has been manipulated
  • 29. Issues a third-party expert should help you solve ★ Author is untrustworthy ★ Collection process is opaque ★ Data asserts unrealistic precision ★ There are inexplicable outliers ★ An index masks underlying variation ★ Results have been p-hacked ★ Benford’s Law fails ★ It’s too good to be true
  • 30. Issues a programmer should help you solve ★ Data are aggregated to the wrong categories or geographies ★ Data are in scanned documents
  • 33. hypothesis: by measuring structural elements we can approximate metadata record quality ≃ metadata smell
  • 34. purposes ★ improve the metadata ★ services: good data → reliable functions ★ better metadata schema & documentation ★ propagate “good practice”
  • 36. data aggregation workflow (organizational): LAM institutions (LAM inst. 1, LAM inst. 2, ...) → aggregators (aggregator 1, ...) → Europeana
  • 37. data aggregation workflow (technical): Dublin Core, LIDO, EAD, MARC, EDM custom, ... → data transformations → Europeana Data Model (EDM)
  • 38. organisational proposal: Europeana Data Quality Committee ★ Analysing/revising metadata schema ★ Functional requirement analysis ★ Problem catalog ★ Multilinguality
  • 39. technical proposal: “Metadata Quality Assurance Framework”, a generic tool for measuring metadata quality ★ adaptable to different metadata schemes ★ scalable (to Big Data) ★ understandable reports for data curators ★ open source
  • 40. measuring workflow: ingest (★ OAI-PMH ★ Europeana API ★ Hadoop ★ NoSQL) → json → measure (★ Spark ★ Hadoop ★ Java ★ Apache Solr) → csv → statistical analysis (★ Spark ★ R) → json, png → web interface (★ PHP ★ D3.js ★ highchart.js ★ NoSQL) → html, svg
  • 41. What to measure? ★ Structural and semantic features: completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics) ★ Functional requirement analysis / Discovery scenarios: requirements of the most important functions ★ Problem catalog: known metadata problems
  • 42. metadata requirements / user scenarios. “As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.” Metadata analysis: description of relevant metadata elements and their rules. Measurement rules: ★ the relevant field values should be resolvable URIs ★ each URI should be associated with labels in multiple languages
  • 43. measurement. Views: overall view, collection view, record view. Metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog, etc. (diagram labels: links, measurements, aggregated statistics, metrics)
  • 44. multilinguality. 0: Text w/o language annotation (dc.subject: Germany). 1: Text with language annotation (dc.subject: Germany@en). 2: Text with several language annotations (dc.subject: Germany@en, Deutschland@de). n: Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
  • 45. multilinguality – details
      Provider proxy (0 distinct languages, 0 tagged literals):
        <#record> a ore:Proxy ;
          dc:subject "Ballet", "Opera" .
      Europeana proxy, after dereferencing (11 distinct languages, 19 tagged literals, 1.7 literals per language):
        <#record> a ore:Proxy ;
          edm:europeanaProxy true ;
          dc:subject <http://data.europeana.eu/concept/base/264> ,
                     <http://data.europeana.eu/concept/base/247> .
        <http://data.europeana.eu/concept/base/264> a skos:Concept ;
          skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
                         "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
        <http://data.europeana.eu/concept/base/247>
          skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi,
                         "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
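The three figures on this slide (distinct languages, tagged literals, literals per language) can be recomputed from the labels shown. The sketch below assumes literals are given as (text, language tag) pairs, with `None` for an untagged literal; the `multilinguality` function is a hypothetical helper, not the framework's own implementation.

```python
def multilinguality(literals):
    """Return (distinct languages, tagged literals, literals per language)."""
    tagged = [(text, lang) for text, lang in literals if lang]
    languages = {lang for _, lang in tagged}
    per_language = len(tagged) / len(languages) if languages else 0.0
    return len(languages), len(tagged), per_language


# skos:prefLabel values of the two dereferenced concepts, as shown on the slide.
ballet = [("Ballett", "no"), ("बैले", "hi"), ("Ballett", "de"), ("Балет", "be"),
          ("Балет", "ru"), ("Balé", "pt"), ("Балет", "bg"), ("Baletas", "lt"),
          ("Balet", "hr"), ("Balets", "lv")]
opera = [("Opera", "no"), ("ओपेरा (गीतिनाटक)", "hi"), ("Oper", "de"), ("Ooppera", "fi"),
         ("Опера", "be"), ("Опера", "ru"), ("Ópera", "pt"), ("Опера", "bg"),
         ("Opera", "lt")]

# Provider proxy: plain literals without language tags.
provider = multilinguality([("Ballet", None), ("Opera", None)])
# Europeana proxy after dereferencing both concepts.
distinct, tagged_count, per_language = multilinguality(ballet + opera)
```

With these inputs the provider proxy yields zeros, and the dereferenced labels yield 11 languages and 19 tagged literals, i.e. 19 / 11 ≈ 1.7 literals per language, matching the slide.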
  • 46. a good multilingual example. Descriptive fields: dc:title "Die Mauer muß weg!"@de, "Die Mauer muß weg! (The Wall must go!)"@en; dc:description "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de, "Annotated images from 1989-1990 in Berlin"@en. Subject headings (Place/skos:prefLabel): "Brandenburger Tor"@de, "Brandenburg Gate"@en; "Grenzübergang Potsdamer Platz"@de, "Postdamer Platz border crossing"@en; "Reichstag"@de, "Reichstag building"@en
  • 63. Measuring library catalogs. Card catalog at Gent University Library, photo: Pieter Morlion, 2010, CC-BY 4.0, https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  • 64. Part I. Introduction to MARC ❏ MAchine-Readable Cataloging ❏ format and semantic specification ❏ comes from the age of punchcards: information compression ❏ invented in the 1960s ❏ even the lapidary “MARC must die” article* celebrated its 16th anniversary last month, but MARC is still alive ❏ “There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.” * by Roy Tennant, http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
  • 65. an example
      LEADER 01136cnm a2200253ui 4500
      001 002032820
      005 20150224114135.0
      008 031117s2003 gw 000 0 ger d
      020    $a3805909810
      100 1  $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
      245 10 $aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
      250    $aNeubearb. 2003$bvon Jörn Eckert
      260    $aBerlin :$bSellier-de Gruyter,$c2003.
      300    $a534 p. ;.
      500    $aCiteertitel: BGB.
      500    $aBandtitel: Staudinger BGB.
      700 1  $aEckert, Jörn
      852 4  $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
  • 66. Positional fields - Leader
      00928nam a2200265 c 4500 (24 characters, positions counted from 0)
      ❏ LDR/0-4 Record length: ‘00928’ is a number padded with zeros (max. value: 99999)
      ❏ LDR/5 Record status: ‘n’ is a dictionary term, means “new”
      ❏ LDR/6 Type of record: ‘a’ is a dictionary term, means “Language material”
      ❏ LDR/7 Bibliographic level: ‘m’ means “Monograph/Item”
      ❏ ...
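Reading the leader is a matter of slicing a fixed-width string. The sketch below decodes only the four elements described on this slide, using the slide's leader value; the dictionary keys and the `parse_leader` name are chosen for illustration.

```python
def parse_leader(leader):
    """Decode the first four elements of a 24-character MARC leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader must be exactly 24 characters long")
    return {
        "record_length": int(leader[0:5]),   # LDR/0-4, zero-padded number
        "record_status": leader[5],          # LDR/5, dictionary term
        "type_of_record": leader[6],         # LDR/6, dictionary term
        "bibliographic_level": leader[7],    # LDR/7, dictionary term
    }


fields = parse_leader("00928nam a2200265 c 4500")
```

Translating the one-letter codes into their meanings ("n" to "new", and so on) would need the code lists from the MARC 21 documentation, which are not reproduced here.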
  • 67. Datafields (repeatable/non-repeatable) consist of: ❏ Indicator1, Indicator2: always 1 char long, a dictionary term ❏ Subfield1, ..., Subfieldn: each has a code and a value, and is repeatable/non-repeatable. Value types: ❏ free text ❏ dictionary term ❏ fixed format (e.g. yymmdd) ❏ fixed format + dictionary terms (d7i2) ❏ fixed positions + dictionary terms
  • 68. Versions ❏ Changes of the standard: no versioning; new, deleted and changed elements every year ❏ Localized versions: introducing new fields; overwriting existing fields; mixing localized versions; no indication of which localization a record follows; 50+ localizations (international, national, consortial)
  • 69. Handling versions (020, ISBN):
      setSubfieldsWithCardinality(
        "a", "International Standard Book Number", "NR",
        "c", "Terms of availability", "NR",
        "q", "Qualifying information", "R",
        ...
      );
      setHistoricalSubfields(
        "b", "Binding information (BK, MP, MU) [OBSOLETE]"
      );
      putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
        new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
      ));
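The idea behind version-specific definitions can be sketched in Python: look up a subfield in the requested localization first, then fall back to the core standard, then to the obsolete elements. The dictionaries and the `resolve_subfield` function are hypothetical and only mirror the 020 example above; they are not the tool's Java API, and the core table contains only the subfields shown on the slide.

```python
# Core definitions for field 020 (ISBN): code -> (label, cardinality).
# Only the subfields shown on the slide are listed.
CORE_020 = {
    "a": ("International Standard Book Number", "NR"),
    "c": ("Terms of availability", "NR"),
    "q": ("Qualifying information", "R"),
}
# Subfields that were once valid; cardinality is unknown here (assumption).
HISTORICAL_020 = {
    "b": ("Binding information (BK, MP, MU) [OBSOLETE]", None),
}
# Extra subfields defined by a localization, keyed by version name.
VERSION_SPECIFIC_020 = {
    "DNB": {"9": ("ISBN mit Bindestrichen", "R")},
}


def resolve_subfield(code, version=None):
    """Return (label, cardinality, source), or None if the code is undefined."""
    local = VERSION_SPECIFIC_020.get(version, {})
    if code in local:
        return (*local[code], version)
    if code in CORE_020:
        return (*CORE_020[code], "core")
    if code in HISTORICAL_020:
        return (*HISTORICAL_020[code], "historical")
    return None


print(resolve_subfield("9"))         # undefined in the core standard
print(resolve_subfield("9", "DNB"))  # defined by the DNB localization
```

With this lookup order the same record can be valid under one localization and invalid under another, which is why the version has to be passed in explicitly.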
  • 70. Addressing elements: MARCspec. XML: XPath, a W3C standard. JSON: JSONPath, by Stefan Gössner (http://goessner.net/articles/JsonPath/). MARC: MARCspec, by Carsten Klee (Zeitschriftendatenbank, Berlin). ❏ 260: a field ❏ 245^2: the second indicator of a field ❏ 700[0]: the first instance of a field ❏ 245$c: a subfield ❏ 245$b{007/0=a|007/0=t}: subfield ‘b’ of field ‘245’, if the character at position ‘0’ of field 007 equals ‘a’ OR ‘t’ ❏ 020$c{$q=paperback}: subfield ‘c’ if subfield ‘q’ equals ‘paperback’. http://marcspec.github.io/MARCspec/marc-spec.html
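The simple address forms listed above are regular enough to parse with one expression. A minimal Python sketch, assuming only the four unconditional forms (field, indicator, indexed field, subfield); `SIMPLE_SPEC` and `parse_simple_spec` are illustrative names, and the sketch deliberately rejects conditions in braces rather than implementing the full MARCspec grammar:

```python
import re

# Simplified subset of MARCspec: tag, optional [index], then either
# ^indicator or $subfield. Conditions in {...} are not supported.
SIMPLE_SPEC = re.compile(
    r"^(?P<tag>[0-9A-Za-z]{3})"
    r"(?:\[(?P<index>\d+)\])?"
    r"(?:\^(?P<indicator>[12])|\$(?P<subfield>[0-9a-z]))?$"
)


def parse_simple_spec(spec):
    """Split a simple spec into tag, index, indicator and subfield."""
    match = SIMPLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported spec: {spec!r}")
    parts = match.groupdict()
    if parts["index"] is not None:
        parts["index"] = int(parts["index"])
    return parts


for spec in ["260", "245^2", "700[0]", "245$c"]:
    print(spec, parse_simple_spec(spec))
```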
  • 71. record validation and quality assurance. Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0, https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
  • 72. Validating individual records:
      ./validator [file]
    Output:
      001999999 852 undefined subfield L https://www.loc.gov/...
      002000005 035 undefined subfield 9 https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000008 035 undefined subfield 9 https://www.loc.gov/…
  • 73. Summary of errors:
      ./validator --summary [file]
    Output:
      006/01-02 (tag006music01): invalid value ' ' (https...) (1 times)
      006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times)
      006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times)
      006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times)
      006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times)
      006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times)
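Per-record messages can be rolled up into a summary by counting identical (field, message) pairs. A minimal Python sketch, assuming the whitespace-separated layout of the per-record output shown on the previous slide (record ID, field, message, trailing URL); the `summarize` function and the sample list are illustrative and do not reproduce how the validator builds its own summary:

```python
from collections import Counter

# Sample lines adapted from the per-record output; the URLs are elided
# in the slides and are left elided here.
LINES = [
    "001999999 852 undefined subfield L https://www.loc.gov/...",
    "002000005 035 undefined subfield 9 https://www.loc.gov/...",
    "002000005 852 undefined subfield L https://www.loc.gov/...",
    "002000005 852 undefined subfield L https://www.loc.gov/...",
    "002000008 035 undefined subfield 9 https://www.loc.gov/...",
]


def summarize(lines):
    """Count how often each (field, message) pair occurs."""
    counts = Counter()
    for line in lines:
        _record_id, field, rest = line.split(maxsplit=2)
        words = rest.split()
        # Drop the trailing documentation URL, if there is one.
        if words and words[-1].startswith("http"):
            words = words[:-1]
        counts[(field, " ".join(words))] += 1
    return counts


for (field, message), n in summarize(LINES).most_common():
    print(f"{field}: {message} ({n} times)")
```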
  • 74. Other options:
      ./validator --marcVersion "GENT" [file]
      ./validator --format "tsv" [file]
      ./validator --defaultRecordType "BOOKS" [file]
        SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm'
      ./validator --fileName "my-report" [file]
      ./validator ... [file] | catmandu … | Rscript … | python … | grep ...
  • 75. Viewing/filtering/selecting records
    Displaying the record with a given ID:
      ./formatter --id "002032820" [file]
    Displaying records matching a query:
      ./formatter --search '245$c=Shakespeare' [file]
    Retrieving given elements:
      ./formatter --selector '245$c' [file]
  • 76. A variation on weighted completeness: Thompson and Traill (2017)
  • 77. Calculating Thompson-Traill completeness:
      ./tt-completeness [options] [file]
    Output:
      id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
      "010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
      "01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
      "010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
      "010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
      "010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
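In the rows shown above, the last column equals the sum of the component scores. The Python sketch below checks that reading on the sample; it is inferred from these five rows only, not from the tool's documentation, and `check_totals` is an illustrative name:

```python
import csv
import io

# The header and rows are copied from the sample output above.
SAMPLE = """
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
"""


def check_totals(text):
    """Map each record ID to whether its component scores sum to 'total'."""
    reader = csv.reader(io.StringIO(text.strip()))
    next(reader)  # skip the header row
    results = {}
    for row in reader:
        components = [int(value) for value in row[1:-1]]
        results[row[0]] = sum(components) == int(row[-1])
    return results


print(check_totals(SAMPLE))
```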
  • 78. K-means clustering with Spark (Scala). Increasing the number of clusters decreases the distance from the centroids; after a point this gain is no longer large (the “elbow effect”), at least in theory. Plot annotations: a big number of low quality records; small clusters with ‘in between’ quality records; the acceptable average; clusters with good quality records.
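The elbow effect can be illustrated without Spark. A minimal pure-Python sketch, assuming one-dimensional quality scores; the `scores` list is made-up illustration data, and `kmeans_1d` is a simplified Lloyd-style implementation with a deterministic start, not the Spark (Scala) job used in the talk:

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups; return (centroids, total squared distance)."""
    values = sorted(values)
    # Deterministic start: spread the initial centroids evenly over the data.
    centroids = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    cost = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, cost


# Hypothetical completeness scores forming three visible groups.
scores = [4, 5, 5, 6, 7, 12, 13, 13, 14, 20, 21, 22]

for k in range(1, 5):
    _, cost = kmeans_1d(scores, k)
    print(k, round(cost, 3))
```

On this data the cost drops sharply up to three clusters and only slightly afterwards, which is the bend the elbow method looks for.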
  • 79. Indexing with Solr. How to name the fields?
    "marc-tags" format:
      "100a_ss": "Jung-Baek, Myong Ja",
      "100ind1_ss": "Surname",
      "245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
    "human-readable" format:
      "MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
      "MainPersonalName_type_ss": "Surname",
      "Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
    "mixed" format:
      "100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
      "100ind1_MainPersonalName_type_ss": "Surname",
      "245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
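The three naming schemes differ only in which parts of an element's address they combine. A minimal Python sketch, assuming each element is described by its tag, its subfield or indicator code, and two human-readable labels; `solr_field_names` is an illustrative function, not the tool's own code:

```python
def solr_field_names(tag, code, field_label, element_label, suffix="_ss"):
    """Build the three candidate Solr field names for one MARC element."""
    return {
        "marc-tags": f"{tag}{code}{suffix}",
        "human-readable": f"{field_label}_{element_label}{suffix}",
        "mixed": f"{tag}{code}_{field_label}_{element_label}{suffix}",
    }


print(solr_field_names("100", "a", "MainPersonalName", "personalName"))
print(solr_field_names("100", "ind1", "MainPersonalName", "type"))
```

The mixed form is the longest, but it keeps the MARC address for catalogers and stays readable for everyone else.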
  • 81. accessing every record element
  • 82. Finding problems with facets. Variants of the same publisher names among the facet values: Vandenhoeck und Ruprecht; Vandenhoeck & Ruprecht; Vandenhoeck u. Ruprecht; Vandenhoeck; Vandenhoek & Ruprecht; Vandenhoek und Ruprecht; Bandenhoed und Ruprecht; Vandenhoeck et Ruprecht; Vandenhoeck & Reprecht; Vandenhoed und Ruprecht; V&R unipress; V&R Unipress; V & R Unipress; V & R unipress
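Some of these variants differ only in case, spacing, or the word used for "and", so a normalized key is enough to group them. A minimal Python sketch, assuming that "und", "u.", "et" and "&" are interchangeable in publisher names; `fingerprint` and `group_variants` are illustrative names. Real misspellings (e.g. "Vandenhoek") stay in separate groups and would need a string-distance measure on top of this:

```python
import re
from collections import defaultdict

FACET_VALUES = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
    "Vandenhoeck u. Ruprecht", "Vandenhoeck",
    "Vandenhoek & Ruprecht", "Vandenhoek und Ruprecht",
    "Bandenhoed und Ruprecht", "Vandenhoeck et Ruprecht",
    "Vandenhoeck & Reprecht", "Vandenhoed und Ruprecht",
    "V&R unipress", "V&R Unipress", "V & R Unipress", "V & R unipress",
]


def fingerprint(name):
    """Normalize case, spacing and the word for 'and' (assumed equivalent)."""
    key = name.lower()
    key = re.sub(r"\b(und|et)\b|\bu\.|&", " & ", key)
    return re.sub(r"\s+", " ", key).strip()


def group_variants(values):
    """Group facet values that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


for key, variants in group_variants(FACET_VALUES).items():
    print(key, "<-", variants)
```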
  • 83. Usage in DH. Benjamin Schmidt (2017): A brief visual history of MARC cataloging at the Library of Congress, http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html. Workflow: 1. extract fields from MARC 2. data cleaning 3. visualize with R
  • 84. Extract data:
      ./formatter --selector "260c;008~0-5" [file] > dates.tsv
    or put it into a cleaning pipeline:
      ./formatter --selector "260c;008~0-5" [file] | sed ... | grep ... | awk ... > dates.tsv
    Raw output:
      260c      008~0-5
      1977.     780804
      1977.     781121
      [1973].   740215
    After cleaning:
      publication   record
      1977          1978-08-04
      1977          1978-11-21
      1973          1974-02-15
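The cleaning step between the two tables above can be sketched in Python instead of sed/grep/awk. Two assumptions are made here and are not stated in the slides: the publication year is taken as the first four-digit run in 260$c, and two-digit years in 008/00-05 are read as 19xx when they are 68 or higher and as 20xx otherwise (the `pivot` parameter is hypothetical):

```python
import re
from datetime import date


def clean_publication_year(raw):
    """Extract the first four-digit year from a 260$c value, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


def clean_record_date(yymmdd, pivot=68):
    """Turn a yymmdd string into a date; the century pivot is an assumption."""
    yy, mm, dd = int(yymmdd[0:2]), int(yymmdd[2:4]), int(yymmdd[4:6])
    century = 1900 if yy >= pivot else 2000
    return date(century + yy, mm, dd)


RAW = [("1977.", "780804"), ("1977.", "781121"), ("[1973].", "740215")]

for published, entered in RAW:
    print(clean_publication_year(published), clean_record_date(entered))
```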
  • 85. Filtering out extreme values:
      data %>% filter(publication > 2018) %>% arrange(desc(publication))
    Output:
        publication record
              <int>  <int>
      1        5732   1990
      2        4185   2013
      3        2201   2012
      4        2030   2015
      5        2022   2016
      6        2020   2011
      7        2019   2015
  • 86. Cataloging frontline. Plot annotations: intensive backward cataloging (maybe importing?); backward cataloging is still intensive, the tendency continues; peak is > 13K; 2000-07-10, the “golden day”: 95K new records; forward cataloging.
  • 88. reproducibility of science ❏ reaching users (first one: Gent) ❏ making it easy to use (downloadable binaries, helper scripts, documentation) ❏ distribution via Maven Central ❏ continuous integration (Travis CI) ❏ code coverage report ❏ list of freely reusable library catalogs ❏ licensing (GPL-3.0)
  • 89. available catalogs to measure ❏ Library of Congress ❏ Harvard University Library ❏ Columbia University Library ❏ Deutsche Nationalbibliothek ❏ Universiteitsbibliotheek Gent ❏ Bibliotheksservice-Zentrum Baden-Württemberg ❏ Bibliotheksverbund Bayern ❏ University of Michigan Library ❏ Toronto Public Library ❏ Leibniz-Informationszentrum Technik und Naturwissenschaften Universitätsbibliothek (TIB) ❏ Répertoire International des Sources Musicales ❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich) ❏ British Library ❏ Talis https://github.com/pkiraly/metadata-qa-marc#datasources
  • 90. future work ❏ implementing more validation rules ❏ visual dashboard ❏ communication with catalogers ❏ writing articles/dissertation
  • 91. Authority entries. Responsibility statement: Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving). Authority entries: ❏ Herr Seele ❏ Coussement, Toon ❏ Claes, Peter ❏ Van Sande, Hera. Kris Coremans is missing!
  • 92. everything else … at least regarding this project: https://github.com/pkiraly/metadata-qa-marc http://pkiraly.github.io https://twitter.com/kiru peter.kiraly@gwdg.de