Metadata Quality Assurance
Péter Király
peter.kiraly@gwdg.de
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement...
Presentation held at Heyne Haus, Göttingen, 18/12/2015 in the Oberseminar Datenmanagement, Cloud und e-Infrastructure

Published in: Science

  1. Metadata Quality Assurance
     Péter Király, peter.kiraly@gwdg.de
     Heyne Haus, Göttingen, 18/12/2015
     Oberseminar Datenmanagement, Cloud und e-Infrastructure
     Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
  2. Metadata Quality Assurance Framework
     What is metadata?
     • Data about data
     • Specifically: descriptive data about ...
       • digitized (or physical) objects such as paintings, books, photos
       • larger datasets such as research data
     • Provides access points to the underlying data
  3. Why is data quality important?
     • "Fitness for purpose"
     • No metadata, no access to data, no data usage
     • More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015, http://www.w3.org/TR/2015/WD-dwbp-20151217/
  4. Symptoms of bad quality metadata
     • Hard to identify ("What is it?")
     • Hard to distinguish from other records
     • Misleading descriptions
     • Uninterpretable descriptions
     • Missing fields
     • Not reusable (lost original context)
     • Hard to find
  5. Some typical issues
     • Title is not informative
  6. Mixing different data types
     • Numeric
     • RDF resource
  7. Field overuse
     • What is the meaning of the field?
       • identifier
       • relation
       • source
     [Figure: TextGrid OAI-PMH response]
  8. Copy & paste cataloguing
     • Keeping placeholders / templates
  9. Same entity, differently recorded
     • lucas cranach der ältere
     • Cranach, Lucas (der Ältere) [Herstellung]
     • Cranach, Lucas (I) (naar tekening van)
     • Cranach, Lucas vanem (autor)
     Result of entity detection:
     • http://dbpedia.org/resource/Lucas_Cranach_the_Elder
     • http://viaf.org/viaf/49268177/
     • none
  10. Same entity recorded differently
     Different displays, and content:
     • http://dbpedia.org/resource/Lucas_Cranach_the_Elder
     • http://viaf.org/viaf/49268177/
     • none
  11. What to measure?
     [Grid: columns field1, field2, field3, field4; rows doc1, doc2, doc3, doc3]
     An overall value for a record
  12. What to measure?
     [Same grid]
     An overall value for a record set (e.g. a collection from the same source)
  13. What to measure?
     [Same grid]
     An overall value for a field: how do users utilize the field?
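
The three aggregation levels on the "What to measure?" slides (record, record set, field) can be illustrated with a small sketch. It uses completeness (is the field filled?) as the only measurement, which is an assumption made for illustration; the record contents and the helper names `is_filled`, `record_score`, `record_set_score` and `field_score` are hypothetical and do not come from the framework itself.

```python
from statistics import mean

# Field names follow the grid on the slides; the values are made-up examples.
FIELDS = ["field1", "field2", "field3", "field4"]

RECORDS = {
    "doc1": {"field1": "a", "field2": "b", "field3": "c", "field4": "d"},
    "doc2": {"field1": "a", "field2": "", "field3": "c"},
    "doc3": {"field1": "a"},
}


def is_filled(record, field):
    """A field counts as filled if it exists and is not blank."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def record_score(record):
    """Overall value for one record: share of filled fields."""
    return sum(is_filled(record, f) for f in FIELDS) / len(FIELDS)


def record_set_score(records):
    """Overall value for a record set: mean of the record scores."""
    return mean(record_score(r) for r in records.values())


def field_score(records, field):
    """Overall value for a field: share of records that fill it."""
    return sum(is_filled(r, field) for r in records.values()) / len(records)


if __name__ == "__main__":
    for doc_id, record in RECORDS.items():
        print(doc_id, record_score(record))
    print("record set", record_set_score(RECORDS))
    for field in FIELDS:
        print(field, field_score(RECORDS, field))
```

The same record-by-field grid is read along rows for record scores, along columns for field scores, and as a whole for the record set.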
  14. What to measure?
     [Same grid]
     Field group: a group of fields that together supports a given functionality, e.g. display, search, identify, re-use, multilinguality.
  15. Grouping fields by functionalities
     Functionalities (columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality
     Fields (rows), with × marking a supported functionality:
     • dc:title × × × × ×
     • dcterms:alternative × × × ×
     • dc:description × × × × × ×
     • dc:creator × × × ×
     • dc:publisher × ×
     • dc:contributor ×
     Created by Valentine Charles, Europeana Research and Development team
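
A field-group score could be computed along the following lines. This is only a sketch: the assignment of fields to functionalities in `FIELD_GROUPS` is hypothetical (it is not the Europeana table), and `group_scores` is an illustrative helper name.

```python
# Hypothetical field-to-functionality assignment, for illustration only.
FIELD_GROUPS = {
    "Mandatory": ["dc:title", "dc:description"],
    "Searchability": ["dc:title", "dcterms:alternative", "dc:creator"],
    "Identification": ["dc:title", "dc:creator", "dc:publisher"],
}


def group_scores(record, groups=FIELD_GROUPS):
    """Return, per functionality, the share of its fields present in the record."""
    scores = {}
    for name, fields in groups.items():
        present = sum(1 for field in fields if record.get(field))
        scores[name] = present / len(fields)
    return scores


if __name__ == "__main__":
    record = {"dc:title": "Portrait", "dc:creator": "Cranach, Lucas"}
    for name, score in group_scores(record).items():
        print(f"{name}: {score:.2f}")
```

A record can thus score well for one functionality and poorly for another, which a single overall value would hide.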
  16. Metrics
     The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.
     • Completeness
     • Accuracy
     • Conformance to expectations
     • Logical consistency and coherence
     • Accessibility
     • Timeliness
     • Provenance
  17. Data sources
     • Europeana, the European digital library, museum and archive: 48M+ metadata records in EDM (Europeana Data Model) schema
     • TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records
     • Research data from the Göttingen Campus
     • Library catalogue records in MARC (Machine-Readable Cataloging) schema
     • Other open data
  18. Method: collection – measuring – sharing
     • Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
     • Issues:
       • GWDG cloud: 160 GB, Europeana: 300 GB
       • low I/O performance
       • Europeana OAI-PMH is in a "beta" state
       • OAI-PMH requires 10M+ HTTP requests
       • REST API requires 50M+ HTTP requests
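
OAI-PMH harvesting needs many requests because a `ListRecords` response is paged: each page carries a `resumptionToken` that must be sent back to get the next page. The loop below is a minimal sketch of that protocol mechanism using only the Python standard library. It is not the author's harvester; `parse_page`, `harvest` and `http_fetch` are hypothetical names, and the base URL and metadata prefix must be supplied by the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def http_fetch(url):
    """Fetch one OAI-PMH response as text (no retry or throttling here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_page(xml_text):
    """Return the record identifiers on a page and the token for the next page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.iterfind(
            ".//oai:record/oai:header/oai:identifier", OAI_NS
        )
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    token = None
    if token_element is not None and token_element.text:
        token = token_element.text.strip() or None
    return identifiers, token


def harvest(base_url, metadata_prefix, fetch=http_fetch):
    """Yield record identifiers page by page until no resumption token is left."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        identifiers, token = parse_page(fetch(base_url + "?" + urlencode(params)))
        yield from identifiers
        if not token:
            break
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Passing the fetch function as a parameter keeps the paging logic testable without network access; a real harvester would also need error handling and would store the full records rather than only their identifiers.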
  19. Method: collection – measuring – sharing
     Measuring records
     • Big data, so it should be scalable
     • Apache Hadoop: MapReduce and friends
     • Pluggable architecture: "meters"
     • UI: set parameters for meters
     • input: records, schema, meters, config files
     • output:
       • identifier, projected metadata fields
       • metric1, metric2, metric3 ... metricN
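
The pluggable "meters" idea can be sketched as a registry of per-record functions whose results are written out as one line per record (identifier followed by metric values). Everything here is an illustrative assumption: the `meter` decorator, the two example meters, the field list and the CSV layout are hypothetical, and the sketch runs in plain Python rather than on Hadoop.

```python
import csv
import io

# Registry of available meters: name -> function(record) -> float.
METERS = {}


def meter(name):
    """Decorator that registers a function as a meter under the given name."""
    def register(func):
        METERS[name] = func
        return func
    return register


@meter("completeness")
def completeness(record):
    # Hypothetical list of expected fields.
    fields = ["dc:title", "dc:description", "dc:creator"]
    return sum(1 for field in fields if record.get(field)) / len(fields)


@meter("title_length")
def title_length(record):
    return float(len(str(record.get("dc:title") or "")))


def measure(records, meter_names):
    """Return CSV text with one row per record: identifier, metric1 ... metricN."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in records:
        writer.writerow([record["id"]] + [METERS[n](record) for n in meter_names])
    return out.getvalue()


if __name__ == "__main__":
    sample = [{"id": "r1", "dc:title": "Abc"}, {"id": "r2"}]
    print(measure(sample, ["completeness", "title_length"]), end="")
```

Because each meter looks at one record in isolation, the body of the loop in `measure` is the kind of work that could be moved into a map step when the record set is too large for one machine.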
  20. Method: collection – measuring – sharing
     Statistical analysis
     • Calculating descriptive statistics with R/Julia/other tool
     • Derivation of numbers representing collections and fields from the record level measurements
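
Deriving collection-level numbers from record-level measurements amounts to grouping the scores and computing descriptive statistics per group. The slide names R or Julia for this step; the sketch below uses Python's standard `statistics` module instead, purely for illustration, and the `summarize` helper and its input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize(rows):
    """Group (collection_id, score) pairs and describe each collection."""
    by_collection = defaultdict(list)
    for collection_id, score in rows:
        by_collection[collection_id].append(score)
    return {
        collection_id: {
            "n": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
            "stdev": pstdev(scores),  # population standard deviation
        }
        for collection_id, scores in by_collection.items()
    }


if __name__ == "__main__":
    rows = [("A", 0.5), ("A", 1.0), ("B", 0.25)]
    for collection_id, stats in summarize(rows).items():
        print(collection_id, stats)
```

The same grouping works for fields: replace the collection identifier with a field name and the score with a per-record field measurement.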
  21. Method: collection – measuring – sharing
     [Figure: Completeness of 3 collections. Labels: 2 response types; best in collection; worst in collection; similar records; heterogeneous records; different manifestations]
  22. Method: collection – measuring – sharing
     Outputs
     • Display results in an interactive dashboard
     • REST API to share the raw data
     Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
  23. Method: collection – measuring – sharing
     Data Quality Vocabulary (W3C Working Draft)
     http://w3c.github.io/dwbp/vocab-dqg.html

         :myDatasetDistribution
             dqv:hasQualityMeasure :measure1, :measure2 .

         :measure1
             a dqv:QualityMeasure ;
             dqv:computedOn :myDatasetDistribution ;
             dqv:hasMetric :csvAvailabilityMetric ;
             dqv:value "1.0"^^xsd:double .

         :measure2
             a dqv:QualityMeasure ;
             dqv:computedOn :myDatasetDistribution ;
             dqv:hasMetric :csvConsistencyMetric ;
             dqv:value "0.5"^^xsd:double .
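
Measurement results could be serialized into statements of this shape with plain string formatting. The sketch below is an assumption-laden illustration: `dqv_turtle` is a hypothetical helper, it reuses the property names exactly as they appear on the slide (the vocabulary was still a working draft), and it omits prefix declarations just as the slide does.

```python
def dqv_turtle(distribution, metrics):
    """Render {metric_name: value} as DQV-style Turtle statements.

    `distribution` and the metric names are prefixed names such as
    ":myDatasetDistribution"; `metrics` must not be empty.
    """
    names = [f":measure{i}" for i in range(1, len(metrics) + 1)]
    lines = [f"{distribution} dqv:hasQualityMeasure {', '.join(names)} ."]
    for name, (metric, value) in zip(names, metrics.items()):
        lines += [
            f"{name} a dqv:QualityMeasure ;",
            f"    dqv:computedOn {distribution} ;",
            f"    dqv:hasMetric {metric} ;",
            f'    dqv:value "{float(value)}"^^xsd:double .',
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(dqv_turtle(
        ":myDatasetDistribution",
        {":csvAvailabilityMetric": 1.0, ":csvConsistencyMetric": 0.5},
    ))
```

For anything beyond a demonstration, an RDF library would be a safer choice than string formatting, since it handles escaping and prefixes.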
  24. What is it good for?
     • Improve the metadata
     • Improve the metadata schema and its documentation
     • Propagate "good practice"
     • Improve services: "good" data is ranked higher in the search result list
     Specifically for GWDG:
     • Could be built into current and planned data management / data archiving tools
  25. Further steps
     • Define meters by Domain Specific Language
     • Pattern discovery, machine learning, clustering
     • Connectors for data sources
     • "Jenkins for data publication"
     [Diagram labels: Problem catalogue, Data source, Schema, Metadata QA, Report]
  26. Follow me
     • Project plan and blog: http://pkiraly.github.io
     • Software development:
       • https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service
       • https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
       • https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana's REST API
       • https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
     • @kiru, https://www.linkedin.com/in/peterkiraly
