Metadata Quality Assurance Framework
Péter Király <peter.kiraly@gwdg.de>
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
QQML2016
8th International Conference on Qualitative and Quantitative Methods in Libraries
2016-05-24, London
the problem
there are „good” and „bad” metadata records
Typical issues – non-informative field
• Title is not informative
non-informative: „photograph, framed”, „group photograph”, „photograph”
vs
informative: „Photograph of Sir Dugald Clerk”, „Photograph of "Puffing Billy"”
Typical issues – field overuse
• What is the meaning of the field? (overuse)
TextGrid OAI-PMH response
Why is data quality important?
„Fitness for purpose” (QA principle)
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft 19 May 2016
https://www.w3.org/TR/dwbp/
Europeana Data Quality Committee
• Online collaboration
• Use case documents
• Problem catalog
• Tickets
• Discussion forum
• #EuropeanaDataQuality
• Bi-weekly teleconference
• Bi-yearly face-to-face meeting
• Topics
• Usage scenarios
• Metadata profiles
• Schema modification
• Measuring
• Event model
• Proposals for data providers
What is it good for?
• improve the metadata
• improve services: good data → functions
• improve metadata schema & documentation
• propagate „good practice”
Domains:
• cultural heritage sector
• research data management and archiving
Research hypothesis
hypothesis
by measuring structural elements we
can predict metadata record quality
Research hypothesis
proposed solution
an open source measuring and reporting tool
Metadata Quality Assurance Framework
What to measure?
Measurements
• Schema-independent structural features: existence, cardinality, uniqueness, length, dictionary entry, data type conformance
• Use case scenarios („fit for purpose”): requirements of the most important functions
• Problem catalog: known metadata problems
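The schema-independent features can be computed without knowing what a field means. A minimal sketch, assuming a record is a plain dict that maps field names to lists of string values; the function name `structural_features`, the field names, and the sample record are hypothetical and not taken from the framework itself.

```python
from typing import Dict, List


def structural_features(record: Dict[str, List[str]], field: str) -> Dict[str, object]:
    """Measure schema-independent features of one field in one record."""
    # Ignore values that are empty or whitespace-only.
    values = [v for v in record.get(field, []) if v and v.strip()]
    return {
        "exists": bool(values),                                   # existence
        "cardinality": len(values),                               # number of instances
        "distinct": len(set(values)),                             # uniqueness within the record
        "max_length": max((len(v) for v in values), default=0),   # length
    }


# Hypothetical record, for illustration only.
record = {
    "dc:title": ["Photograph of Sir Dugald Clerk"],
    "dc:subject": ["photograph", "photograph", "portrait"],
    "dc:description": [""],
}

print(structural_features(record, "dc:subject"))
```

Dictionary-entry and data-type checks are left out here because they depend on external vocabularies and type definitions.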
Discovery scenarios and their metadata requirements
Europeana’s most important functions
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
Discovery scenarios and their metadata requirements – Entity-based facets
Scenario
As a user I want to be able to filter by whether a person is the
subject of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM
fields for objects be populated by identifying URIs rather than free
text. These URIs need to be related, at a minimum, to a label for
each of the supported languages.
Measurement rules
• The relevant field values should be resolvable URIs
• Each URI should have labels in multiple languages
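The two measurement rules can be sketched as follows. This is an illustration under stated assumptions: entity labels are available as a dict from URI to labels keyed by language code, and "resolvable" is approximated by a purely syntactic check, since actually dereferencing a URI would need a network request. The names `is_http_uri`, `check_entity_field`, and the example.org data are hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """Syntactic check only: an absolute http(s) URI with a host part."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_entity_field(values: List[str],
                       labels: Dict[str, Dict[str, str]],
                       min_languages: int = 2) -> Dict[str, bool]:
    """Apply the two measurement rules to the values of one field."""
    # Rule 1: every value is a URI (an empty field does not pass).
    all_uris = bool(values) and all(is_http_uri(v) for v in values)
    # Rule 2: every URI has labels in at least `min_languages` languages.
    multilingual = all_uris and all(
        len(labels.get(v, {})) >= min_languages for v in values
    )
    return {"all_values_are_uris": all_uris, "all_uris_multilingual": multilingual}


# Hypothetical data: one creator given as a URI, one as free text.
labels = {"http://example.org/agent/1": {"en": "Dugald Clerk", "de": "Dugald Clerk"}}
print(check_entity_field(["http://example.org/agent/1"], labels))
print(check_entity_field(["Clerk, Dugald"], labels))
```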
Problem catalog
Catalog of known metadata problems in Europeana
• Title contents same as description contents
• Systematic use of the same title
• Bad string: "empty" (and variants)
• Shelfmarks and other identifiers in fields
• Creator not an agent name
• Absurd geographical location
• Subject field used as description field
• Unicode U+FFFD (�)
• Very short description field
• ...
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
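Several catalog entries lend themselves to simple automatic detectors. The sketch below covers four of them. The field names, the list of "empty" variants, and the length threshold are assumptions made for illustration; the catalog itself does not specify them.

```python
from typing import Dict, List

REPLACEMENT_CHAR = "\ufffd"                       # Unicode U+FFFD
EMPTY_MARKERS = {"empty", "[empty]", "(empty)"}   # assumed variants of the bad string
MIN_DESCRIPTION_LENGTH = 20                       # assumed threshold for "very short"


def detect_problems(record: Dict[str, List[str]]) -> List[str]:
    """Return the catalog problems found in a record, in a fixed order."""
    titles = [t.strip() for t in record.get("title", [])]
    descriptions = [d.strip() for d in record.get("description", [])]
    all_values = [v for values in record.values() for v in values]
    problems = []

    # Title contents same as description contents (case-insensitive).
    if {t.lower() for t in titles if t} & {d.lower() for d in descriptions if d}:
        problems.append("title_same_as_description")
    # Bad string: "empty" (and variants).
    if any(v.strip().lower() in EMPTY_MARKERS for v in all_values):
        problems.append("empty_string_marker")
    # Unicode U+FFFD, the replacement character.
    if any(REPLACEMENT_CHAR in v for v in all_values):
        problems.append("replacement_character")
    # Very short description field.
    if any(0 < len(d) < MIN_DESCRIPTION_LENGTH for d in descriptions):
        problems.append("very_short_description")
    return problems


# Hypothetical record, for illustration only.
print(detect_problems({"title": ["Group photograph"],
                       "description": ["group photograph"]}))
```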
How to define measurements?
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/
A language for describing and constraining the contents of RDF
graphs. It provides a high-level vocabulary to identify predicates and
their associated cardinalities, datatypes and other constraints.
• sh:equals, sh:notEquals
• sh:hasValue
• sh:in
• sh:lessThan, sh:lessThanOrEquals
• sh:minCount, sh:maxCount
• sh:minLength, sh:maxLength
• sh:pattern
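To illustrate how such constraints behave, here is a small Python sketch that evaluates a few of them against the values of a single field. It is not a SHACL processor and does not operate on RDF graphs; it only borrows the constraint names (without the `sh:` prefix) for readability. The function `validate` and the sample shape are hypothetical.

```python
import re
from typing import Any, Dict, List


def validate(values: List[str], shape: Dict[str, Any]) -> List[str]:
    """Return the names of the constraints in `shape` that `values` violate."""
    violations = []
    if "minCount" in shape and len(values) < shape["minCount"]:
        violations.append("minCount")
    if "maxCount" in shape and len(values) > shape["maxCount"]:
        violations.append("maxCount")
    if "minLength" in shape and any(len(v) < shape["minLength"] for v in values):
        violations.append("minLength")
    if "maxLength" in shape and any(len(v) > shape["maxLength"] for v in values):
        violations.append("maxLength")
    # The pattern may match anywhere in the value, hence re.search.
    if "pattern" in shape and any(not re.search(shape["pattern"], v) for v in values):
        violations.append("pattern")
    if "in" in shape and any(v not in shape["in"] for v in values):
        violations.append("in")
    return violations


# Hypothetical shape for a title field: exactly one value, at least
# 15 characters long, not starting with whitespace.
title_shape = {"minCount": 1, "maxCount": 1, "minLength": 15, "pattern": r"^\S"}

print(validate(["photograph"], title_shape))
print(validate(["Photograph of Sir Dugald Clerk"], title_shape))
```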
early measurement results
and their visualization
overall view / collection view / record view
Completeness – 40 measurements
Field cardinality – 27 measurements
Uniqueness – 6 measurements
Language specification – 20 measurements
Problem catalog – 3 measurements
etc.
links
measurements
aggregated numbers
completeness
What is the ratio of populated fields in records?
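One way to read this question is as a simple ratio: populated fields divided by the fields the schema offers. A minimal sketch, assuming records are dicts of field name to list of values; the field list and the sample record are hypothetical.

```python
from typing import Dict, Iterable, List


def completeness(record: Dict[str, List[str]], schema_fields: Iterable[str]) -> float:
    """Ratio of schema fields that have at least one non-blank value."""
    fields = list(schema_fields)
    if not fields:
        return 0.0
    populated = sum(
        1 for f in fields if any(v.strip() for v in record.get(f, []))
    )
    return populated / len(fields)


# Hypothetical schema and record, for illustration only.
FIELDS = ["title", "alternative", "description", "creator"]
record = {
    "title": ["Photograph of Sir Dugald Clerk"],
    "description": [" "],      # whitespace only, counts as not populated
    "creator": ["Unknown"],
}

print(completeness(record, FIELDS))  # 2 of 4 fields populated: 0.5
```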
Field frequency / main
Alternative title is a rare field
Field frequency per collection / all
no record has alternative title
every record has alternative title
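The two extremes in the chart (a collection where no record has the field, and one where every record has it) fall out of a per-collection frequency. A sketch, assuming each record arrives paired with a collection identifier; the identifiers and records below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, List[str]]


def field_frequency(records: List[Tuple[str, Record]], field: str) -> Dict[str, float]:
    """Share of records per collection in which `field` is populated."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for collection_id, record in records:
        totals[collection_id] += 1
        if any(v.strip() for v in record.get(field, [])):
            hits[collection_id] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical data: collection-a always has an alternative title,
# collection-b never does.
records = [
    ("collection-a", {"title": ["x"], "alternative": ["y"]}),
    ("collection-a", {"title": ["x"], "alternative": ["z"]}),
    ("collection-b", {"title": ["x"]}),
    ("collection-b", {"title": ["x"], "alternative": [""]}),
]

print(field_frequency(records, "alternative"))
```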
multilinguality
Do we know the language of a field value?
Multilinguality
@resource is a URI
@ = language notation in RDF
no language specification
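The three cases (a resource URI, a value with a language tag, a value without one) can be counted per field. The sketch below assumes a simplified, hypothetical representation in which each field instance is a dict with either an `@resource` key or an `@value` key plus an optional `@lang` key; the actual record serialization may differ.

```python
from collections import Counter
from typing import Dict, List


def language_profile(instances: List[Dict[str, str]]) -> Counter:
    """Count field instances per language tag, plus two special buckets."""
    profile: Counter = Counter()
    for instance in instances:
        if "@resource" in instance:
            profile["resource"] += 1            # value is a URI
        elif instance.get("@lang"):
            profile[instance["@lang"]] += 1     # language is specified
        else:
            profile["no language"] += 1         # no language specification
    return profile


# Hypothetical instances of a subject field.
instances = [
    {"@resource": "http://example.org/concept/1"},
    {"@value": "Fotografie", "@lang": "de"},
    {"@value": "photograph", "@lang": "en"},
    {"@value": "photograph"},
]

print(language_profile(instances))
```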
Language frequency / barchart
Language frequency / barchart
same language,
different encodings
Language frequency / Treemap with resources
has no language specification
has language specification
is a URI
uniqueness (entropy)
How unique are the terms in a field?
Entropy – term uniqueness / main
1 means a unique term
0.0000x means a very frequent term
These are cumulative numbers
$\mathrm{entropy}_{\mathrm{cumulative}} = \mathrm{term}_1 + \dots + \mathrm{term}_n$
Entropy – term uniqueness / collection
max is exceptional (=1425 * mean)
unique records
non-unique or less unique records
Entropy – term uniqueness / refining the picture
bulk of records are close to zero
although 25% are between 0.05 and 1.25
Entropy – term uniqueness / terms
explanation of uniqueness score
TF-IDF values come from Apache Solr
term frequency: 1
document freq.: 2
uniqueness score: 0.5
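The numbers in this example are consistent with a score of term frequency divided by document frequency (1 / 2 = 0.5). The sketch below works under that assumption only; the framework takes its TF-IDF values from Apache Solr and its exact formula may differ. The document-frequency table and the function names are hypothetical.

```python
from typing import Dict, List


def term_scores(field_terms: List[str], document_frequency: Dict[str, int]) -> Dict[str, float]:
    """Score each distinct term as term frequency / document frequency (assumed formula)."""
    scores = {}
    for term in set(field_terms):
        tf = field_terms.count(term)      # occurrences within this field value
        df = document_frequency[term]     # number of records containing the term
        scores[term] = tf / df
    return scores


def cumulative_score(scores: Dict[str, float]) -> float:
    """Sum of the per-term scores of a field."""
    return sum(scores.values())


# Hypothetical index statistics: "photograph" occurs in 2 records, "clerk" in 1.
document_frequency = {"photograph": 2, "clerk": 1}
scores = term_scores(["photograph", "clerk"], document_frequency)

print(scores["photograph"])      # 1 / 2 = 0.5, as in the example above
print(cumulative_score(scores))  # 0.5 + 1.0 = 1.5
```

Under this reading, a term that occurs in only one record scores 1, and a term found in very many records scores close to zero.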
problem catalog
Does the record have any specific issues?
Problem catalog – same title and description
there is one title and
description which is the same
... and we have 9 such records
Problem catalog – same title and description – example
completeness sub-dimensions
Are the sub-dimensions (field groups
supporting specific functionalities) complete?
Record view – functionality matrix
existing
missing
functionalities
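A functionality matrix of this kind can be derived from a mapping of functionalities to the fields that support them. The grouping below is invented for illustration; the real field groups are defined by the project and are not reproduced here.

```python
from typing import Any, Dict, List

# Hypothetical grouping of fields by the functionality they support.
FUNCTIONALITY_FIELDS: Dict[str, List[str]] = {
    "searchability": ["title", "description", "subject"],
    "browsing": ["type", "subject", "creator"],
    "multilinguality": ["title", "language"],
}


def functionality_matrix(record: Dict[str, List[str]],
                         groups: Dict[str, List[str]] = FUNCTIONALITY_FIELDS
                         ) -> Dict[str, Dict[str, Any]]:
    """For each functionality, list existing and missing fields and their ratio."""
    matrix = {}
    for functionality, fields in groups.items():
        existing = [f for f in fields if any(v.strip() for v in record.get(f, []))]
        missing = [f for f in fields if f not in existing]
        matrix[functionality] = {
            "existing": existing,
            "missing": missing,
            "completeness": len(existing) / len(fields),
        }
    return matrix


# Hypothetical record, for illustration only.
record = {
    "title": ["Photograph of Sir Dugald Clerk"],
    "subject": ["portrait"],
    "type": ["IMAGE"],
}

for name, row in functionality_matrix(record).items():
    print(name, row)
```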
miscellaneous
Further steps
• Incorporating into Europeana’s ingestion tool
• Process usage statistics (logs, Google Analytics)
• Human evaluation of metadata quality
• Measuring timeliness (changes of scores over time)
• Machine learning based classification & clustering
• Incorporating into research data management tool
• Cooperation with other projects
Architectural overview
[Diagram] Components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js)
Storage and exchange: Hadoop File System, Apache Solr, Apache Cassandra, JSON files, CSV files, image files
Legend: recent workflow, planned workflow
Follow me
• Europeana Data Quality Committee http://pro.europeana.eu/europeana-tech/data-quality-committee
• research plan and blog http://pkiraly.github.io
• site http://144.76.218.178/europeana-qa/
• source codes
• https://github.com/pkiraly/europeana-qa-spark
• https://github.com/pkiraly/europeana-qa-r
• @kiru, https://www.linkedin.com/in/peterkiraly

Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Metadata quality Assurance Framework at QQML2016 - short

  • 1. Metadata Quality Assurance Framework Péter Király <peter.kiraly@gwdg.de> Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany QQML2016 8th International Conference on Qualitative and Quantitative Methods in Libraries 2016-05-24, London
  • 2. The problem: there are „good” and „bad” metadata records
  • 3. Typical issues – non-informative field: the title is not informative
      - non-informative: „photograph, framed”, „group photograph”, „photograph”
      - informative: „Photograph of Sir Dugald Clerk”, „Photograph of "Puffing Billy"”
  • 4. Typical issues – field overuse: what is the meaning of the field? (example: TextGrid OAI-PMH response)
  • 5. Why is data quality important? „Fitness for purpose” (QA principle): no metadata → no access to data → no data usage. More explanation: Data on the Web Best Practices, W3C Working Draft 19 May 2016, https://www.w3.org/TR/dwbp/
  • 6. Europeana Data Quality Committee
      - Online collaboration
      - Use case documents
      - Problem catalog
      - Tickets
      - Discussion forum
      - #EuropeanaDataQuality
      - Bi-weekly teleconference
      - Bi-yearly face-to-face meeting
      - Topics
      - Usage scenarios
      - Metadata profiles
      - Schema modification
      - Measuring
      - Event model
      - Proposals for data providers
  • 7. What is it good for?
      - improve the metadata
      - improve services: good data → functions
      - improve metadata schema & documentation
      - propagate „good practice”
      - Domains: cultural heritage sector; research data management and archiving
  • 8. Research hypothesis: by measuring structural elements we can predict metadata record quality
  • 9. Research hypothesis – proposed solution: an open source measuring and reporting tool, the Metadata Quality Assurance Framework
  • 10. What to measure?
  • 11. Measurements
      - Schema-independent structural features: existence, cardinality, uniqueness, length, dictionary entry, data type conformance
      - Use case scenarios („fit for purpose”): requirements of the most important functions
      - Problem catalog: known metadata problems
  • 12. Discovery scenarios and their metadata requirements: Europeana’s most important functions
      1. Basic retrieval with high precision and recall
      2. Cross-language recall
      3. Entity-based facets
      4. Date-based facets
      5. Improved language facets
      6. Browse by subjects and resource types
      7. Browse by agents
      8. Browse/Search by Event
      9. Entity-based knowledge cards and pages
      10. Categorised similar items
      11. Spatial search, browse, and map display
      12. Entity-based autocompletion
      13. Diversification of results
      14. Hierarchical search and facets
      Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
  • 13. Discovery scenarios and their metadata requirements – Entity-based facets
      - Scenario: As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
      - Metadata analysis: In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.
      - Measurement rules: the relevant field values should be resolvable URIs; each URI should have labels in multiple languages
  • 14. Problem catalog: catalog of known metadata problems in Europeana
      - Title contents same as description contents
      - Systematic use of the same title
      - Bad string: "empty" (and variants)
      - Shelfmarks and other identifiers in fields
      - Creator not an agent name
      - Absurd geographical location
      - Subject field used as description field
      - Unicode U+FFFD (�)
      - Very short description field
      - ...
      Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
  • 15. How to define measurements?
  • 16. Problem catalog – proposed basis of implementation: Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/ A language for describing and constraining the contents of RDF graphs. It provides a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.
      - sh:equals, sh:notEquals
      - sh:hasValue
      - sh:in
      - sh:lessThan, sh:lessThanOrEquals
      - sh:minCount, sh:maxCount
      - sh:minLength, sh:maxLength
      - sh:pattern
  • 17. Early measurement results and their visualization
  • 18. Views: overall view, collection view, record view. Measurements: Completeness – 40 measurements; Field cardinality – 27 measurements; Uniqueness – 6 measurements; Language specification – 20 measurements; Problem catalog – 3 measurements; etc. (diagram labels: links, measurements, aggregated numbers)
  • 19. Completeness: what is the ratio of populated fields in records?
  • 20. Field frequency / main: alternative title is a rare field
  • 21. Field frequency per collections / all (chart annotations: no record has alternative title; every record has alternative title)
  • 22. Multilinguality: do we know the language of a field value?
  • 23. Multilinguality (chart annotations: @resource is a URI; @ = language notation in RDF; no language specification)
  • 24. Language frequency / barchart
  • 25. Language frequency / barchart: same language, different encodings
  • 26. Language frequency / treemap with resources (legend: has no language specification; has language specification; is a URI)
  • 27. Uniqueness (entropy): how unique are the terms in a field?
  • 28. Entropy – term uniqueness / main: 1 means a unique term, 0.0000x means a very frequent term. These are cumulative numbers: entropy_cumulative = term_1 + ... + term_n
  • 29. Entropy – term uniqueness / collection (chart annotations: max is exceptional (= 1425 * mean); unique records; non-unique or less unique records)
  • 30. Entropy – term uniqueness / refining the picture: the bulk of records are close to zero, although 25% are between 0.05 and 1.25
  • 31. Entropy – term uniqueness / terms: explanation of the uniqueness score. TF-IDF values come from Apache Solr. Example: term frequency: 1, document frequency: 2, uniqueness score: 0.5
  • 32. Problem catalog: does the record have any specific issues?
  • 33. Problem catalog – same title and description: there is one title and description which is the same ... and we have 9 such records
  • 34. Problem catalog – same title and description – example
  • 35. Completeness sub-dimensions: are the sub-dimensions (field groups supporting specific functionalities) complete?
  • 36. Record view – functionality matrix: existing / missing functionalities
  • 37. Miscellaneous
  • 38. Further steps
      - Incorporating into Europeana’s ingestion tool
      - Process usage statistics (logs, Google Analytics)
      - Human evaluation of metadata quality
      - Measuring timeliness (changes of scores over time)
      - Machine learning based classification & clustering
      - Incorporating into research data management tool
      - Cooperation with other projects
  • 39. Architectural overview (diagram). Components: OAI-PMH client (PHP), Apache Spark (Java), Analysis with Spark (Scala), Analysis with R, Web interface (PHP, d3.js). Storage and exchange: Hadoop File System, Apache Solr, Apache Cassandra, JSON files, CSV files, image files. Legend: recent workflow, planned workflow
  • 40. Follow me
      - Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality-committee
      - research plan and blog: http://pkiraly.github.io
      - site: http://144.76.218.178/europeana-qa/
      - source codes: https://github.com/pkiraly/europeana-qa-spark, https://github.com/pkiraly/europeana-qa-r
      - @kiru, https://www.linkedin.com/in/peterkiraly
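
Three of the measurements named in the slides (completeness on slide 19, the cumulative uniqueness score on slides 28 and 31, and the "same title and description" check on slide 33) can be illustrated with a minimal Python sketch. This is not the framework's code, which according to slide 39 runs on Apache Spark with term statistics from Apache Solr. The sample records, the field names and all function names are hypothetical. The per-term score of term frequency divided by document frequency is an assumption: it is one reading that reproduces the slide 31 example (term frequency 1, document frequency 2, score 0.5).

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative, not the EDM schema.
RECORDS = [
    {"title": "Photograph of Sir Dugald Clerk",
     "description": "Portrait photograph", "creator": "Unknown"},
    {"title": "photograph", "description": "photograph", "creator": ""},
    {"title": "group photograph", "description": None},
]
FIELDS = ["title", "description", "creator", "alternative"]


def is_populated(value):
    """A field counts as populated when it holds a non-blank string."""
    return isinstance(value, str) and value.strip() != ""


def completeness(record, fields=FIELDS):
    """Ratio of populated fields among the fields being checked."""
    return sum(is_populated(record.get(f)) for f in fields) / len(fields)


def same_title_and_description(record):
    """Problem-catalog check: title contents same as description contents."""
    title, description = record.get("title"), record.get("description")
    if not (is_populated(title) and is_populated(description)):
        return False
    return title.strip().lower() == description.strip().lower()


def document_frequencies(records, field):
    """Number of records in which each lower-cased term occurs in the field."""
    df = Counter()
    for record in records:
        value = record.get(field)
        if is_populated(value):
            df.update(set(value.lower().split()))
    return df


def cumulative_uniqueness(record, field, df):
    """Sum of per-term scores; each score is assumed to be tf / df."""
    value = record.get(field)
    if not is_populated(value):
        return 0.0
    tf = Counter(value.lower().split())
    return sum(count / df[term] for term, count in tf.items())


if __name__ == "__main__":
    df = document_frequencies(RECORDS, "title")
    for record in RECORDS:
        print(record["title"],
              completeness(record),
              same_title_and_description(record),
              round(cumulative_uniqueness(record, "title", df), 4))
```

In this toy data the bare title „photograph” scores lowest on uniqueness because its only term occurs in every title, which matches the non-informative title issue from slide 3.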