Metadata Quality Assurance Framework at QQML2016 conference - full version

  1. Metadata Quality Assurance Framework. Péter Király <peter.kiraly@gwdg.de>, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany. QQML2016, 8th International Conference on Qualitative and Quantitative Methods in Libraries, 2016-05-24, London
  2. The problem: there are “good” and “bad” metadata records.
  3. Typical issues – non-informative field. The title is not informative. Non-informative: “photograph, framed”, “group photograph”, “photograph”. Informative: “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”.
  4. Typical issues – copy & paste cataloging. Keeping placeholders / templates.
  5. Typical issues – field overuse. What is the meaning of the field? Example shown: a TextGrid OAI-PMH response.
  6. Why is data quality important? “Fitness for purpose” (QA principle): no metadata → no access to data → no data usage. More explanation: Data on the Web Best Practices, W3C Working Draft 19 May 2016, https://www.w3.org/TR/dwbp/
  7. Europeana Data Quality Committee: online collaboration; use case documents; problem catalog; tickets; discussion forum; #EuropeanaDataQuality; bi-weekly teleconf; bi-yearly face-to-face meeting. Topics: usage scenarios; metadata profiles; schema modification; measuring; event model; proposals for data providers.
  8. Research hypothesis: by measuring structural elements we can predict metadata record quality.
  9. What is it good for? Improve the metadata; improve services (good data → functions); improve metadata schema & documentation; propagate “good practice”. Domains: cultural heritage sector; research data management and archiving.
  10. Research hypothesis. Proposed solution: the Metadata Quality Assurance Framework.
  11. What to measure?
  12. Measurements. Schema-independent structural features: existence, cardinality, uniqueness, length, dictionary entry, data type conformance. Use case scenarios (“fit for purpose”): requirements of the most important functions. Problem catalog: known metadata problems.
  13. Discovery scenarios and their metadata requirements. Europeana’s most important functions: (1) Basic retrieval with high precision and recall; (2) Cross-language recall; (3) Entity-based facets; (4) Date-based facets; (5) Improved language facets; (6) Browse by subjects and resource types; (7) Browse by agents; (8) Browse/Search by Event; (9) Entity-based knowledge cards and pages; (10) Categorised similar items; (11) Spatial search, browse, and map display; (12) Entity-based autocompletion; (13) Diversification of results; (14) Hierarchical search and facets. Credit: the document was initiated by Timothy Hill, Europeana’s search engineer.
  14. Discovery scenarios and their metadata requirements – Entity-based facets. Scenario: as a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc. Metadata analysis: in each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages. Measurement rules: the relevant field values should be resolvable URIs; each URI should have labels in multiple languages.
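
The two measurement rules on the entity-based facets slide could be approximated as below. This is a minimal sketch, not the framework's implementation: it only tests whether a value is syntactically an HTTP(S) URI (actual resolvability would need a network request), and the `labels_by_language` mapping, the `min_languages` threshold and the `example.org` URI are hypothetical stand-ins.

```python
from urllib.parse import urlparse


def looks_like_uri(value):
    """Syntactic check only: an http(s) scheme and a host part are present.

    Assumption: "resolvable URI" is approximated without any network access.
    """
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def entity_facet_checks(value, labels_by_language, min_languages=2):
    """Apply both rules to one field value.

    labels_by_language is a hypothetical mapping such as {"en": "...", "fr": "..."}
    holding the labels known for the entity the URI points to.
    """
    return {
        "is_uri": looks_like_uri(value),
        "multilingual_labels": len(labels_by_language) >= min_languages,
    }


if __name__ == "__main__":
    # Hypothetical entity URI with labels in two languages.
    print(entity_facet_checks("http://example.org/agent/123",
                              {"en": "Dugald Clerk", "fr": "Dugald Clerk"}))
    # Free text instead of a URI fails the first rule.
    print(entity_facet_checks("Dugald Clerk", {}))
```
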
  15. Discovery scenarios and their metadata requirements – Date-based facets. Scenario: I want to be able to filter my results by a variety of timespans, e.g. date of creation, date of publication, date as subject. Metadata analysis: dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear). Measurement rules: field value should be one of the XSD date-time data types.
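
The date rule above can be illustrated with a lexical check. The sketch below assumes that matching the lexical form of a few XSD types is enough for a first measurement; the patterns are simplified (they do not validate month or day ranges) and the function name is hypothetical.

```python
import re

# Optional timezone suffix shared by the XSD date/time lexical forms.
_TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Simplified lexical patterns; an assumption, not a full XSD validator.
XSD_PATTERNS = {
    "xsd:gYear": re.compile(r"-?\d{4,}" + _TZ),
    "xsd:gYearMonth": re.compile(r"-?\d{4,}-\d{2}" + _TZ),
    "xsd:date": re.compile(r"-?\d{4,}-\d{2}-\d{2}" + _TZ),
    "xsd:dateTime": re.compile(
        r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ
    ),
}


def xsd_date_type(value):
    """Return the name of the first matching XSD type, or None."""
    for name, pattern in XSD_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None


if __name__ == "__main__":
    for sample in ["-0490", "2016-05-24", "490 avant J.C"]:
        print(sample, "->", xsd_date_type(sample))
```

A value for which `xsd_date_type` returns `None` would count as a violation of the measurement rule.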
  16. Problem catalog. Catalog of known metadata problems in Europeana: title contents same as description contents; systematic use of the same title; bad string: "empty" (and variants); shelfmarks and other identifiers in fields; creator not an agent name; absurd geographical location; subject field used as description field; Unicode U+FFFD (�); very short description field; ... Credit: the document was initiated by Timothy Hill, Europeana’s search engineer.
  17. Problem catalog – example entry. Description: title contents same as description contents. Example: /2023702/35D943DF60D779EC9EF31F5DF... Motivation: distorts search weightings. Checking method: field comparison. Notes: record display: creator concatenated onto title. Metadata scenario: Basic Retrieval.
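
The "field comparison" checking method for this catalog entry might look like the sketch below. The record layout (a dict of field name to list of string values), the field names `dc:title` and `dc:description`, and the whitespace and case normalisation are all assumptions made for illustration.

```python
def _normalise(value):
    """Collapse whitespace and ignore case (an assumed normalisation)."""
    return " ".join(value.split()).casefold()


def same_title_and_description(record):
    """True if any non-empty title equals any non-empty description.

    record is a hypothetical dict such as {"dc:title": [...], "dc:description": [...]}.
    """
    titles = {_normalise(v) for v in record.get("dc:title", []) if v.strip()}
    descriptions = {
        _normalise(v) for v in record.get("dc:description", []) if v.strip()
    }
    return bool(titles & descriptions)


if __name__ == "__main__":
    record = {"dc:title": ["Group photograph"],
              "dc:description": ["group  photograph"]}
    print(same_title_and_description(record))
```
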
  18. How to define measurements?
  19. Problem catalog – proposed basis of implementation: Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/. A language for describing and constraining the contents of RDF graphs. It provides a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints: sh:equals, sh:notEquals; sh:hasValue; sh:in; sh:lessThan, sh:lessThanOrEquals; sh:minCount, sh:maxCount; sh:minLength, sh:maxLength; sh:pattern.
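
To give a feel for what some of these constraints express, the sketch below mimics five of them (sh:minCount, sh:maxCount, sh:minLength, sh:maxLength, sh:pattern) in plain Python over a list of string values. It is not a SHACL processor and does not touch RDF; the function and parameter names are hypothetical, and only the constraint names come from the slide.

```python
import re


def check_property(values, min_count=None, max_count=None,
                   min_length=None, max_length=None, pattern=None):
    """Return the names of the violated constraints for one property.

    values: list of string values of a single field in a single record.
    """
    violations = []
    if min_count is not None and len(values) < min_count:
        violations.append("sh:minCount")
    if max_count is not None and len(values) > max_count:
        violations.append("sh:maxCount")
    for value in values:
        if min_length is not None and len(value) < min_length:
            violations.append("sh:minLength")
        if max_length is not None and len(value) > max_length:
            violations.append("sh:maxLength")
        # sh:pattern matches anywhere in the value, hence re.search.
        if pattern is not None and re.search(pattern, value) is None:
            violations.append("sh:pattern")
    return violations


if __name__ == "__main__":
    # A "very short description field" style check with an assumed threshold.
    print(check_property(["empty"], min_count=1, min_length=10))
    # A missing mandatory field.
    print(check_property([], min_count=1))
```
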
  20. Early measurement results and their visualization
  21. Views: overall view, collection view, record view. Completeness – 40 measurements; field cardinality – 27 measurements; uniqueness – 6 measurements; language specification – 20 measurements; problem catalog – 3 measurements; etc. Callouts: links, measurements, aggregated numbers.
  22. Completeness: what is the ratio of populated fields in records?
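
The completeness question can be read as a simple ratio. The sketch below assumes a record is a dict mapping field names to lists of values and that a field counts as populated when it has at least one non-blank value; the field list passed in is hypothetical.

```python
def completeness(record, schema_fields):
    """Ratio of schema fields that have at least one non-blank value."""
    if not schema_fields:
        raise ValueError("schema_fields must not be empty")
    populated = sum(
        1
        for field in schema_fields
        if any(str(value).strip() for value in record.get(field, []))
    )
    return populated / len(schema_fields)


if __name__ == "__main__":
    # Hypothetical field list and record.
    fields = ["dc:title", "dc:description", "dc:subject", "dcterms:alternative"]
    record = {"dc:title": ["Photograph of Sir Dugald Clerk"],
              "dc:description": [""]}
    print(completeness(record, fields))
```
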
  23. Field frequency / main
  24. Field frequency / main: alternative title is a rare field
  25. Field frequency per collection / all. Callouts: no record has an alternative title; every record has an alternative title.
  26. Field frequency per collection / remove no-instances
  27. Field frequency per collection / display only complete collections
  28. Cardinality: how many field instances are in the records?
  29. Field cardinality – overview. Callouts: more fields than records; number of records.
  30. Field cardinality – overview: dc:type
  31. Field cardinality – histogram: 128 subjects in one record; median is 0, mean is close to 1; link to interesting records.
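
A median of 0 together with a mean close to 1 is what a few heavily populated records among many empty ones produce. The sketch below shows this on made-up data; the record layout and the `dc:subject` field name are assumptions.

```python
from statistics import mean, median


def field_cardinalities(records, field):
    """Number of instances of one field in each record."""
    return [len(record.get(field, [])) for record in records]


def cardinality_summary(records, field):
    """Max, median and mean of the per-record counts (records must be non-empty)."""
    counts = field_cardinalities(records, field)
    return {"max": max(counts), "median": median(counts), "mean": mean(counts)}


if __name__ == "__main__":
    # Made-up data: four records without subjects, one with five.
    records = [{}, {}, {"dc:subject": ["a", "b", "c", "d", "e"]}, {}, {}]
    print(cardinality_summary(records, "dc:subject"))
```
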
  32. Field cardinality – an outlier
  33. Multilinguality: do we know the language of a field value?
  34. Multilinguality. Callouts: @resource is a URI; @ = language notation in RDF; no language specification.
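
The three cases named on this slide (language-tagged value, value without language, resource URI) can be counted as follows. The value layout is a hypothetical JSON-like shape chosen for the sketch: plain strings for untagged literals, and dicts with an `@lang` or `@resource` key otherwise.

```python
from collections import Counter


def language_category(value):
    """Classify one field value as a language code, 'resource' or 'no language'."""
    if isinstance(value, dict):
        if "@resource" in value:
            return "resource"
        if value.get("@lang"):
            return value["@lang"]
    return "no language"


def language_frequency(values):
    """Count the categories over all values of a field."""
    return Counter(language_category(value) for value in values)


if __name__ == "__main__":
    # Made-up values in the assumed layout.
    values = [
        "Photograph",
        {"@lang": "en", "#value": "Photograph"},
        {"@lang": "en", "#value": "Portrait"},
        {"@resource": "http://example.org/concept/1"},
    ]
    print(language_frequency(values))
```
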
  35. Language frequency / barchart
  36. Language frequency / barchart: same language, different encodings
  37. Language frequency / treemap. Callouts: has language specification; has no language specification.
  38. Language frequency / treemap with resources. Callouts: has no language specification; has language specification; is a URI.
  39. Language frequency / treemap + interaction + table. Callouts: hide/display categories; table-like format.
  40. Uniqueness (entropy): how unique are the terms in a field?
  41. Entropy – term uniqueness / main: 1 means a unique term; 0.0000x means a very frequent term. These are cumulative numbers: entropy_cumulative = term_1 + ... + term_n.
  42. Entropy – term uniqueness / collection. Callouts: max is exceptional (= 1425 * mean); unique records; not or less unique records.
  43. Entropy – term uniqueness / refining the picture: the bulk of records is close to zero, although 25% are between 0.05 and 1.25.
  44. Entropy – term uniqueness / field value: Russian text in transcribed Latin writing system, not in Cyrillic.
  45. Entropy – term uniqueness / terms: explanation of the uniqueness score. TF-IDF values come from Apache Solr. Term frequency: 1; document freq.: 2; uniqueness score: 0.5.
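
One reading consistent with the numbers on this slide (term frequency 1, document frequency 2, score 0.5) is score = term frequency / document frequency, summed over the terms of a field value to get the cumulative number. The sketch below follows that reading, which is an assumption; it also swaps Apache Solr for plain Python counting with a naive whitespace tokenizer.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lower-case and split on whitespace."""
    return text.lower().split()


def document_frequencies(field_values):
    """In how many field values does each term occur?"""
    df = Counter()
    for text in field_values:
        df.update(set(tokenize(text)))
    return df


def uniqueness_scores(text, df):
    """Per-term score tf / df; text must be one of the values df was built from."""
    tf = Counter(tokenize(text))
    return {term: count / df[term] for term, count in tf.items()}


def cumulative_uniqueness(text, df):
    """Sum of the per-term scores of one field value."""
    return sum(uniqueness_scores(text, df).values())


if __name__ == "__main__":
    titles = ["photograph framed", "photograph of puffing billy"]
    df = document_frequencies(titles)
    print(uniqueness_scores(titles[0], df))
    print(cumulative_uniqueness(titles[0], df))
```

Under this reading a term that occurs in only one value scores 1, and a term shared by many values approaches 0, which matches the interpretation given on the "main" slide.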
  46. Problem catalog: does the record have any specific issues?
  47. Problem catalog – long subject: a record with 265 “long” subject headings
  48. Problem catalog – long subject – example (not so long...). Conclusion: we have to refine the definition of “long”.
  49. Problem catalog – same title and description: there is one title and description which is the same ... and we have 9 such records
  50. Problem catalog – same title and description – example
  51. Completeness sub-dimensions: are the sub-dimensions (field groups supporting specific functionalities) complete?
  52. Record view – functionality matrix. Callouts: existing; missing; functionalities.
  53. Miscellaneous
  54. Other elements of the record view
  55. Further steps: incorporating into Europeana’s ingestion tool; processing usage statistics (logs, Google Analytics); human evaluation of metadata quality; measuring timeliness (changes of scores over time); machine learning based classification & clustering; incorporating into a research data management tool; cooperation with other projects.
  56. Project principles: scalable, ready for big data; loose coupling to metadata schemas; transparency: open source, open data (CC0); release early, release often; getting real [1]; collaboration and communication. [1] https://gettingreal.37signals.com/
  57. Architectural overview. Components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js). Storage and data: Hadoop File System; Apache Solr; Apache Cassandra; JSON files; CSV files; image files. Legend: recent workflow; planned workflow.
  58. Follow me. Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality-committee. Research plan and blog: http://pkiraly.github.io. Site: http://144.76.218.178/europeana-qa/. Source codes: https://github.com/pkiraly/europeana-qa-spark, https://github.com/pkiraly/europeana-qa-r. @kiru, https://www.linkedin.com/in/peterkiraly