
Taxonomy Quality Assessment


This talk addresses two questions: “How can the quality of taxonomies be defined?” and “How can it be measured?” See how quality criteria vary depending on how a taxonomy is applied, such as automatic content classification in ecommerce or a knowledge graph for data integration in enterprises. Distinguish between formal quality, structural properties, content coverage, and network topology. Investigate how standards-based, machine-processable SKOS taxonomies make it possible to measure taxonomy quality automatically, and survey several tools and techniques for quality assessment.

Published in: Software

  1. TAXONOMY QUALITY ASSESSMENT: TOOLS & TECHNIQUES. Andreas Blumauer, CEO & Managing Partner, Semantic Web Company & PoolParty Semantic Suite. Taxonomy Boot Camp 2016, Washington, DC
  2. INTRODUCTION [Diagram: a small knowledge graph introducing the speaker. Andreas Blumauer is founder & CEO of Semantic Web Company (founded 2004, located in Vienna, serves >200 customers), which is developer and vendor of a product at current version 5.5; further nodes are Taxonomy, Ontology, and Knowledge Graph, linked by "is a", "part of", "manages", "based on", and "standard for" relations.]
  3. Aspects of Taxonomy Quality: types of taxonomy quality metrics, and for which scenarios they are relevant
  4. Why is taxonomy quality important? Some examples of quality issues and their possible consequences
     ▸ Missing labels
       ▹ AGROVOC (FAO) defines concepts in 25 different languages. While most concepts have English labels attached, only 38% have German labels.
       ▹ This can be a problem for multilingual applications that rely on label translations.
     ▸ Orphan concepts
       ▹ An orphan concept is a concept that has no semantic relation with any other concept. Although it might have attached lexical labels, it lacks valuable context information.
       ▹ This can be crucial for retrieval tasks such as search query expansion.
     ▸ Mismatch between content and taxonomy
       ▹ There are only minor overlaps between the scope of the documents (or data) to be indexed and the scope of the controlled vocabulary in use.
       ▹ This leads to a sparse enrichment of the document index with semantic information.
     See also: Finding quality issues in SKOS vocabularies (Christian Mader, Bernhard Haslhofer, Antoine Isaac)
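The first two issues on this slide can be checked mechanically once a taxonomy is available as triples. The following is a minimal sketch, assuming a simplified in-memory representation (plain tuples with prefixed names, labels as `(text, language)` pairs); the data and the helper names `find_orphans` and `label_coverage` are hypothetical and are not taken from AGROVOC or from any tool named in the talk.

```python
# Triples are plain (subject, predicate, object) tuples; labels are (text, lang) pairs.
SKOS_RELATIONS = {"skos:broader", "skos:narrower", "skos:related"}


def find_orphans(concepts, triples):
    """Return concepts that take part in no broader/narrower/related triple."""
    connected = set()
    for s, p, o in triples:
        if p in SKOS_RELATIONS:
            connected.add(s)
            connected.add(o)
    return sorted(c for c in concepts if c not in connected)


def label_coverage(concepts, triples, lang):
    """Fraction of concepts with at least one skos:prefLabel in `lang`."""
    labelled = {
        s for s, p, o in triples
        if p == "skos:prefLabel" and isinstance(o, tuple) and o[1] == lang
    }
    return len(labelled & set(concepts)) / len(concepts) if concepts else 0.0


# Hypothetical toy data for illustration only.
concepts = ["ex:crops", "ex:maize", "ex:soil"]
triples = [
    ("ex:maize", "skos:broader", "ex:crops"),
    ("ex:crops", "skos:prefLabel", ("crops", "en")),
    ("ex:maize", "skos:prefLabel", ("maize", "en")),
    ("ex:maize", "skos:prefLabel", ("Mais", "de")),
    ("ex:soil", "skos:prefLabel", ("soil", "en")),
]

orphans = find_orphans(concepts, triples)
german = label_coverage(concepts, triples, "de")
print("orphan concepts:", orphans)
print("German label coverage:", round(german, 2))
```

In this toy data `ex:soil` has a label but no relation, so it is reported as an orphan, and only one of three concepts carries a German label.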
  5. Taxonomy quality issues are more frequently observed than some might expect. See also: Finding quality issues in SKOS vocabularies
  6. Taxonomy quality criteria and issues at different levels
     1. Formal integrity conditions based on SKOS
        ▹ Construction of well-formed and consistent data to promote interoperability
        ▹ Example: No two concepts may be connected by both related (skos:related) and broader transitive (skos:broaderTransitive)
        ▹ Read more: SKOS: A Guide for Information Professionals (Jane Frazier)
     2. Labeling and documentation issues
        ▹ Construction of taxonomies that support complex retrieval tasks
        ▹ Example: No two concepts of a concept scheme may have the same preferred label
        ▹ Read more: SKOS Primer (Antoine Isaac / Ed Summers)
     3. Structural issues
        ▹ Logic-based processing of taxonomies
        ▹ Example: Avoidance of hierarchical cycles
        ▹ Read more: Key choices in the design of SKOS (Thomas Baker et al.)
     4. Content coverage
        ▹ Development of taxonomies that reflect the scope of the represented content well
        ▹ Example: Avoid maintaining subtrees that only have limited occurrences in a representative document corpus
        ▹ Read more: Corpus management with PoolParty
     5. Network topological issues (experimental)
        ▹ (Co-)occurrences of concepts in a corpus should be reflected in the network topology of a knowledge graph
        ▹ Example: Nodes/concepts with high betweenness centrality should occur correspondingly in a reference document corpus
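Two of the examples on this slide, the related/broader-transitive clash (level 1) and hierarchical cycles (level 3), both reduce to computing the transitive closure of the broader relation. A minimal sketch, assuming the hierarchy is given as `(narrower, broader)` pairs; the function names and the toy data are hypothetical and do not reflect how any particular tool implements these checks.

```python
from collections import defaultdict


def broader_closure(broader_pairs):
    """Map each concept to every concept reachable via one or more broader steps."""
    direct = defaultdict(set)
    for narrower, broader in broader_pairs:
        direct[narrower].add(broader)
    closure = {}
    for start in list(direct):
        seen, stack = set(), list(direct[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(direct.get(node, ()))
        closure[start] = seen
    return closure


def hierarchical_cycles(broader_pairs):
    """Concepts that are their own transitive broader concept, i.e. lie on a cycle."""
    closure = broader_closure(broader_pairs)
    return sorted(c for c, ancestors in closure.items() if c in ancestors)


def related_hierarchy_clashes(broader_pairs, related_pairs):
    """Related pairs that are also connected through the broader hierarchy."""
    closure = broader_closure(broader_pairs)
    return [
        (a, b) for a, b in related_pairs
        if b in closure.get(a, ()) or a in closure.get(b, ())
    ]


# Hypothetical toy data for illustration only.
broader = [
    ("ex:maize", "ex:cereals"),
    ("ex:cereals", "ex:crops"),
    ("ex:a", "ex:b"),
    ("ex:b", "ex:a"),  # deliberate cycle
]
related = [("ex:maize", "ex:crops"), ("ex:maize", "ex:soil")]

cycles = hierarchical_cycles(broader)
clashes = related_hierarchy_clashes(broader, related)
print("concepts on cycles:", cycles)
print("related/broader clashes:", clashes)
```

Here `ex:maize` is related to `ex:crops` while also being a transitive narrower concept of it, which is the kind of pair the integrity condition rules out.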
  7. Why are standards-based technologies and tools so important when it comes to taxonomy quality management? Spreadsheet editors are still the most common type of software application being used for taxonomy management. They cannot measure quality automatically.
  8. ‘Good’ quality depends on the usage scenario. Example: Google Product Taxonomy has no synonyms at all, only hierarchical relations
  9. How to pick the most relevant quality criteria for a taxonomy project. PoolParty supports various application scenarios. Quality checks can be enforced, reported, or ignored.
  10. How to pick the most relevant quality criteria for a taxonomy project
      ▸ General purpose thesaurus vs. custom enterprise taxonomy
        ▹ Custom enterprise taxonomies can be developed specifically on top of reference corpora
        ▹ General purpose thesauri are frequently used in the context of linked data environments → Linked data specific issues become more important
          ■ Missing In-Links
          ■ Missing Out-Links
          ■ Broken Links
          ■ Undefined SKOS Resources
          ■ HTTP URI Scheme Violation
      See also: PoolParty SKOS Quality Checker based on qSKOS
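Two of the linked data issues listed above can be approximated without any network access. The sketch below uses a deliberately simplified reading of "HTTP URI Scheme Violation" (a URI whose scheme is not http or https) and "Missing Out-Links" (a concept with no link to a resource on another host); these definitions are assumptions made for illustration and may differ from how qSKOS defines the checks. All URIs are made up.

```python
from urllib.parse import urlparse


def http_uri_violations(uris):
    """URIs whose scheme is neither http nor https."""
    return sorted(u for u in uris if urlparse(u).scheme not in ("http", "https"))


def missing_out_links(concept_uris, links):
    """Concepts with no link to a resource hosted somewhere else."""
    linked = set()
    for source, target in links:
        if urlparse(source).netloc != urlparse(target).netloc:
            linked.add(source)
    return sorted(c for c in concept_uris if c not in linked)


# Hypothetical toy data for illustration only.
concept_uris = [
    "http://example.org/c/1",
    "http://example.org/c/2",
    "urn:example:c3",
]
links = [
    ("http://example.org/c/1", "http://other.example.net/x"),  # out-link
    ("http://example.org/c/2", "http://example.org/c/1"),      # internal link
]

violations = http_uri_violations(concept_uris)
no_out_links = missing_out_links(concept_uris, links)
print("scheme violations:", violations)
print("concepts without out-links:", no_out_links)
```

Checking for broken links would additionally require dereferencing each URI, which is left out here.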
  11. Taxonomy Quality Metrics: how quality issues can be unveiled and how insights can be used for further improvements
  12. Repair label issues
  13. Repair structural issues
  14. Unveil mismatch between taxonomy and document corpus [Diagram: roles (Content Manager, Integrator, Taxonomist/Ontologist) and components (Thesaurus Server, Extractor, PowerTagging, Index, Corpus Learning/Semantic Analysis, CMS), connected by relations such as "uses API", "is user of", "is basis of", "annotates", "enriches", "extends", and "analyzes".]
  15. Unveil mismatch between taxonomy and document corpus. PoolParty extracts concepts not being used in a reference corpus at all and provides suggestions how those concepts could be reworked or extended to become relevant.
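The idea of finding concepts that never occur in a reference corpus can be illustrated with plain label matching. This is only a naive sketch under stated assumptions: the slides do not describe how PoolParty performs its extraction, and the function names, labels, and documents below are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


def unused_concepts(labels_by_concept, documents):
    """Concepts for which none of the labels occurs in any document."""
    doc_tokens = [tokenize(d) for d in documents]
    unused = []
    for concept, labels in labels_by_concept.items():
        phrases = [tokenize(label) for label in labels]
        hit = any(
            contains_phrase(tokens, phrase)
            for tokens in doc_tokens
            for phrase in phrases
            if phrase
        )
        if not hit:
            unused.append(concept)
    return sorted(unused)


# Hypothetical toy data for illustration only.
labels_by_concept = {
    "ex:crude_oil": ["crude oil", "petroleum"],
    "ex:opec": ["OPEC"],
    "ex:wheat": ["wheat"],
}
documents = [
    "Crude oil prices fell after the OPEC meeting.",
    "Petroleum exports rose.",
]

unused = unused_concepts(labels_by_concept, documents)
print("concepts never found in the corpus:", unused)
```

Exact token matching misses inflected forms and spelling variants, so a real corpus analysis would need stemming or similar normalization on top of this.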
  16. Unveil mismatch between taxonomy and document corpus. PoolParty extracts relevant candidate concepts based on a deep corpus analysis.
  17. Unveil mismatch between taxonomy and document corpus. PoolParty suggests possible ‘right places’ for the candidate concepts within the approved taxonomy.
  18. Unveil network topological issues. Example: STW Thesaurus for Economics
  19. Unveil network topological issues. Example: STW Thesaurus for Economics - Top 10 thesaurus concepts (betweenness)
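Betweenness centrality counts, for each concept, how many shortest paths between other concept pairs pass through it. The sketch below applies Brandes' algorithm to a taxonomy treated as an unweighted, undirected graph; that modelling choice, the function name, and the toy graph are assumptions for illustration and are not STW data or the method used to produce the slide.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph {node: set(neighbours)}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        stack = []
        preds = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adjacency}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy graph: a chain of five concepts.
graph = {
    "economy": {"energy"},
    "energy": {"economy", "oil"},
    "oil": {"energy", "crude oil"},
    "crude oil": {"oil", "brent"},
    "brent": {"crude oil"},
}

scores = betweenness(graph)
top = sorted(scores, key=lambda c: (-scores[c], c))[:3]
print("top concepts by betweenness:", top)
```

In a chain, the middle concept lies on the most shortest paths, so `oil` ranks first here; in a real thesaurus, high scores point to concepts that bridge otherwise separate branches.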
  20. Combined analysis over network topology and reference corpus. Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
  21. Combined analysis over network topology and reference corpus. Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
  22. Combined analysis over network topology and reference corpus: Correlation Betweenness & Document Frequency. Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
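The comparison between betweenness and document frequency can be summarized with a correlation coefficient. The slides do not say which coefficient was used; the sketch below assumes Pearson's r and uses made-up numbers, so it shows the mechanics only, not the STW results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: betweenness per concept and number of documents mentioning it.
betweenness_by_concept = {"oil": 4.0, "energy": 3.0, "opec": 3.0, "wheat": 0.0}
doc_frequency = {"oil": 40, "energy": 30, "opec": 30, "wheat": 0}

concepts = sorted(betweenness_by_concept)
r = pearson(
    [betweenness_by_concept[c] for c in concepts],
    [doc_frequency[c] for c in concepts],
)
print("correlation between betweenness and document frequency:", round(r, 3))
```

A value close to 1 would indicate that structurally central concepts are also the ones that occur most often in the corpus; a rank-based coefficient such as Spearman's may be a better fit when the distributions are heavily skewed.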
  23. Techniques and Tools: how they help to assess taxonomy quality
  24. Basel Register of Thesauri, Ontologies & Classifications
      ▸ Unveil Taxonomy Quality by the Wisdom of the Crowd
  25. qSKOS
      ▸ qSKOS is a tool for finding quality issues in SKOS vocabularies
      ▸ Available as a free online service
      ▸ A SKOS taxonomy is analyzed with regard to 24 issues
  26. PoolParty Import Validator
      ▸ RDF Validation to go beyond SKOS
      ▸ Checks are defined in RDF, repair strategies also defined as RDF
      ▸ 15 checks have been integrated
  27. Shapes Constraint Language (SHACL)
      ▸ “Do for RDF what XML Schema does for XML”
      ▸ Language for validating RDF graphs against a set of conditions
      ▸ SHACL shape graphs are used to validate that data graphs satisfy a set of conditions
      ▸ Current status: W3C Working Draft (14 August 2016)
      See also: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach (Christian Mader and Monika Solanki)
  28. GET YOUR TEST ACCOUNT. GET CERTIFIED.
  29. CONNECT: Andreas Blumauer, CEO, Semantic Web Company. © Semantic Web Company