Andreas Blumauer
CEO & Managing Partner
Semantic Web Company &
PoolParty Semantic Suite
TAXONOMY QUALITY
ASSESSMENT:
TOOLS & TECHNIQUES
Taxonomy
Boot Camp 2016
Washington, DC
1
INTRODUCTION
2
[Figure: knowledge graph introducing the speaker and the company. Node labels: Andreas Blumauer, Semantic Web Company, 2004, Vienna, >200, 5.5, Taxonomy, Knowledge Graph, Ontology. Edge labels: founder & CEO of, developer and vendor of, founded, located, serves customers, current version, active at, based on, standard for, manages, part of, is a.]
Aspects of
Taxonomy Quality
Types of taxonomy quality metrics,
and for which scenarios they are relevant
3
Why is taxonomy
quality important?
Some examples for
quality issues and
their possible
consequences
4
▸ Missing labels
▹ AGROVOC (FAO) defines concepts in 25 different languages. While most concepts have
English labels attached, only 38% have German labels.
▹ This can be a problem for multilingual applications that rely on label translations.
▸ Orphan concepts
▹ An orphan concept is a concept that has no semantic relation with any other concept.
Although it might have attached lexical labels, it lacks valuable context information.
▹ This can be a critical gap for retrieval tasks such as search query expansion.
▸ Mismatch between content and taxonomy
▹ There are only minor overlaps between the scope of the documents (or data) to be
indexed and the scope of the controlled vocabulary in use.
▹ This leads to a sparse enrichment of the document index by semantic information.
See also: Finding quality issues in SKOS vocabularies
(Christian Mader, Bernhard Haslhofer, Antoine Isaac)
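
The first two issues above (missing labels and orphan concepts) can be checked mechanically. The following is a minimal sketch, assuming a toy in-memory taxonomy; the `concepts` structure and the function names `label_coverage` and `orphan_concepts` are hypothetical and not part of any tool mentioned in these slides.

```python
# Hypothetical toy taxonomy: concept id -> labels per language and
# the ids of concepts it links to (broader, narrower or related).
concepts = {
    "c1": {"labels": {"en": "Cereals", "de": "Getreide"}, "links": {"c2"}},
    "c2": {"labels": {"en": "Wheat"}, "links": {"c1"}},
    "c3": {"labels": {"en": "Irrigation"}, "links": set()},
}


def label_coverage(concepts, lang):
    """Share of concepts that carry at least one label in the given language."""
    if not concepts:
        return 0.0
    labelled = sum(1 for c in concepts.values() if lang in c["labels"])
    return labelled / len(concepts)


def orphan_concepts(concepts):
    """Concepts with no outgoing link and no incoming link from any concept."""
    linked_to = set()
    for c in concepts.values():
        linked_to |= c["links"]
    return sorted(
        cid for cid, c in concepts.items()
        if not c["links"] and cid not in linked_to
    )


print(f"German label coverage: {label_coverage(concepts, 'de'):.0%}")
print("Orphan concepts:", orphan_concepts(concepts))
```

A real vocabulary would be loaded from SKOS data rather than typed in by hand; the checks themselves stay the same.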
Taxonomy quality
issues are more
frequently
observed than
some might expect
5
See also: Finding quality issues in SKOS vocabularies
Taxonomy quality
criteria and issues
at different levels
6
1. Formal integrity conditions based on SKOS
▹ Construction of well-formed and consistent data to promote interoperability
▹ Example: No two concepts may be connected by both skos:related and skos:broaderTransitive
▹ Read more: SKOS: A Guide for Information Professionals (Jane Frazier)
2. Labeling and documentation issues
▹ Construction of taxonomies that support complex retrieval tasks
▹ Example: No two concepts of a concept scheme may have the same preferred label in a given language
▹ Read more: SKOS Primer (Antoine Isaac / Ed Summers)
3. Structural issues
▹ Logic-based processing of taxonomies
▹ Example: Avoidance of hierarchical cycles
▹ Read more: Key choices in the design of SKOS (Thomas Baker et al)
4. Content coverage
▹ Development of taxonomies that accurately reflect the scope of the represented content
▹ Example: Avoid maintaining subtrees that only have limited occurrences in a representative
document corpus
▹ Read more: Corpus management with PoolParty
5. Network topological issues (experimental)
▹ (Co-)occurrences of concepts in a corpus should be reflected in the network topology of a
knowledge graph
▹ Example: Nodes/concepts with high betweenness centrality should occur correspondingly
in a reference document corpus
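
Two of the criteria above lend themselves to a short illustration: the formal condition that related and broader-transitive links must not connect the same pair of concepts, and the avoidance of hierarchical cycles. This is a minimal sketch, assuming the hierarchy is available as plain Python dictionaries; the data and the function names are hypothetical.

```python
# Hypothetical toy data: broader links (concept -> its broader concepts)
# and related pairs.
broader = {
    "wheat": {"cereals"},
    "cereals": {"crops"},
    "crops": {"cereals"},  # deliberate error: cereals and crops form a cycle
}
related = {("wheat", "crops")}  # deliberate error: wheat also reaches crops via broader


def broader_transitive(broader, start):
    """All concepts reachable from start by following broader links one or more times."""
    seen = set()
    stack = list(broader.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(broader.get(node, ()))
    return seen


def concepts_in_cycles(broader):
    """Concepts that can reach themselves through broader links."""
    return sorted(c for c in broader if c in broader_transitive(broader, c))


def related_hierarchy_clashes(broader, related):
    """Related pairs that are also connected through the hierarchy in either direction."""
    return sorted(
        (a, b) for a, b in related
        if b in broader_transitive(broader, a) or a in broader_transitive(broader, b)
    )


print("In cycles:", concepts_in_cycles(broader))
print("Related/hierarchy clashes:", related_hierarchy_clashes(broader, related))
```

The traversal is recomputed per concept, which is fine for a sketch but would be cached for a large vocabulary.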
Why are
standards-based
technologies and
tools so important
when it comes to
taxonomy quality
management?
7
Spreadsheet editors are still the most common type of software application
being used for taxonomy management. They cannot measure quality automatically.
‘Good’ quality
depends on the
usage scenario
8
Example: Google Product Taxonomy has no synonyms at all, only hierarchical relations
How to pick the
most relevant
quality criteria for a
taxonomy project
9
PoolParty supports various application scenarios. Quality checks can be enforced,
reported, or ignored.
How to pick the
most relevant
quality criteria for a
taxonomy project
10
▸ General purpose thesaurus vs. custom enterprise taxonomy
▹ Custom enterprise taxonomies can be developed specifically on top of reference corpora
▹ General purpose thesauri are frequently used in the context of linked data environments
→ Linked data specific issues become more important
■ Missing In-Links
■ Missing Out-Links
■ Broken Links
■ Undefined SKOS Resources
■ HTTP URI Scheme Violation
See also: PoolParty SKOS Quality Checker based on qSKOS
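
Some of the linked data issues listed above can be approximated with very little code. The sketch below assumes the vocabulary is available as (subject, predicate, object) string triples; the example triples, the abbreviated list of SKOS terms, and the function names are hypothetical. It is not how qSKOS implements these checks, and accepting both `http` and `https` as valid schemes is an assumption of this sketch.

```python
from urllib.parse import urlparse

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
# Deliberately incomplete subset of SKOS terms, enough for this sketch.
KNOWN_SKOS_TERMS = {
    "Concept", "ConceptScheme", "prefLabel", "altLabel",
    "broader", "narrower", "related", "exactMatch", "closeMatch",
}

# Hypothetical triples from a small vocabulary.
triples = [
    ("http://example.org/t/wheat", SKOS_NS + "prefLabel", "Wheat"),
    ("http://example.org/t/wheat", SKOS_NS + "exactMatch", "http://other.example.com/id/42"),
    ("urn:taxo:maize", SKOS_NS + "prefLable", "Maize"),  # misspelled property, non-HTTP URI
]


def http_uri_violations(triples):
    """Subjects whose URI does not use the http or https scheme."""
    subjects = {s for s, _, _ in triples}
    return sorted(s for s in subjects if urlparse(s).scheme not in ("http", "https"))


def undefined_skos_terms(triples):
    """Predicates in the SKOS namespace that are not in the known term list."""
    used = {p for _, p, _ in triples if p.startswith(SKOS_NS)}
    return sorted(p for p in used if p[len(SKOS_NS):] not in KNOWN_SKOS_TERMS)


def missing_out_links(triples):
    """Subjects that never point to a URI on a different host."""
    subjects = {s for s, _, _ in triples}
    with_out_link = set()
    for s, _, o in triples:
        target_host = urlparse(o).netloc
        if target_host and target_host != urlparse(s).netloc:
            with_out_link.add(s)
    return sorted(subjects - with_out_link)


print(http_uri_violations(triples))
print(undefined_skos_terms(triples))
print(missing_out_links(triples))
```

Broken links and missing in-links need network access or knowledge of other datasets, so they are left out here.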
Taxonomy
Quality Metrics
How quality issues can be unveiled
and how insights can be used for further improvements
11
Repair label issues
12
Repair structural
issues
13
Unveil mismatch
between taxonomy
and document
corpus
14
[Figure: architecture diagram. Roles: Content Manager, Integrator, Taxonomist/Ontologist. Components: Thesaurus Server, Extractor, PowerTagging, Corpus Learning/Semantic Analysis, CMS, Index. Edge labels: uses API, is user of, is basis of, annotates, enriches, extends, analyzes.]
Unveil mismatch
between taxonomy
and document
corpus
15
PoolParty identifies concepts that are not used in a reference corpus at all and suggests
how those concepts could be reworked or extended to become relevant.
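
The idea of spotting concepts that never occur in a reference corpus can be illustrated with simple label matching. This is a rough sketch under stated assumptions: the taxonomy, the three-sentence corpus, and the function names are made up, and plain case-insensitive string matching stands in for whatever extraction PoolParty actually performs.

```python
import re

# Hypothetical taxonomy: concept -> labels (preferred and alternative).
taxonomy = {
    "crude oil": ["crude oil", "petroleum"],
    "oil price": ["oil price"],
    "wind power": ["wind power", "wind energy"],
}

# Hypothetical reference corpus.
corpus = [
    "Crude oil futures rose as OPEC cut output.",
    "The oil price fell while petroleum stocks grew.",
    "Analysts expect the oil price to recover.",
]


def document_frequency(taxonomy, corpus):
    """Number of documents in which at least one label of each concept occurs."""
    df = {}
    for concept, labels in taxonomy.items():
        patterns = [
            re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
            for label in labels
        ]
        df[concept] = sum(
            1 for doc in corpus if any(p.search(doc) for p in patterns)
        )
    return df


def unused_concepts(taxonomy, corpus):
    """Concepts that do not occur in any document of the corpus."""
    return sorted(c for c, n in document_frequency(taxonomy, corpus).items() if n == 0)


print(document_frequency(taxonomy, corpus))
print("Unused:", unused_concepts(taxonomy, corpus))
```

Concepts reported as unused are candidates for new alternative labels, for merging, or for removal, depending on the usage scenario.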
Unveil mismatch
between taxonomy
and document
corpus
16
PoolParty extracts relevant candidate concepts based on a deep corpus analysis.
Unveil mismatch
between taxonomy
and document
corpus
17
PoolParty suggests possible ‘right places’ for the candidate concepts within the approved
taxonomy.
Unveil network
topological issues
18
Example: STW Thesaurus for Economics
Unveil network
topological issues
19
Example: STW Thesaurus for Economics - Top 10 thesaurus concepts (betweenness)
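
Betweenness centrality counts how often a node lies on shortest paths between other nodes. The following is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph, assuming that every thesaurus relation is treated as an undirected edge; the toy graph is hypothetical and is not taken from the STW Thesaurus.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph given as adjacency sets."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                          # nodes in order of non-decreasing distance from s
        preds = {v: [] for v in graph}      # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}       # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while order:                        # accumulate dependencies, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical concept graph.
graph = {
    "energy": {"oil", "gas", "wind"},
    "oil": {"energy", "crude oil"},
    "crude oil": {"oil"},
    "gas": {"energy"},
    "wind": {"energy"},
}

ranking = sorted(betweenness(graph).items(), key=lambda item: item[1], reverse=True)
for concept, value in ranking:
    print(concept, value)
```

Sorting the result and taking the first ten entries gives a "top 10" list of the kind shown on this slide.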
Combined analysis
over network
topology and
reference corpus
20
Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
Combined analysis
over network
topology and
reference corpus
21
Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
Combined analysis
over network
topology and
reference corpus:
Correlation
Betweenness &
Document
Frequency
22
Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
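
One way to relate betweenness to document frequency is a plain Pearson correlation over the concepts, plus a list of concepts that are central in the graph but rare in the corpus. This is a sketch under stated assumptions: the numbers below are invented toy values, not results from the STW Thesaurus or the ‘Crude Oil Market’ corpus, and the slides do not state which correlation measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented toy values per concept (not measured data).
betweenness = {"energy": 5.0, "oil": 3.0, "crude oil": 0.0, "gas": 0.0, "wind": 0.0}
doc_frequency = {"energy": 4, "oil": 9, "crude oil": 12, "gas": 2, "wind": 0}

names = sorted(betweenness)
r = pearson([betweenness[c] for c in names], [doc_frequency[c] for c in names])
print(f"Pearson correlation: {r:.2f}")


def central_but_rare(betweenness, doc_frequency, min_betweenness, max_frequency):
    """Concepts that are central in the graph but seldom occur in the corpus."""
    return sorted(
        c for c in betweenness
        if betweenness[c] >= min_betweenness and doc_frequency.get(c, 0) <= max_frequency
    )


print(central_but_rare(betweenness, doc_frequency, min_betweenness=3.0, max_frequency=5))
```

The thresholds passed to `central_but_rare` are arbitrary here and would need to be chosen per vocabulary and corpus.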
Techniques and Tools
How they help to assess
Taxonomy Quality
23
BARTOC.org
Basel Register of
Thesauri,
Ontologies &
Classifications
▸ Unveil Taxonomy Quality by the Wisdom of the Crowd
24
qSKOS
▸ qSKOS is a tool for finding quality issues in SKOS vocabularies
▸ Available as free online service at http://qskos.poolparty.biz/
▸ A SKOS taxonomy is analyzed with regard to 24 quality issues
25
PoolParty Import
Validator
26
▸ RDF Validation to go beyond SKOS
▸ Checks are defined in RDF; repair strategies are also defined in RDF
▸ 15 checks have been integrated
Shapes Constraint
Language (SHACL)
▸ “Do for RDF what XML Schema does for XML”
▸ Language for validating RDF graphs against a set of conditions
▸ SHACL shape graphs are used to validate that data graphs satisfy a set of
conditions
▸ Current status: W3C Working Draft (14 August 2016)
See also: Towards maintainable constraint validation and repair for taxonomies:
The PoolParty approach (Christian Mader and Monika Solanki)
27
GET YOUR
TEST ACCOUNT
GET CERTIFIED
28
Get your test account at
www.poolparty.biz/demo
Get certified at
www.poolparty.biz/academy/
CONNECT
Andreas Blumauer
CEO, Semantic Web Company
▸ a.blumauer@semantic-web.at
▸ http://at.linkedin.com/in/andreasblumauer
▸ https://twitter.com/semwebcompany
▸ https://www.poolparty.biz
▸ https://www.semantic-web.at
29
© Semantic Web Company - http://www.semantic-web.at/ and http://www.poolparty.biz/
