Talk given by our colleague Claus Stadler at the 11th International Conference on Semantic Systems - SEMANTiCS 2015
Paper available here: http://jens-lehmann.org/files/2015/semantics_dbtax.pdf
An approach to measure how biased a Linked Data dataset is, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
Information Extraction in the TalkOfEurope Creative Camp - Wim Peters
The CLARIN Talk of Europe Creative Camp event in March 2015 invited people to work on the EuroParliament data of the Talk of Europe data set (http://linkedpolitics.ops.few.vu.nl/home)
Our work during that event covers the conceptualization of the content of two data sets:
- English EuroParliament speeches from the Talk of Europe data set and
- UK Parliament speeches.
We performed term extraction, term organisation and the linking of terminology between these two data sets.
Information-rich programming in F# (ML Workshop 2012) - Tomas Petricek
We live in an information-rich world that provides huge opportunities for programmers to explore and create exciting applications. Traditionally, statically typed programming languages have not been aware of the data types that are implicitly available in the outer world such as the web, databases and semi-structured files. This tutorial presents the recent work on F# Type Providers - a language mechanism that enables a smooth integration of diverse data sources into an ML-style programming language. As a case study, we look at a type provider for accessing the world development indicators from the World Bank and we will discuss some intriguing research problems associated with mapping real-world data into an ML-style type system.
This document discusses different data types used in databases and provides an exercise for students to practice identifying the appropriate data type for different fields and entering data into a database table and form. The key data types covered are boolean, integer, currency, date, and string. Students are asked to enter details about 4 team members or henchmen into a database table called "characters" using both the table's datasheet view and a data entry form. An extension activity asks students to identify the appropriate data type for a telephone number field.
Information Extraction from EuroParliament and UK Parliament data - Wim Peters
These slides describe the work done at the CLARIN talk of Europe Creative Camp, in which groups from various countries worked with EuroParliament speeches.
Our work covers term extraction, term organisation and term linking between the EuroParliament and UK Parliament data sets.
Presentation of the QALD-7 challenge at ESWC 2017: Question Answering over Linked Data.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
QALD-7 @ ESWC 2017 Portoroz, Slovenia
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Structural syntactic metrics for RDF Datasets that correlate with high level quality deficiencies.
The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been focused on quality of data prior to their publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets prior to publication on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided and/or resolved prior to the publication of a dataset. We then propose a set of metrics to measure and identify these quality deficiencies in a dataset. This way, we enable the assessment and identification of undesirable quality characteristics of a dataset through our proposed metrics.
Slides for paper presentation at DEXA 2015:
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri:
Quality Metrics for Linked Open Data. DEXA (1) 2015: 144-152
The ability to take data, understand it, visualize it and extract useful information from it is becoming a hugely important skill. How can you turn all those logs, histories of purchases and trades, or open government data into useful information that helps your business make money?
In this talk, we’ll look at doing data science using F#. The F# language is perfectly suited for this task: type providers integrate external data directly into the language, so your language suddenly _understands_ CSV, XML, JSON, REST services and other sources. The interactive development style makes it easy to explore data and test your algorithms as you’re writing them. A rich set of libraries for working with data frames, time series and visualization gives you all the tools you need. And finally, F# easily integrates with statistical environments like R and Matlab, giving you access to the industry-standard libraries.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering and how does it differ from density-based clustering? And how can it be used for outlier detection?
- What is so-called soft clustering and how does it differ from hard clustering? And how can it be used for outlier detection?
A minimal scikit-learn sketch of these ideas follows below.
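None of the talk's material is reproduced here, but a minimal scikit-learn sketch (the data and every parameter value are illustrative assumptions) shows how density-based clustering, OPTICS, and soft clustering can each flag outliers:

```python
# Minimal sketch, assuming toy 2-D data: DBSCAN and OPTICS mark noise
# points with label -1; a Gaussian mixture gives soft memberships.
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),   # dense cluster
    rng.normal(5, 0.3, (100, 2)),   # second dense cluster
    rng.uniform(-2, 7, (10, 2)),    # sparse points: candidate outliers
])

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
opt_labels = OPTICS(min_samples=5).fit(X).labels_
print("DBSCAN outliers:", np.sum(db_labels == -1))
print("OPTICS outliers:", np.sum(opt_labels == -1))

# Soft clustering: each point gets membership probabilities instead of a
# hard label; a low maximum probability can also flag an outlier.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print("least confident point:", X[probs.max(axis=1).argmin()])
```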
Queen Mary University of London Collection Slides - martinbarge
This document describes a project to create an online collection of law PhD thesis abstracts using the FLAX tools to build language learning activities for a pre-sessional English law course. The process involved obtaining permissions to use the abstracts, selecting and uploading them to FLAX, building the collection, creating various activities including cloze tests, sentence matching, and scrambled sentences. Lessons learned included selecting appropriately lengthy texts, consulting the FLAX team on procedures, and allocating sufficient time to build and test the collection.
Learn how to manipulate data frames using the dplyr package by Hadley Wickham. This session will cover select, filter, summarize, tally, group_by, and mutate. Based on the data carpentry ecology lessons
Modeling Social Data, Lecture 3: Data manipulation in R - jakehofman
The document discusses data manipulation in R. It notes that R has some quirks with naming conventions and variable types but is well-suited for exploratory data analysis, generating visualizations, and statistical modeling. The tidyverse collection of R packages, including dplyr and ggplot2, helps make data analysis easier by providing tools for reshaping data into a tidy format with one variable per column and observation per row. Dplyr's verbs like filter, arrange, select, mutate and summarize allow for splitting, applying transformations, and combining data in a functional programming style.
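For readers outside the R ecosystem, a rough pandas analogue of those dplyr verbs may help; the `surveys` data frame and its columns below are invented for illustration, and each step is annotated with the dplyr verb it mimics:

```python
# A rough pandas analogue of dplyr's verbs (illustrative data only).
import pandas as pd

surveys = pd.DataFrame({
    "species": ["owl", "owl", "fox", "fox"],
    "weight":  [120, 150, 3400, 3100],
    "year":    [2001, 2002, 2001, 2002],
})

result = (
    surveys
    .loc[surveys["year"] > 2000]                      # dplyr: filter()
    [["species", "weight"]]                           # dplyr: select()
    .assign(weight_kg=lambda d: d["weight"] / 1000)   # dplyr: mutate()
    .groupby("species")                               # dplyr: group_by()
    .agg(mean_kg=("weight_kg", "mean"),               # dplyr: summarize()
         n=("weight_kg", "size"))                     # dplyr: tally()
)
print(result)
```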
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
On Mining Citations to Primary and Secondary Sources in HistoriographyGiovanni Colavizza
This document discusses the development of a pipeline to extract citations from footnotes in historical texts. It aims to create a resource like Google Scholar tailored for the study of history. The project analyzes journals and monographs related to the history of Venice. The pipeline involves three main steps: 1) detecting text blocks containing footnotes, 2) extracting citations from footnotes, and 3) parsing the elements of each citation. Machine learning methods like SVMs and CRFs are used and challenges include citation variations and data scarcity in the humanities. The goal is to build a database of citations to primary and secondary sources to enable new bibliometric analyses and research services.
Basic introduction to recommender systems + Implementing a content-based recommender system by leveraging knowledge encoded into Linked Open Data datasets
This document discusses recommender systems and linked open data. It begins with an introduction to linked open data, describing its key components like URIs, RDF, and popular vocabularies. It then provides an overview of recommender systems, explaining how they help with information overload by matching users to items. Different recommendation techniques are described like collaborative filtering, content-based, knowledge-based, and hybrid approaches. Evaluation methods for recommender systems like dataset splitting are also briefly covered. The document aims to lay the foundation for discussing how recommender systems can utilize linked open data.
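As a hedged illustration of the content-based, LOD-fed approach sketched above (the items, the DBpedia-style feature names, and the liked-item list are all invented), one can build item vectors from knowledge-graph properties and rank candidates by cosine similarity to a user profile:

```python
# Toy content-based recommender: binary item features stand in for
# properties pulled from a Linked Open Data source such as DBpedia
# (e.g. dbo:genre, dbo:director); all names here are illustrative.
import numpy as np

features = ["genre:SciFi", "genre:Drama", "director:Nolan", "subject:Space"]
items = {
    "Interstellar": [1, 1, 1, 1],
    "Inception":    [1, 0, 1, 0],
    "The_Artist":   [0, 1, 0, 0],
}

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

liked = ["Interstellar"]                       # user profile = mean of liked items
profile = np.mean([items[i] for i in liked], axis=0)

ranking = sorted(
    ((cosine(profile, v), name) for name, v in items.items() if name not in liked),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```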
The document discusses stacks and queues, which are linear data structures that maintain order. Stacks follow LIFO (last in, first out) order, where new elements are added to the top and the top element is removed first. Queues follow FIFO (first in, first out) order, where new elements are added to the rear and elements are removed from the front. The document compares stacks and queues, noting that stacks are used for calculations and function calls while queues are used for character buffers and print queues.
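To make the LIFO/FIFO contrast concrete, here is a minimal Python sketch (idiomatic choices, not taken from the document: a list for the stack, collections.deque for the queue):

```python
from collections import deque

# Stack: LIFO -- push and pop both happen at the top (the list's end).
stack = []
stack.append("a"); stack.append("b"); stack.append("c")
print(stack.pop())      # "c" -- last in, first out

# Queue: FIFO -- enqueue at the rear, dequeue from the front.
queue = deque()
queue.append("a"); queue.append("b"); queue.append("c")
print(queue.popleft())  # "a" -- first in, first out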
Pandas is an open-source Python library used for data manipulation and analysis. It allows users to extract data from files like CSVs into DataFrames and perform statistical analysis on the data. DataFrames are the primary data structure and allow storage of heterogeneous data in tabular form with labeled rows and columns. Pandas can clean data by removing missing values, filter rows/columns, and visualize data using Matplotlib. It supports Series, DataFrames, and Panels for 1D, 2D, and 3D labeled data structures.
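A minimal sketch of that workflow (the file name "sales.csv" and its columns are assumptions for illustration):

```python
# Minimal pandas workflow: load a CSV, clean it, filter, aggregate, plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                     # CSV -> DataFrame
df = df.dropna(subset=["amount"])                 # drop rows with missing values
big = df[df["amount"] > 100]                      # filter rows
summary = big.groupby("region")["amount"].sum()   # aggregate per group
print(summary.describe())                         # quick statistics

summary.plot(kind="bar")                          # visualize via Matplotlib
plt.ylabel("total amount")
plt.show()
```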
Improving Document Clustering by Eliminating Unnatural Language - Jinho Choi
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important source of confusion for existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model identifying unnatural language components in four categories. First, we create a new annotated corpus by collecting slides and papers in various formats, PPT, PDF, and HTML, where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
PhyloTastic: names-based phyloinformatic data integration - Rutger Vos
Lightning talk to the 2013 TDWG conference symposium on phyloinformatics, brief report on PhyloTastic with special attention to the taxonomic name reconciliation service TaxoSaurus.
Academic Writing and Research Data Management - CESSDA Training
This document discusses academic writing standards for research data management and documentation. It provides examples of documentation from the European Values Study conducted in 1981, 1990, 1999, and 2008. The analysis found improvements over time in documenting the sample, methodology, variables, and providing references to allow other researchers to understand and replicate the work. Standards evolved as the replication movement increased, making methodology sections more transparent and data more reusable.
These slides were presented at the "graph databases in life sciences" workshop. There is an accompanying Neo4j guide that will walk you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
A Theoretic Framework for Evaluating Similarity Digesting Tools - Liwei Ren 任力偉
Similarity digesting is a class of algorithms and technologies that generate hashes from files while preserving file similarity. They find applications in various areas across the security industry: malware variant detection, spam filtering, computer forensic analysis, data loss prevention, etc. A few schemes and tools are available, including ssdeep, sdhash and TLSH. While useful for detecting file similarity, they define similarity from different perspectives; in other words, they take different approaches to describe what file similarity is about. In order to compare these tools more rigorously, we introduce a simple mathematical model of similarity that covers all three schemes and beyond. This model enables us to establish a theoretic framework for analyzing the essential differences between various similarity digesting tools. The general use cases proposed by NIST are studied. As a result, a few tools are found to be complementary to each other, so that we can use them in a hybrid approach in practice. Data experiment results are provided to support the theoretic analysis.
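None of the real tools' internals are reproduced here, but a toy digest in Python (shingle the bytes, keep the k smallest hashes, compare digests with Jaccard similarity) illustrates the defining property that similar files yield similar digests:

```python
# Toy similarity digest (illustrative only; NOT ssdeep, sdhash, or TLSH):
# hash overlapping byte 4-grams, keep the k smallest hashes as the digest,
# and score two digests by Jaccard similarity in [0, 1].
import hashlib

def digest(data: bytes, k: int = 32) -> set:
    grams = {data[i:i + 4] for i in range(max(len(data) - 3, 1))}
    hashes = sorted(int.from_bytes(hashlib.sha1(g).digest()[:8], "big")
                    for g in grams)
    return set(hashes[:k])

def similarity(d1: set, d2: set) -> float:
    return len(d1 & d2) / len(d1 | d2) if d1 | d2 else 1.0

a = b"The quick brown fox jumps over the lazy dog" * 20
b = a.replace(b"lazy", b"sleepy")   # a small edit to the same file
c = bytes(range(256)) * 4           # unrelated content
print(similarity(digest(a), digest(b)))  # high: near-duplicate files
print(similarity(digest(a), digest(c)))  # low: unrelated files
```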
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
Workshop NWAV 47 - LVS - Tool for Quantitative Data Analysis - Olga Scrivner
This document provides an overview of the Language Variation Suite (LVS) toolkit. The LVS is a web application designed for sociolinguistic data analysis. It allows users to upload spreadsheet data, perform data cleaning and preprocessing, generate summary statistics and cross tabulations, create data visualizations, and conduct various statistical analyses including regression modeling, clustering, and random forests. The workshop will cover the structure and functionality of the LVS through practical examples and exercises using sample sociolinguistic datasets.
Data Search and Search Joins (Universität Heidelberg 2015) - Chris Bizer
The amount of structured data published on the Web has increased sharply in recent years. The deluge of available data calls for new search techniques that support users in finding and integrating data from large numbers of data sources. In his talk, Christian Bizer will give an overview of the different types of data search that have been proposed so far: entity search, table search, and constrained and unconstrained search joins. As an example of a system from the last category, he will introduce the Mannheim Search Join Engine, which executes unconstrained search joins over different types of Web data including Linked Data, Microdata, Web tables and Wikipedia tables.
Building better knowledge graphs through social computing - Elena Simperl
Elena Simperl discusses how social computing can help build better knowledge graphs. She presents research on how the editing behaviors and diversity of communities impact the quality of knowledge graphs like Wikidata and DBpedia. Her studies found that bot edits, tenure diversity, and interest diversity positively influence item and ontology quality. She also shows how crowdsourcing can enhance knowledge graphs by having experts and non-experts perform different quality assurance tasks, like detecting errors or classifying entities.
1. Machine learning was used to create a decision tree model to diagnose problems in telecommunications networks, achieving 99% accuracy with only 10,000 examples.
2. The model was simplified for comprehensibility, becoming probabilistic and covering 50% of cases with general rules and 50% with specific small disjuncts.
3. Lessons from the success include the importance of model comprehensibility, handling small datasets, addressing systematic errors, and considering future extensions when applying machine learning solutions.
The document discusses using topic modeling techniques to cluster and classify records from multiple OAI repositories to enhance metadata and subject descriptions. Key steps included preprocessing records, building a vocabulary, running topic modeling to generate 500 topics, organizing topics into broad topical categories, and developing a browser to explore topics and records. Evaluation found that the techniques worked well for English repositories but require more testing on other languages and repository types. Potential products and services are proposed, such as integrating the topics into OAIster for subject search and browse.
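A minimal scikit-learn sketch of that kind of pipeline (the records and the tiny topic count are placeholders; the project itself used 500 topics):

```python
# Minimal topic-modeling sketch: vectorize records, fit LDA, inspect topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

records = [
    "medieval manuscripts and paleography",
    "gene expression in yeast cells",
    "manuscript illumination in monasteries",
    "protein folding and cell biology",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(records)               # preprocessing + vocabulary
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):    # top words per topic
    top = weights.argsort()[-4:][::-1]
    print(f"topic {t}:", ", ".join(terms[i] for i in top))

print(lda.transform(X))  # per-record topic mixture, usable for clustering
```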
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine-learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
Search Joins with the Web - ICDT 2014 Invited Lecture - Chris Bizer
The talk will discuss the concept of Search Joins. A Search Join is a join operation which extends a local table with additional attributes based on the large corpus of structured data that is published on the Web in various formats. The challenge for Search Joins is to decide which Web tables to join with the local table in order to deliver high-quality results. Search joins are useful in various application scenarios. They allow for example a local table about cities to be extended with an attribute containing the average temperature of each city for manual inspection. They also allow tables to be extended with large sets of additional attributes as a basis for data mining, for instance to identify factors that might explain why the inhabitants of one city claim to be happier than the inhabitants of another.
In the talk, Christian Bizer will draw a theoretical framework for Search Joins and will highlight how recent developments in the context of Linked Data, RDFa and Microdata publishing, public data repositories as well as crowd-sourcing integration knowledge contribute to the feasibility of Search Joins in an increasing number of topical domains.
This document discusses key aspects of building databases to catalog global biodiversity in the 2000s, including standards, technology, data sharing challenges, and classification methods. It covers how database infrastructure requires stable standards and technology to ensure data accessibility over time. Issues around data ownership, privacy, and ensuring data can be shared and reused across disciplines are also addressed. Classification systems are evolving from paper-based to digital formats using tools like cladistics and computer programs to help organize the vast amounts of data being collected through worldwide biodiversity projects.
This document discusses machine learning challenges posed by hypertext and the web. It presents two examples of applying machine learning to hypertext documents: 1) semi-supervised learning to classify topics of hypertext documents using both text and hyperlinks, and 2) classifying interconnected entities by labeling graphs with many classes. The author proposes models that combine text and link information to better learn from hypertext documents and address issues like "topic drift".
This document outlines a course on data warehousing and data mining. It introduces key concepts like relational databases, data warehouses, dimensional modeling, and data mining techniques. It also details the course objectives, schedule, assignments, and policies. The goal is for students to gain experience applying data mining methods and understanding the relationship between data mining and other fields.
This document discusses using qualitative research software like WebCT and N6 to collect and analyze online discussion data. It outlines a three stage data collection strategy including open, axial, and selective coding. Advantages of computer assisted qualitative data analysis include organization, systematic approaches, and time savings. Disadvantages include complex software, loss of context, and potential data loss. The document demonstrates exporting discussion data, open coding to develop categories and properties, transforming free nodes to a tree structure, and using text searching to support research variables in analysis.
Using Computer as a Research Assistant in Qualitative Research - JoshuaApolonio1
This document discusses using qualitative research software to collect and analyze online discussion data. It demonstrates exporting discussion data from WebCT into N6 for coding. A three-stage data collection strategy is outlined, beginning with open coding to generate categories and properties, then axial coding to interconnect categories, and ending with selective coding to build a theoretical model. Advantages of this approach include organization of large data sets and time savings, while disadvantages include complexity of software and potential to lose sight of data contexts.
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
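The talk itself is about compressed, indexed representations that make such scales feasible on a laptop; purely to show the query shape, here is a hedged rdflib sketch on a small file (the file name and triple pattern are illustrative, and plain rdflib would of course not handle 28 billion triples):

```python
# Query shape only: rdflib over a small Turtle file. Scaling to billions
# of triples on a laptop needs a compressed index (the subject of the
# talk); "dataset.ttl" and the pattern below are assumptions.
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")

q = """
SELECT ?s ?name WHERE {
  ?s a <http://dbpedia.org/ontology/Monarchy> ;
     <http://www.w3.org/2000/01/rdf-schema#label> ?name .
} LIMIT 10
"""
for row in g.query(q):
    print(row.s, row.name)
```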
Machine Learning for Understanding Biomedical Publications - Grigorios Tsoumakas
This document discusses machine learning techniques for understanding biomedical publications. It describes multi-label classification approaches for semantic indexing of biomedical literature and modality classification of figures. It also discusses ensemble methods, multi-label learning, and applications to tasks like article screening in systematic reviews and PICO sentence identification.
Web Services: Encapsulation, Reusability, and Simplicity - hannonhill
The document discusses web services and their encapsulation, reusability, and simplicity. It covers topics like hiding usernames/passwords, using fully qualified identifiers to locate nodes, and creating reusable classes like Asset and Property. Code examples show how to retrieve assets, work with data definition blocks, and traverse an asset tree to publish pages simply using global functions. The presentation aims to highlight best practices for web services development.
What Are Links in Linked Open Data? A Characterization and Evaluation of Link... - Armin Haller
Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. In this talk I argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. In this talk I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which has not been part of the Linked Open Data Cloud, will also be presented.
Text Analysis: Latent Topics and Annotated Documents - Nelson Auner
This document describes a cluster model for combining latent topics with document attributes in text analysis. It introduces topic models and describes how metadata can be incorporated. The model restricts each document to one topic to allow collapsing observations. An algorithm is provided and applied to congressional speech and restaurant review data. Results show the model can recover topics similarly to topic models, while also capturing variation explained by metadata like political affiliation or review rating.
Similar to Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Kick-off seminar of the largest Wikimedia IEG, 2015 round 2 call.
In conjunction with Wikipedia's 15th birthday.
Project page: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
This document outlines a Google Summer of Code project to teach machines to extract facts from Wikipedia articles by using machine learning and lexical semantics. It discusses extracting lexical units through part-of-speech tagging and statistical ranking, classifying frames and frame elements in an unsupervised or supervised manner, constructing a crowdsourced training set, and serializing the extracted facts into RDF triples for inclusion in DBpedia to discover new relations and populate the knowledge base automatically. The approach is demonstrated on soccer domain articles from the Italian Wikipedia.
The document discusses using Linked Open Data from DBpedia to help with Unicode localization interoperability (ULI). DBpedia extracts structured data from Wikipedia and makes it available as Linked Data. It describes how ULI aims to standardize localization data exchange between tools. DBpedia data on abbreviations in over 100 languages was extracted and evaluated, finding it could help improve text segmentation precision and recall. The extracted data is being considered for inclusion in the Common Locale Data Repository (CLDR) to further standardization efforts.
DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism - Marco Fossati
DBpedia extracts structured data from Wikipedia to create a multilingual linked open data cloud. It has language-specific chapters that map data in different languages to a common structure. This enables multilingual queries over the data and use cases like helping with text segmentation by modeling abbreviations. Mapping sprints help create high-quality data in new languages, like the first Italian DBpedia mapping done in a high school hackathon.
This document discusses challenges and solutions related to data quality. It addresses issues with template-dependent and fully manual mapping approaches and proposes machine learning-based methods and mapping assistants as solutions. It also discusses problems with community-based ontologies like lack of coverage and proposes consistency checks and data-driven schemas using sources like Wikipedia categories to address them. Finally, it lists various multimedia data sources for photos, audio and video that could be linked.
This document discusses outsourcing FrameNet annotation to crowdsourcing. It presents a two-step and simplified one-step methodology for crowdsourcing frame and semantic role annotation. Experiments using these methods on the CrowdFlower platform showed that the simplified one-step approach had higher accuracy and was faster than the two-step approach. Lessons learned include that definitions need to be simplified for non-experts and negation and modality are difficult concepts. Further research directions include larger-scale experiments and linking entities to structured knowledge bases.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl - ... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at the IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
1. Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Presented by Claus Stadler
Vienna, 17th September 2015
Marco Fossati, Dimitris Kontokostas, and Jens Lehmann
5. Heterogeneous granularity
Lack of coverage: 2.8 M typed resources out of 4.9 M
Wikipedia Category System
Chaotic: cycles
Too fine-grained: "Radio Stations in Traverse City, Michigan"
DBpedia ontology (DBPO)
Organisation > Band > SambaSchool > ???
13. Stage 1: Leaf Node Extraction
INPUT = cyclic graph; OUTPUT = tree
Bottom-up approach: from the leaves to the root
Extract categories linked to actual articles only
Set of categories with no sub-categories = Leaf Nodes Set
Examples: Inuit_deities, Ugandan_monarchies, Inuit_goddesses
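A hedged Python sketch of this stage; the dict-based encoding of the category graph is an assumption for illustration, not the paper's implementation:

```python
# Sketch of Stage 1 (leaf node extraction) over a toy category graph.
children = {                       # category -> sub-categories
    "Deities": {"Inuit_deities"},
    "Inuit_deities": {"Inuit_goddesses"},
    "Inuit_goddesses": set(),
    "Ugandan_monarchies": set(),
}
articles = {                       # category -> article pages it links to
    "Inuit_goddesses": {"Sedna_(mythology)"},
    "Inuit_deities": {"Nanook"},
    "Ugandan_monarchies": {"Buganda"},
    "Deities": set(),
}

# Leaf nodes: categories with no sub-categories that link to actual articles.
leaves = {c for c, subs in children.items() if not subs and articles[c]}
print(leaves)  # {'Inuit_goddesses', 'Ugandan_monarchies'}
```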
14. Stage 2: Prominent Node Discovery
(A) Leaf Graph Traversal
(B) Natural Language Processing for is-a relations
(C) Interlanguage Links Weight
15. Stage 2A: Leaf Graph Traversal
INPUT = leaf nodes set
For each leaf L:
  Get its parents;
  For each parent P:
    Are all its children leaves?
    YES: P is a prominent node
    NO: L is a prominent node
Examples: Inuit_goddesses, Inuit_deities, Ugandan_monarchies
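A hedged sketch of this traversal rule, reusing the toy encoding from the Stage 1 sketch (again illustrative, not the authors' code):

```python
# Sketch of Stage 2A: a parent is prominent if ALL of its children are
# leaves; otherwise the leaf itself is prominent.
children = {
    "Inuit_deities": {"Inuit_goddesses"},
    "Religion_in_Uganda": {"Ugandan_monarchies", "Ugandan_bishops"},
}
parents = {
    "Inuit_goddesses": {"Inuit_deities"},
    "Ugandan_monarchies": {"Religion_in_Uganda"},
}
leaves = {"Inuit_goddesses", "Ugandan_monarchies"}

prominent = set()
for leaf in leaves:
    for parent in parents.get(leaf, ()):
        if children[parent] <= leaves:   # are all its children leaves?
            prominent.add(parent)        # YES: the parent is prominent
        else:
            prominent.add(leaf)          # NO: the leaf itself is prominent
print(prominent)  # {'Inuit_deities', 'Ugandan_monarchies'}
```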
16. Stage 2B: NLP for is-a relations
Category = Noun Phrase (NP)
HEAD extraction via shallow syntactic parsing
Is the HEAD plural?
YES: class candidate; depluralize
Examples: Deity, Monarchy
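A heavily simplified sketch of this step; a real implementation would use a shallow syntactic parser, so the string heuristics below are toy assumptions:

```python
# Toy sketch of Stage 2B: extract the head of a category's noun phrase
# and keep it as a class candidate if the head is plural.
def head(category: str) -> str:
    # Naive head rule: the token before a preposition, else the last token.
    tokens = category.replace("_", " ").split()
    for i, tok in enumerate(tokens):
        if tok.lower() in {"of", "in", "from", "by"}:
            return tokens[i - 1]
    return tokens[-1]

def depluralize(noun: str) -> str:
    if noun.endswith("ies"):
        return noun[:-3] + "y"     # monarchies -> monarchy
    if noun.endswith("s"):
        return noun[:-1]           # gods -> god
    return noun

for cat in ["Inuit_deities", "Ugandan_monarchies"]:
    h = head(cat)
    if h.endswith("s"):                          # is the head plural?
        print(cat, "->", depluralize(h).capitalize())
# Inuit_deities -> Deity ; Ugandan_monarchies -> Monarchy
```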
17. Stage 2C: Interlanguage Links Weight
The more interlanguage links a category has, the more it is used across language editions
Prune categories with interlanguage links < Threshold
Threshold = 3
Examples: Inuit_deities, Ugandan_monarchies
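The pruning itself is a one-liner; a sketch with invented link counts:

```python
# Sketch of Stage 2C: keep only categories with enough interlanguage
# links. The counts below are invented for illustration.
interlang_links = {"Inuit_deities": 12, "Ugandan_monarchies": 5,
                   "Radio_stations_in_Traverse_City,_Michigan": 1}
THRESHOLD = 3
kept = {c for c, n in interlang_links.items() if n >= THRESHOLD}
print(kept)  # {'Inuit_deities', 'Ugandan_monarchies'}
```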
19. Stage 4: A-Box
INPUT = prominent node heads
For each prominent node head H:
  Extract the category set with head = H;
  Extract the page set for each category;
  For each page P:
    Is it an article page?
    YES: <P, instance-of, H>
    NO: repeat until it is
Example: <Bengal_Sultanate, instance-of, Monarchy>
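A hedged sketch of the A-Box typing loop (toy data structures; the recursion into subcategories mirrors the "repeat until it is" step):

```python
# Sketch of Stage 4 (A-Box population): for every prominent head H, type
# every article reachable from the categories with that head.
cats_by_head = {"Monarchy": ["Ugandan_monarchies", "Sultanates"]}
members = {  # category -> (article pages, sub-categories)
    "Ugandan_monarchies": ({"Buganda"}, []),
    "Sultanates": (set(), ["Sultanates_in_Asia"]),
    "Sultanates_in_Asia": ({"Bengal_Sultanate"}, []),
}

def type_pages(category, head, triples):
    pages, subcats = members[category]
    for p in pages:                      # article page: emit a triple
        triples.append((p, "instance-of", head))
    for sub in subcats:                  # not an article: recurse until it is
        type_pages(sub, head, triples)

triples = []
for head, cats in cats_by_head.items():
    for cat in cats:
        type_pages(cat, head, triples)
print(triples)  # [('Buganda', 'instance-of', 'Monarchy'),
                #  ('Bengal_Sultanate', 'instance-of', 'Monarchy')]
```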
24. T-Box Evaluation: Questions (1/2)
• “Is this a class or an instance?”
Restaurant VS Puella_Magi_Madoka_Magica (movie)
• “Can this class be broken down into more than one class?”
Mountain VS Musical_groups_from_Gothenburg
• “Is this a valid class hierarchy path?”
wikicategory_Golden_Bear_winners < yagoLegalActorGeo < owl#Thing
25. T-Box Evaluation: Questions (2/2)
• “Is this hierarchy too specific?” (too many levels)
Porter_County,_Indiana < Chicago_metropolitan_area < Metropolitan_areas_of_Illinois < Populated_places_in_Illinois < owl#Thing
• “Is this hierarchy too broad?” (very few levels)
Gonorynchiforme (fish family) < owl:Thing
27. A-Box Evaluation: Settings
Crowdsourced to the layman
Evaluation set: 500 random entities with no type in DBpedia
5 judgments per entity
Prevent a worker from answering the same question twice
28. A-Box Evaluation: Test Questions
Automatically discard untrusted judgments
Untrusted worker: < 80% correct test questions
Subjective task: missed test questions
  They affect the # of untrusted judgments
  The class label may be ambiguous
32. Advantages
Exhaustive coverage (almost 100%)
Type coverage comparison
Recall in A-Box evaluation
Intuitive
Crowdsourced (the layman) A-Box evaluation
Least # of untrusted judgments
33. Drawbacks
Short hierarchy paths
Cycle removal
Instance pruning
Relatively low precision
NLP may still yield "weird" is-a relations
"Elvis Presley is a Burial"
35. Conclusion
Significant type coverage leap
Intuitive for end users
Balance between DBPO (too generic) and YAGO (too specific)
Integrated in the latest DBpedia release
36. Future Work
Merge the T-Box into mappings.dbpedia.org for curation
Word Sense Disambiguation for homonymous classes
Multilingual deployment (currently English and Italian)
38. Thanks for your attention!
Download DBTax at:
http://downloads.dbpedia.org/current/core-i18n/en/
Browse the Italian DBTax at:
http://it.dbpedia.org/sparql
Contact the first author at:
fossati@fbk.eu