Starting with Cyc in the 1980s, the collection of general
knowledge in machine interpretable form has been considered a valuable ingredient in intelligent and knowledge intensive applications. Notable contributions in the field include the Wikipedia-based datasets DBpedia and YAGO, as well as the collaborative knowledge base Wikidata. Since Google has coined the term in 2012, they are most often referred to as knowledge graphs. Besides such open knowledge graphs, many companies have started using corporate knowledge graphs as a means of information representation.
In this talk, I will look at two ongoing projects related to the extraction of knowledge graphs from Wikipedia and other Wikis. The first new dataset, CaLiGraph, aims at the generation of explicit formal definitions from categories, and the extraction of new instances from list pages. In its current release, CaLiGraph contains 200k axioms defining classes,
and more than 7M typed instances. In the second part, I will look at the transfer of the DBpedia approach to a multitude of arbitrary Wikis. The first such prototype, DBkWik, extracts data from Fandom, a Wiki farm hosting more than 400k different Wikis on various topics. Unlike DBpedia, which relies on a larger user base for crowdsourcing an explicit schema and extraction rules, and the "one-page-per-entity" assumption, DBkWik has to address various challenges in the fields of schema learning and data integration. In its current release, DBkWik contains more than 11M entities, and has been found to be highly complementary to DBpedia.
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
1. 12/10/2019 Heiko Paulheim 1
Beyond DBpedia and YAGO
–
The New Kids
on the Knowledge Graph Block
Heiko Paulheim
2. 12/10/2019 Heiko Paulheim 2
The New Kids on the (Knowledge Graph) Block
Subjective age:
Measured by the fraction
of the audience
that understands a reference
to your young days’
pop culture...
3. 12/10/2019 Heiko Paulheim 3
Knowledge Graphs: Out of the Dark
Google’s
Announcement
DBpedia
YAGO
ResearchCyc WikidataFreebase
NELL
4. 12/10/2019 Heiko Paulheim 4
A Brief History of Knowledge Graphs
• The 1980s: Cyc
– Cyc: Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules
should do the job
of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
5. 12/10/2019 Heiko Paulheim 5
A Brief History of Knowledge Graphs
• The 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
6. 12/10/2019 Heiko Paulheim 6
Getting the Most out of Wikipedia
• Study for KG-based
Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n
Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
7. 12/10/2019 Heiko Paulheim 7
Combining the Best of Three Worlds
• DBpedia: detailed instance / relation extraction
• YAGO: detailed classes
• Cyc: rich axiomatization
• Goal: get the best of all those worlds
public
private
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic Web 8:3 (2017), pp. 489-508
8. 12/10/2019 Heiko Paulheim 8
Towards CaLiGraph
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
hometown :United_States ;
genre :Industrial .
9. 12/10/2019 Heiko Paulheim 9
Cat2Ax: Axiomatizing Wikipedia Categories
Albums
Albums
by genre
Albums
by artist
Nine Inch
Nails
albums
The Doors
albums
Rock
albums
Pop
albums
Reggae
albums
The
Beatles
albums
... ...
...
dbo:Album
dbo:artist.{dbr:Nine_Inch_Nails}
dbo:genre.{dbr:Rock_Music}
10. 12/10/2019 Heiko Paulheim 10
Cat2Ax: Axiomatizing Wikipedia Categories
Albums
Albums
by genre
Albums
by artist
Nine Inch
Nails
albums
The Doors
albums
Rock
albums
Pop
albums
Reggae
albums
The
Beatles
albums
... ...
...
dbo:genre.{dbr:Rock_Music} ?
dbo:artist.{dbr:Rock_(Rapper)} ?
11. 12/10/2019 Heiko Paulheim 11
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}?
– Lexical score: likelihood of term as a surface form of object
• i.e.: how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories sharing similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
Albums
Albums
by genre
Albums
by artist
Nine Inch
Nails
albums
The Doors
albums
Rock
albums
Pop
albums
Reggae
albums
The
Beatles
albums
... ...
...
13. 12/10/2019 Heiko Paulheim 13
Improving Instance Coverage:
Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages
14. 12/10/2019 Heiko Paulheim 14
CaLiGraph Example
Category: Musical Groups established
in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
15. 12/10/2019 Heiko Paulheim 15
CaLiGraph Statistics
• Class hierarchy from categories and list pages
– 750k classes
• Axioms from categories and list pages
– 200k definitions
• Instances from list pages
– 870k new instances
19. 12/10/2019 Heiko Paulheim 19
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
DBpedia
Extraction
Framework
20. 12/10/2019 Heiko Paulheim 20
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Media Wiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
DBpedia
Extraction
Framework
21. 12/10/2019 Heiko Paulheim 21
What if…?
• What if we went from Wikipedia every MediaWiki?
• According to WikiApiary, there’s thousands...
24. 12/10/2019 Heiko Paulheim 24
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
25. 12/10/2019 Heiko Paulheim 25
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
26. 12/10/2019 Heiko Paulheim 26
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
27. 12/10/2019 Heiko Paulheim 27
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
28. 12/10/2019 Heiko Paulheim 28
Absence of Mappings and Ontology
• Every infobox becomes a class:
{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
32. 12/10/2019 Heiko Paulheim 32
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score... Internal Linking Linking to DBpedia
Classes .979 .898
Properties .836 .865
Instances .879 .657
maybe...
33. 12/10/2019 Heiko Paulheim 33
Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• In 2019, some tools beat the baseline
– albeit by a small margin only
34. 12/10/2019 Heiko Paulheim 34
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
35. 12/10/2019 Heiko Paulheim 35
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
• e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations
of actual distributions
• Result:
~100k new instance types
Person?
Artist
Person
39. 12/10/2019 Heiko Paulheim 39
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
40. 12/10/2019 Heiko Paulheim 40
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and Dbpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown),
T M is the true part of M (unknown)
• By definition:
– P = |T| / |M|
→ |T| = P * |M|
– R = |T| / |O|
→ |T| = R * |O|
→ |O| = |M| * P / R
DBkWik DBpedia
41. 12/10/2019 Heiko Paulheim 41
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and Dbpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik
are not in DBpedia
– 90% of all entities in DBpedia
are not in DBkWik
DBkWik DBpedia
42. 12/10/2019 Heiko Paulheim 42
NewNif
Extractor
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
Source
Simple
WikiParser
LinkExtractor
Page
NifExtractor
AST
Destination
Graph
HTML
43. 12/10/2019 Heiko Paulheim 43
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
44. 12/10/2019 Heiko Paulheim 44
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
45. 12/10/2019 Heiko Paulheim 45
Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– e.g., identify outdated information
46. 12/10/2019 Heiko Paulheim 46
Contributors
• Contributors (past&present)
Sven Hertling Alexandra
Hofmann
Samresh
Perchani
Jan Portisch Nicolas
Heist
47. 12/10/2019 Heiko Paulheim 47
Beyond DBpedia and YAGO
–
The New Kids
on the Knowledge Graph Block
Heiko Paulheim