Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

12/10/2019, Heiko Paulheim

The New Kids on the (Knowledge Graph) Block
Subjective age: measured by the fraction of the audience that understands a reference to the pop culture of your younger days...

Knowledge Graphs: Out of the Dark
[Figure: timeline of knowledge graphs, showing ResearchCyc, DBpedia, YAGO, Freebase, NELL, Wikidata, and Google’s announcement]

A Brief History of Knowledge Graphs
• The 1980s: Cyc
– Cyc: Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules

A Brief History of Knowledge Graphs
• The 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up

Getting the Most out of Wikipedia
• Study for KG-based Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/

Combining the Best of Three Worlds
• DBpedia: detailed instance / relation extraction
• YAGO: detailed classes
• Cyc: rich axiomatization
• Goal: get the best of all those worlds
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508

Towards CaLiGraph
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
    :hometown :United_States ;
    :genre :Industrial .

Cat2Ax: Axiomatizing Wikipedia Categories
[Figure: category tree. Albums has the subcategories Albums by genre (Rock albums, Pop albums, Reggae albums, ...) and Albums by artist (Nine Inch Nails albums, The Doors albums, The Beatles albums, ...)]
• Example axioms derived from categories:
– ⊆ dbo:Album
– ⊆ ∃dbo:artist.{dbr:Nine_Inch_Nails}
– ⊆ ∃dbo:genre.{dbr:Rock_Music}

• Candidate axioms are ambiguous (e.g., for Rock albums):
– ∃dbo:genre.{dbr:Rock_Music} ?
– ∃dbo:artist.{dbr:Rock_(Rapper)} ?

– Frequency: how often does the pattern occur in a category?
• i.e., what share of instances have dbo:genre.{dbr:Rock_Music}?
– Lexical score: likelihood of a term being a surface form of the object
• i.e., how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories to share similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
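
The three scores above can be illustrated with a minimal Python sketch. This is not the Cat2Ax implementation: the function names, the toy data, and in particular the way the sibling score is aggregated (mean of the best score per sibling) are assumptions made for illustration only.

```python
from collections import Counter

def frequency(instances, facts, pattern):
    """Share of a category's instances that have the (property, object) pattern."""
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if pattern in facts.get(inst, set()))
    return hits / len(instances)

def lexical_score(surface_forms, term, obj):
    """Share of uses of `term` as a surface form that refer to `obj`."""
    counts = surface_forms.get(term, Counter())
    total = sum(counts.values())
    return counts.get(obj, 0) / total if total else 0.0

def sibling_score(sibling_pattern_scores, prop):
    """Mean over sibling categories of their best score for a pattern using `prop`.
    The aggregation is an assumption of this sketch."""
    if not sibling_pattern_scores:
        return 0.0
    best = [
        max((s for (p, _), s in patterns.items() if p == prop), default=0.0)
        for patterns in sibling_pattern_scores.values()
    ]
    return sum(best) / len(best)

# Hypothetical toy data for the category "Rock albums"
rock_albums = ["album1", "album2", "album3", "album4"]
facts = {
    "album1": {("dbo:genre", "dbr:Rock_Music")},
    "album2": {("dbo:genre", "dbr:Rock_Music")},
    "album3": {("dbo:genre", "dbr:Rock_Music"), ("dbo:artist", "dbr:The_Doors")},
    "album4": {("dbo:artist", "dbr:Rock_(Rapper)")},
}
surface_forms = {"Rock": Counter({"dbr:Rock_Music": 9, "dbr:Rock_(Rapper)": 1})}
siblings = {
    "Pop albums": {("dbo:genre", "dbr:Pop_music"): 0.75},
    "Reggae albums": {("dbo:genre", "dbr:Reggae"): 0.5},
}

genre = ("dbo:genre", "dbr:Rock_Music")
rapper = ("dbo:artist", "dbr:Rock_(Rapper)")
for prop, obj in (genre, rapper):
    print(prop, obj,
          frequency(rock_albums, facts, (prop, obj)),
          lexical_score(surface_forms, "Rock", obj),
          sibling_score(siblings, prop))
```

On this toy data, the genre pattern scores higher than the rapper pattern on all three measures, which is the disambiguation the scores are meant to support.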

• Results

Improving Instance Coverage: Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages

CaLiGraph Example
Category: Musical Groups established in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music

CaLiGraph Statistics
• Class hierarchy from categories and list pages
– 750k classes
• Axioms from categories and list pages
– 200k definitions
• Instances from list pages
– 870k new instances

CaLiGraph Glitches

The Future of CaLiGraph
• Going beyond red links
• Going beyond explicit lists

Knowledge Graph Creation Beyond Wikipedia

A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia dump (+ mappings)
• Output:
– DBpedia

An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki dump (+ mappings)
• Output:
– A Knowledge Graph

What if…?
• What if we went from Wikipedia to every MediaWiki?
• According to WikiApiary, there are thousands...

Why?
• More is better (maybe)

Why?
• Overcoming Wikipedia’s coverage bias

A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens

DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps

Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release

DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki

Absence of Mappings and Ontology
• Every infobox becomes a class:
  {{infobox actor
  → mywiki:actor a owl:Class
• Every infobox key becomes a property:
  |role = Harry’s mother
  → mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
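
The raw extraction described above can be sketched in a few lines of Python. This is an illustration, not the DBpedia Extraction Framework: the parser is deliberately naive (it does not handle nested templates or piped links), and the page name `Some_Actor`, the `mywiki` prefix handling, and the typing triple for the page are assumptions.

```python
import re

def parse_infobox(wikitext):
    """Naive parser: returns (template_name, {key: value}).
    Assumption: no nested templates and no '|' inside values."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template call")
    parts = [p.strip() for p in body[2:-2].split("|")]
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return parts[0], params

def raw_triples(prefix, page, wikitext):
    """Every infobox becomes a class, every infobox key becomes a property."""
    name, params = parse_infobox(wikitext)
    # Strip a leading "infobox" and use the rest as the class name
    cls = re.sub(r"^infobox\s*", "", name, flags=re.IGNORECASE).replace(" ", "_")
    triples = [
        (f"{prefix}:{cls}", "a", "owl:Class"),
        (f"{prefix}:{page}", "a", f"{prefix}:{cls}"),  # typing triple: assumption
    ]
    for key, value in params.items():
        triples.append((f"{prefix}:{key}", "a", "rdf:Property"))
        # No ranges are known, so every value is kept as a plain literal
        triples.append((f"{prefix}:{page}", f"{prefix}:{key}", f'"{value}"'))
    return triples

example = "{{infobox actor\n|role = Harry's mother\n}}"
for triple in raw_triples("mywiki", "Some_Actor", example):
    print(*triple, ".")
```

Because nothing beyond the template name and its keys is available, the output has no class hierarchy, no object/data property distinction, and no domains or ranges, which is the shallowness noted above.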

Duplicates
• Collecting Data from a Multitude of Wikis

Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption = Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}

{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]], [[Pennsylvania]], [[United States]]
...
}}

{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States
...
}}

Data Fusion

Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier (maybe...)

F1 score      Internal Linking   Linking to DBpedia
Classes       .979               .898
Properties    .836               .865
Instances     .879               .657
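
A string-similarity schema matcher of the kind named above can be sketched as follows. The slides do not say which similarity measure or threshold was used, so the use of `difflib.SequenceMatcher` from the Python standard library, the 0.9 threshold, and the example labels are all assumptions. The doc2vec-based instance matching is not covered here.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schema(source_labels, target_labels, threshold=0.9):
    """Map each source label to its most similar target label,
    keeping only matches at or above the (assumed) threshold."""
    matches = {}
    for source in source_labels:
        best = max(target_labels, key=lambda t: similarity(source, t), default=None)
        if best is not None and similarity(source, best) >= threshold:
            matches[source] = best
    return matches

# Hypothetical property labels from a wiki and from a target ontology
wiki_properties = ["birthdate", "Actor", "xyzzy"]
target_properties = ["birthDate", "Actor", "location"]
print(match_schema(wiki_properties, target_properties))
```

A pure label comparison like this cannot tell apart homonyms such as generic keys that mean different things in different Wikis, which is one plausible source of false positives.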

Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• In 2019, some tools beat the baseline
– albeit by a small margin only

Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location

Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
• e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations of actual distributions
• Result: ~100k new instance types
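
The idea of learning subsumption axioms such as Artist(x) → Person(x) with association rule mining can be sketched as below. This is a simplified illustration, not the DBkWik code: the support and confidence thresholds and the toy instance data are assumptions.

```python
from collections import Counter
from itertools import permutations

def mine_subclass_axioms(instance_types, min_support=2, min_confidence=0.95):
    """Return rules (A, B, confidence) read as A(x) -> B(x).

    confidence = |instances typed A and B| / |instances typed A|
    Thresholds are assumptions made for this sketch."""
    type_count = Counter()
    pair_count = Counter()
    for types in instance_types.values():
        type_count.update(types)
        # Count every ordered pair of distinct co-occurring types
        pair_count.update(permutations(sorted(types), 2))
    axioms = []
    for (a, b), both in pair_count.items():
        confidence = both / type_count[a]
        if both >= min_support and confidence >= min_confidence:
            axioms.append((a, b, confidence))
    return axioms

# Hypothetical typed instances
instance_types = {
    "x1": {"Artist", "Person"},
    "x2": {"Artist", "Person"},
    "x3": {"Person"},
    "x4": {"Person", "Athlete"},
}
print(mine_subclass_axioms(instance_types))
```

Here Artist → Person is kept, Person → Artist fails the confidence threshold, and Athlete → Person fails the support threshold. Two classes that always co-occur would yield rules in both directions, so a real pipeline would need to decide how to treat such equivalences.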

Big Picture
[Figure: DBkWik pipeline. Components: Dump Downloader; DBpedia Extraction Framework; Interlinking (Instance Matcher, Schema Matcher); Internal Linking (Instance Matcher, Schema Matcher); Knowledge Graph Fusion; Ontology (Domain/Range, Subclass); Type (SDType Light); Materialization. Artifacts: MediaWiki Dumps; Extracted RDF; Consolidated Knowledge Graph; DBkWik Linked Data Endpoint]

DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
                   Raw            Final
Instances          14,212,535     11,163,719
Typed instances    1,880,189      1,372,971
Triples            107,833,322    91,526,001
Avg. indegree      0.624          0.703
Avg. outdegree     7.506          8.169
Classes            71,580         12,029
Properties         506,487        128,566

DBkWik 1.1
• Fused graphs from 15k Wikis
http://dbkwik.webdatacommons.org/

DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!

DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M|
→ |T| = P * |M|
– R = |T| / |O|
→ |T| = R * |O|
→ |O| = |M| * P / R
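
The derivation above translates directly into a small helper. The function name and the example numbers are made up for illustration; they are not the values measured for DBkWik.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from a mapping of size |M|.

    From P = |T| / |M| and R = |T| / |O| it follows that |O| = |M| * P / R.
    """
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    return mapping_size * precision / recall

# Hypothetical example: a mapping with 1,000 links, half of them correct (P = 0.5),
# which finds only a quarter of the true overlap (R = 0.25)
true_links = 1000 * 0.5                      # |T| = P * |M|
overlap = estimate_overlap(1000, 0.5, 0.25)  # |O| = |T| / R
print(true_links, overlap)
```

The estimate is only as good as the measured P and R, which in practice come from a gold standard sample.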

DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik are not in DBpedia
– 90% of all entities in DBpedia are not in DBkWik

Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Figure: extraction pipeline with the components Source, SimpleWikiParser, LinkExtractor, NifExtractor, NewNifExtractor, and Destination, and the intermediate artifacts Page, AST, HTML, and Graph]

Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles

Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– e.g., identify outdated information

Contributors
• Contributors (past & present)
– Sven Hertling
– Alexandra Hofmann
– Samresh Perchani
– Jan Portisch
– Nicolas Heist
