From a bird's eye view, the DBpedia Extraction Framework takes a MediaWiki dump as input and turns it into a knowledge graph. In this talk, I discuss the creation of the DBkWik knowledge graph by applying the DBpedia Extraction Framework to thousands of Wikis.
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block (Heiko Paulheim)
Starting with Cyc in the 1980s, the collection of general knowledge in machine-interpretable form has been considered a valuable ingredient in intelligent and knowledge-intensive applications. Notable contributions in the field include the Wikipedia-based datasets DBpedia and YAGO, as well as the collaborative knowledge base Wikidata. Since Google coined the term in 2012, such datasets have most often been referred to as knowledge graphs. Besides such open knowledge graphs, many companies have started using corporate knowledge graphs as a means of information representation.
In this talk, I will look at two ongoing projects related to the extraction of knowledge graphs from Wikipedia and other Wikis. The first new dataset, CaLiGraph, aims at the generation of explicit formal definitions from categories, and the extraction of new instances from list pages. In its current release, CaLiGraph contains 200k axioms defining classes, and more than 7M typed instances. In the second part, I will look at the transfer of the DBpedia approach to a multitude of arbitrary Wikis. The first such prototype, DBkWik, extracts data from Fandom, a Wiki farm hosting more than 400k different Wikis on various topics. Unlike DBpedia, which relies on a larger user base for crowdsourcing an explicit schema and extraction rules, and on the "one-page-per-entity" assumption, DBkWik has to address various challenges in the fields of schema learning and data integration. In its current release, DBkWik contains more than 11M entities, and has been found to be highly complementary to DBpedia.
Machine Learning with and for Semantic Web Knowledge Graphs (Heiko Paulheim)
Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
Knowledge Graphs, such as DBpedia, YAGO, or Wikidata, are valuable resources for building intelligent applications like data analytics tools or recommender systems. Understanding what is in those knowledge graphs is a crucial prerequisite for selecting a Knowledge Graph for a task at hand. Hence, Knowledge Graph profiling - i.e., quantifying the structure and contents of knowledge graphs, as well as their differences - is essential for fully utilizing the power of Knowledge Graphs. In this paper, I will discuss methods for Knowledge Graph profiling, depict crucial differences of the big, well-known Knowledge Graphs, like DBpedia, YAGO, and Wikidata, and throw a glance at current developments of new, complementary Knowledge Graphs such as DBkWik and WebIsALOD.
Using knowledge graphs in data mining typically requires a propositional, i.e., vector-shaped representation of entities. RDF2vec is one approach for generating such vectors from knowledge graphs, relying on random walks for extracting pseudo-sentences from a graph, and utilizing word2vec for creating embedding vectors from those pseudo-sentences. In this talk, I will give insights into the idea of RDF2vec, possible application areas, and recently developed variants incorporating different walk strategies and training variations.
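As a rough illustration of that pipeline, the following sketch generates random walks over a toy graph and feeds the resulting pseudo-sentences to word2vec. The graph, entity names, and hyperparameters are purely illustrative and are not those of the published RDF2vec models.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy knowledge graph as a directed, edge-labeled graph (illustrative data).
kg = nx.MultiDiGraph()
kg.add_edge("Mannheim", "Germany", label="locatedIn")
kg.add_edge("Heiko_Paulheim", "Mannheim", label="worksIn")
kg.add_edge("Germany", "Europe", label="partOf")

def random_walk(graph, start, depth=4):
    """One random walk, alternating entities and edge labels (a 'pseudo-sentence')."""
    walk, node = [start], start
    for _ in range(depth):
        edges = list(graph.out_edges(node, data="label"))
        if not edges:
            break
        _, nxt, label = random.choice(edges)
        walk.extend([label, nxt])
        node = nxt
    return walk

sentences = [random_walk(kg, n) for n in kg.nodes for _ in range(10)]

# word2vec over the pseudo-sentences yields one embedding vector per entity/property.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)
print(model.wv["Mannheim"][:5])
```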
How are Knowledge Graphs created?
What is inside public Knowledge Graphs?
Addressing typical problems in Knowledge Graphs (errors, incompleteness)
New Knowledge Graphs: WebIsALOD, DBkWik
The original Semantic Web vision foresees describing entities in a way that the meaning can be interpreted both by machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of Semantic Web knowledge graphs - i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) - have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, or the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings - as impressive as they are in terms of quantitative performance - are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.
This presentation shows approaches for knowledge graph construction from Wikipedia and other Wikis that go beyond the "one entity per page" paradigm. We look at CaLiGraph, which extracts entities from categories and listings, as well as DBkWik, which extracts and integrates information from thousands of Wikis.
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems (Heiko Paulheim)
AI is not just about machine learning, it also requires knowledge about the world. In this talk, I give an introduction on knowledge graphs, how they are built at scale, and how they are used in modern AI systems.
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati... (Heiko Paulheim)
Knowledge Graphs are often used as a symbolic representation mechanism for representing knowledge in data intensive applications, both for integrating corporate knowledge as well as for providing general, cross-domain knowledge in public knowledge graphs such as Wikidata. As such, they have been identified as a useful way of injecting background knowledge in data analysis processes. To fully harness the potential of knowledge graphs, latent representations of entities in the graphs, so called knowledge graph embeddings, show superior performance, but sacrifice one central advantage of knowledge graphs, i.e., the explicit symbolic knowledge representations. In this talk, I will shed some light on the usage of knowledge graphs and embeddings in data analysis, and give an outlook on research directions which aim at combining the best of both worlds.
Knowledge graphs are used in various applications and have been widely analyzed. A question that is not very well researched is: what is the price of their production? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 150 (i.e., 1c to 15c per statement). Furthermore, we advocate for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
Data-driven Joint Debugging of the DBpedia Mappings and Ontology (Heiko Paulheim)
DBpedia is a large-scale, cross-domain knowledge graph extracted from Wikipedia. For the extraction, crowd-sourced mappings from Wikipedia infoboxes to the DBpedia ontology are utilized. In this process, different problems may arise: users may create wrong and/or inconsistent mappings, use the ontology in an unforeseen way, or change the ontology without considering all possible consequences. In this paper, we present a data-driven approach to discover problems in mappings as well as in the ontology and its usage in a joint, data-driven process. We show both quantitative and qualitative results about the problems identified, and derive proposals for altering mappings and refactoring the DBpedia ontology.
RDF2vec is a method for creating embedding vectors for entities in knowledge graphs. In this talk, I introduce the basic idea of RDF2vec, as well as the latest extensions and developments, like the use of different walk strategies, the order-aware flavour of RDF2vec, RDF2vec for dynamic knowledge graphs, and more.
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top (Heiko Paulheim)
Large knowledge bases, such as DBpedia, are most often created heuristically due to scalability issues. In the building process, both random as well as systematic errors may occur. In this paper, we focus on finding systematic errors, or anti-patterns, in DBpedia. We show that by aligning the DBpedia ontology to the foundational ontology DOLCE-Zero, and by combining reasoning and clustering of the reasoning results, errors affecting millions of statements can be identified at a minimal workload for the knowledge base designer.
Researcher Pod: Scholarly Communication Using the Decentralized Web (Herbert Van de Sompel)
The presentation provides an overview of the motivation and direction of the Mellon-funded Researcher Pod project that investigates technical aspects of scholarly communication in a decentralized web setting.
Presentation about reference rot given at the Complexity Science Hub in Vienna, November 2021.
Links to web resources frequently break (link rot), and linked content can change at unpredictable rates (content drift). These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information.
This presentation will report on research that assessed the extent of these problems for links to web resources in scholarly literature, by using three vast corpora of publications and a range of public web archives. It will also describe the Robust Link approach that offers a proactive, uniform, and machine-actionable way to combat link rot and content drift. Finally, it will introduce the Robustify web service and API that was devised to generate links that remain functional over time, paying special attention to challenges related to deploying infrastructure that is required to be long lasting.
Revised presentation given at SCOCA in Piketon, Ohio, in September 2013. Includes KnowItNow24x7 as well as information on efficient uses of Google and Wikipedia (updated slides from previous presentations).
Linked data in the German National Library at the OCLC IFLA round table 2013 (Lars G. Svensson)
A presentation about the current state of the linked data activities in the German National Library held at the OCLC Linked Data Round table during the WLIC 2013 in Singapore
Presentation during the 2016 American Library Association (ALA) Annual Conference in Orlando (Florida), given at the ALCTS Program "Linked Data - Globally Connecting Libraries, Archives, and Museums", Sponsor: ALCTS International Relations Committee, Co-Sponsor: Linked Library Data Interest Group
Knowledge graph embeddings are a mechanism that projects each entity in a knowledge graph to a point in a continuous vector space. It is commonly assumed that those approaches project two entities closely to each other if they are similar and/or related. In this talk, I give a closer look at the roles of similarity and relatedness with respect to knowledge graph embeddings, and discuss how the well-known embedding mechanism RDF2vec can be tailored towards focusing on similarity, relatedness, or both.
Fast Approximate A-box Consistency Checking using Machine Learning (Heiko Paulheim)
Ontology reasoning is typically a computationally intensive operation. While soundness and completeness of results is required in some use cases, for many others, a sensible trade-off between computation efforts and correctness of results makes more sense. In this paper, we show that it is possible to approximate a central task in reasoning, i.e., A-box consistency checking, by training a machine learning model which approximates the behavior of a reasoner for a specific ontology. On four different datasets, we show that such learned models consistently achieve an accuracy above 95% at less than 2% of the runtime of a reasoner, using a decision tree with no more than 20 inner nodes. For example, this allows for validating 293M Microdata documents against the schema.org ontology in less than 90 minutes, compared to 18 days required by a state-of-the-art ontology reasoner.
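To make the idea concrete, here is a minimal, self-contained sketch of such a training setup: simple features describing each A-box, labels obtained from a full reasoner on a training sample, and a small decision tree as the approximation. The features, labels, and resulting accuracy are purely illustrative and not taken from the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative setup: each A-box (e.g., one Microdata document) is described by
# which classes/properties it uses; the label is the consistency verdict that a
# full reasoner produced on a training sample.
aboxes = [
    {"uses_schema:Person": 1, "uses_schema:birthDate": 1},
    {"uses_schema:Event": 1, "uses_schema:birthDate": 1},
    {"uses_schema:Person": 1, "uses_schema:startDate": 1},
    {"uses_schema:Event": 1, "uses_schema:startDate": 1},
] * 50
labels = [1, 0, 0, 1] * 50   # 1 = consistent, 0 = inconsistent (from the reasoner)

X = DictVectorizer(sparse=False).fit_transform(aboxes)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

# A small decision tree approximates the reasoner's behaviour for this ontology.
clf = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0).fit(X_train, y_train)
print("approximation accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```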
These slides accompany the first part of a Digital Arts and Humanities sponsored workshop that Vinayak Das Gupta and I gave in Trinity College Dublin on 27 May 2015. The workshop, entitled 'Data-mining the Semantic Web and spatially visualising the results', introduced the participants to the concepts and technologies of Linked Open Data, the Semantic Web, RDF, SPARQL, GeoJSON and Leaflet.js. These slides cover the data-mining of online cultural heritage resources.
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ... (Heiko Paulheim)
In the past years, sophisticated methods for extracting knowledge graphs from Wikipedia, like DBpedia, YAGO, and CaLiGraph, have been developed. In this talk, I revisit some of these methods and examine if and how they can be replaced by prompting a large language model like ChatGPT.
This event extends the reach of the Open Education Conference -- Beyond Content -- taking place in Vancouver 16-18 October, 2012
The Open Education Remixathon will kick off with a round robin to describe each Open Educational Resource and the envisioned enhancements.
See the full description and participate in the conversation in SCoPE: http://scope.bccampus.ca/mod/forum/view.php?id=9009
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
web 2.0, library systems and the library system (lisld)
The Web 2.0 environment is characterized by concentration and diffusion. Library services are not well matched to this environment: they are fragmented and difficult to mobilize in user workflows. This presentation analyzes this situation and suggests some directions.
Catherine Parker (University of Huddersfield) – “The Game of Open Access: mak... (ARLGSW)
Presentation from the 6th CILIP ARLG-SW Discover Academic Research and Training Support Conference (DARTS6). Dartington Hall, Totnes, Thursday 24th – Friday 25th May 2018
Contributing to the global commons: Repositories and Wikimedia (Nick Sheppard)
There is huge potential for universities and their libraries to leverage Wikimedia in order to expose research outputs and collections. Wikimedia comprises sixteen projects in total, including Wikipedia, Wikimedia Commons and Wikidata. At the University of Leeds, the Research Data Management Service have successfully run a project that focuses on linking research data with the Wikimedia suite of tools via a series of ‘editathons’, in order to increase the visibility of research data and enable reuse on Wikipedia and elsewhere. The project - "Manage it locally to share it globally: RDM and Wikimedia Commons" - was the winning submission to a competition launched in May 2018 and sponsored by SPARC Europe, Jisc and the University of Cambridge, called the "Data Management Engagement Award", which aimed to address cultural challenges involved in promoting effective research data practices.
The project has served as a springboard to further explore Wikimedia strategically, both at the University of Leeds and across the White Rose Consortium. For example we are collaborating on a new project looking at Wikipedia citations of research from York, Sheffield and Leeds, and the proportion of these that are open access. The long term goal might be to establish a "Wikimedian in Residence" across the consortium. In this talk, we will present the project's outputs - including a toolkit that will enable other institutions to apply the same methodology. In addition we will explore the potential of Wikidata to link up repositories and other data silos in a manner that enables reuse and increases impact.
Weakly Supervised Learning for Fake News Detection on Twitter (Heiko Paulheim)
The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this noisy, inaccurately labeled dataset, it is possible to detect fake news with an F1 score of up to 0.9.
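A minimal sketch of this weak-supervision setup is shown below; the example tweets, source labels, and classifier choice are invented for illustration and do not reproduce the paper's corpus or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative weak supervision: label tweets by the trustworthiness of their
# source (not by whether the individual tweet is actually fake), then train a
# text classifier on these noisy labels.
tweets = [
    ("Breaking: celebrity X secretly replaced by clone", "untrustworthy_source"),
    ("Miracle cure hidden by doctors, share before deleted!", "untrustworthy_source"),
    ("Government releases quarterly unemployment figures", "trustworthy_source"),
    ("New study published in Nature on coral bleaching", "trustworthy_source"),
] * 25

texts = [t for t, _ in tweets]
weak_labels = [1 if src == "untrustworthy_source" else 0 for _, src in tweets]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, weak_labels)

# At prediction time, the classifier is applied to the actual target:
# classifying individual tweets as fake vs. non-fake.
print(clf.predict(["Scientists confirm the moon is made of cheese"]))
```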
Combining Ontology Matchers via Anomaly Detection (Heiko Paulheim)
In ontology alignment, there is no single best performing matching algorithm for every matching problem. Thus, most modern matching systems combine several base matchers and aggregate their results into a final alignment. This combination is often based on simple voting or averaging, or uses existing matching problems for learning a combination policy in a supervised setting. In this paper, we present the COMMAND matching system, an unsupervised method for combining base matchers, which uses anomaly detection to produce an alignment from the results delivered by several base matchers. The basic idea of our approach is that in a large set of potential mapping candidates, the scarce actual mappings should be visible as anomalies against the majority of non-mappings. The approach is evaluated on different OAEI datasets and shows a competitive performance with state-of-the-art systems.
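The following sketch illustrates the underlying intuition rather than the actual COMMAND system: each candidate correspondence is represented as a vector of base-matcher scores, and an off-the-shelf anomaly detector flags the rare, high-scoring candidates as likely mappings. The data and the choice of IsolationForest are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each candidate correspondence is a vector of confidence scores from several
# base matchers (e.g., string similarity, structural similarity, ...).
# True mappings are scarce and should stand out as anomalies against the bulk
# of non-mappings.
rng = np.random.default_rng(0)
non_mappings = rng.uniform(0.0, 0.4, size=(980, 3))   # mostly low scores
true_mappings = rng.uniform(0.7, 1.0, size=(20, 3))    # scarce, high scores
candidates = np.vstack([non_mappings, true_mappings])

detector = IsolationForest(contamination=0.02, random_state=0).fit(candidates)
is_anomaly = detector.predict(candidates) == -1         # -1 marks anomalies

print("candidates selected as mappings:", is_anomaly.sum())
```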
Gathering Alternative Surface Forms for DBpedia Entities (Heiko Paulheim)
Wikipedia is often used as a source of surface forms, or alternative reference strings for an entity, required for entity linking, disambiguation or coreference resolution tasks. Surface forms have been extracted in a number of works from Wikipedia labels, redirects, disambiguations and anchor texts of internal Wikipedia links, which we complement with anchor texts of external Wikipedia links from the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for the dataset quality evaluation, which reveals the surprisingly low precision of the Wikipedia-based surface forms. We propose filtering approaches that allowed boosting the precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for the subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.
Mining the Web of Linked Data with RapidMiner (Heiko Paulheim)
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
Data Mining with Background Knowledge from the Web - Introducing the RapidMin... (Heiko Paulheim)
Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, which can extend a dataset at hand with additional attributes drawn from the Linked Open Data (LOD) cloud, a large collection of publicly available datasets on various topics. The extension contains operators for linking local data to open data in the LOD cloud, and for augmenting it with additional attributes. In a case study, we show that the prediction error of car fuel consumption can be reduced by 50% by adding additional attributes, e.g., describing the automobile layout and the car body configuration, from Linked Open Data.
Detecting Incorrect Numerical Data in DBpedia (Heiko Paulheim)
DBpedia is a central hub of Linked Open Data (LOD). Being based on crowd-sourced contents and heuristic extraction methods, it is not free of errors. In this paper, we study the application of unsupervised numerical outlier detection methods to DBpedia, using Interquartile Range (IQR), Kernel Density Estimation (KDE), and various dispersion estimators, combined with different semantic grouping methods. Our approach reaches 87% precision, and has led to the identification of 11 systematic errors in the DBpedia extraction framework.
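For readers unfamiliar with IQR-based outlier detection, the following sketch shows the core idea on invented numbers for a single group of values; the paper additionally combines this with KDE, other dispersion estimators, and semantic grouping.

```python
import numpy as np

# Illustrative: population values (in thousands) of entities grouped under the
# same class/property, with one erroneous value from an extraction glitch.
values = np.array([81.3, 83.0, 84.7, 79.9, 82.5, 8470.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("suspicious values:", outliers)   # -> [8470.]
```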
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection (Heiko Paulheim)
Links between datasets are an essential ingredient of Linked Open Data. Since the manual creation of links is expensive at large-scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54, and an area under the ROC curve of up to 0.86.
Extending DBpedia with Wikipedia List Pages (Heiko Paulheim)
Thanks to its wide coverage and general-purpose ontology, DBpedia is a prominent dataset in the Linked Open Data cloud. DBpedia's content is harvested from Wikipedia's infoboxes, based on manually created mappings. In this paper, we explore the use of a promising source of knowledge for extending DBpedia, i.e., Wikipedia's list pages. We discuss how a combination of frequent pattern mining and natural language processing (NLP) methods can be leveraged in order to extend both the DBpedia ontology, as well as the instance information in DBpedia. We provide an illustrative example to show the potential impact of our approach and discuss its main challenges.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
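The decomposition that Levelwise PageRank builds on can be sketched in a few lines. The snippet below only shows the SCC condensation and the topological (levelwise) processing order on a random graph; the per-block rank computation and the dead-end handling described in the report are omitted, and the graph is illustrative.

```python
import networkx as nx

# Condense the input graph into strongly connected components, then process the
# resulting block-graph in topological order, so each block only depends on
# blocks that were already processed.
G = nx.gnp_random_graph(200, 0.03, directed=True, seed=42)

condensed = nx.condensation(G)                 # DAG of strongly connected components
order = list(nx.topological_sort(condensed))   # levelwise / blockwise processing order

block_of = {}
for level, comp in enumerate(order):
    # In a real implementation, PageRank would be computed per block here,
    # reusing the finished ranks of predecessor blocks; we only record the order.
    for v in condensed.nodes[comp]["members"]:
        block_of[v] = level

print("number of SCC blocks:", condensed.number_of_nodes())
```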
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
1. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
Heiko Paulheim
2. A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
3. An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
4. What if…?
• What if we applied the DBpedia EF to every MediaWiki?
• According to WikiApiary, there are thousands...
7. A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
8. DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
9. Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
10. DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
11. Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor → mywiki:actor a owl:Class
• Every infobox key becomes a property:
|role = Harry’s mother → mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
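As a sketch of this raw extraction, the snippet below turns a single parsed infobox into triples with rdflib. The mywiki namespace, entity name, and helper function are illustrative and are not part of the actual DBpedia Extraction Framework code.

```python
from rdflib import Graph, Literal, Namespace, RDF, OWL

# Hypothetical namespace for a single extracted Wiki (one namespace per Wiki).
MYWIKI = Namespace("http://dbkwik.webdatacommons.org/mywiki/")

def extract_infobox(graph, entity_name, template_name, key_values):
    """Raw extraction: the infobox template becomes a class,
    every infobox key becomes a property, values become literals."""
    entity = MYWIKI[entity_name]
    cls = MYWIKI[template_name.replace(" ", "_")]
    graph.add((cls, RDF.type, OWL.Class))         # {{infobox actor -> mywiki:actor a owl:Class
    graph.add((entity, RDF.type, cls))
    for key, value in key_values.items():
        prop = MYWIKI[key]
        graph.add((prop, RDF.type, RDF.Property)) # |role = ... -> mywiki:role a rdf:Property
        graph.add((entity, prop, Literal(value)))
    return graph

g = Graph()
extract_infobox(g, "Lily_Potter", "actor", {"role": "Harry's mother"})
print(g.serialize(format="turtle"))
```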
15. Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score      Internal Linking    Linking to DBpedia
Classes       .979                .898
Properties    .836                .865
Instances     .879                .657
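The doc2vec-based instance matching mentioned on this slide can be illustrated as follows; the page texts, URIs, and hyperparameters are invented, and this is only a sketch of the idea, not the DBkWik matching pipeline.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy corpus: plain text of Wiki pages, tagged by page URI (illustrative data).
pages = {
    "memory-alpha:Kai":    "The Kai is the religious leader of the Bajoran people ...",
    "memory-beta:Kai":     "Kai, title of the Bajoran spiritual leader ...",
    "dbpedia:Kai_(title)": "Kai is a fictional religious title in Star Trek ...",
}
corpus = [TaggedDocument(simple_preprocess(text), [uri]) for uri, text in pages.items()]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Candidate instance matches: pairs whose page vectors are most similar.
for uri in pages:
    print(uri, model.dv.most_similar(uri, topn=1))
```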
16. Gold Standard DBkWik 1.1
• Schema alignment: manual
• Instance alignment: crowd-sourced
– Using 3x3 Wikis from 3 different topics
– Asking crowdworkers to identify similar pages
– Search was allowed and encouraged
17. Gold Standard DBkWik 1.1
• Crowdsourcing results
– High inter-rater agreement (Fleiss’ Kappa: 0.8762)
– Most mappings are trivial, though
• Possible bias in gold standard
– We pre-selected matching Wikis!
18. Results: Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
19. Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
● e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations of actual distributions
• Result: ~100k new instance types
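A minimal sketch of how association rule mining can surface subsumption axioms like Artist(x) → Person(x); the toy type assertions and thresholds are illustrative, not the actual DBkWik data or settings.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy type assertions: one row per instance, one boolean column per class.
types = pd.DataFrame(
    [
        {"Artist": True,  "Person": True,  "Film": False},
        {"Artist": True,  "Person": True,  "Film": False},
        {"Artist": False, "Person": True,  "Film": False},
        {"Artist": False, "Person": False, "Film": True},
    ]
)

itemsets = apriori(types, min_support=0.25, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)

# Rules of the form {Artist} -> {Person} with high confidence suggest
# the subsumption axiom Artist ⊑ Person.
print(rules[["antecedents", "consequents", "support", "confidence"]])
```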
23. DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
24. DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M| → |T| = P * |M|
– R = |T| / |O| → |T| = R * |O|
– → |O| = |M| * P / R
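As a tiny worked example of this estimate (all numbers invented, not the actual DBkWik 1.1 figures):

```python
# Worked example of the overlap estimate |O| = |M| * P / R
# (the numbers are illustrative, not the actual DBkWik 1.1 figures).
M = 450_000   # size of the (incomplete, partly wrong) mapping
P = 0.90      # estimated precision of the mapping
R = 0.80      # estimated recall of the mapping

O = M * P / R # estimated true overlap between DBkWik and DBpedia
print(f"Estimated overlap: {O:,.0f} instances")
```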
25. DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik are not in DBpedia
– 90% of all entities in DBpedia are not in DBkWik
26. Towards Improving Interlinking
• Strategy: ask the experts
– new Knowledge Graph track at OAEI 2018
– seven systems provided results
• Results:
– it is hard to beat the string baseline
– many matching systems rely on explicit, deep ontologies
● but we have just shallow schemas
• Possible reasons:
– the problem is too difficult?
– the gold standard is too trivial?
– the ontology lacks formality
27. Towards Improving Interlinking
• Currently, embedding-based methods are on the rise
– e.g., Azmy et al.: “Matching Entities Across Different Knowledge Graphs with Graph Embeddings”, 2019
– require large-scale training data
28. Towards Improving Interlinking
• Overcoming issues of first gold standard
– include non-trivial matches
– include non-matches
29. Towards Improving Interlinking
• Includes trivial and non-trivial matches
– i.e., task gets more demanding
• Low inter-rater agreement: Fleiss’ Kappa 0.02
30. Towards Improving Interlinking
• Exploiting Wiki Interlinks
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
[Diagram: interlinks between wiki 1 and wiki 2, connecting pages such as Kai, Meressa, and Star Trek]
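A minimal sketch (not the DBkWik code) of harvesting such interwiki and interlanguage links from raw wikitext with regular expressions; the wikitext is the snippet shown above.

```python
import re

wikitext = """
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
"""

# [[xx:Title]] interlanguage links
interlanguage = re.findall(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]", wikitext)
# {{Wikipedia|Target|Label}} style templates pointing to Wikipedia
wikipedia_links = re.findall(r"\{\{Wikipedia\|([^}|]+)(?:\|[^}]*)?\}\}", wikitext)

print(interlanguage)    # [('de', 'Kai'), ('nl', 'Kai'), ('pl', 'Kai')]
print(wikipedia_links)  # ['Bajoran#Kai']
```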
31. Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Pipeline diagram: Source, SimpleWikiParser, AST, LinkExtractor, NifExtractor, NewNifExtractor, Page, Graph, HTML, Destination]
32. Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
33. Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
35. Further Open Challenges
• More detailed profiling
– e.g., do we reduce or increase bias?
• Task-based evaluation
– Does it improve, e.g., recommender systems?
• Fusion policies
– Identify outdated Wikis
36. Contributors
• DBkWik contributors (past, present, and future)
Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch
37. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
Heiko Paulheim