From a bird's-eye view, the DBpedia Extraction Framework takes a MediaWiki dump as input and turns it into a knowledge graph. In this talk, I discuss the creation of the DBkWik knowledge graph by applying the DBpedia Extraction Framework to thousands of Wikis.
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
1. 5/23/19 Heiko Paulheim 1
From Wikipedia to Thousands of Wikis
–
The DBkWik Knowledge Graph
Heiko Paulheim
2. 5/23/19 Heiko Paulheim 2
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
3. 5/23/19 Heiko Paulheim 3
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
4. 5/23/19 Heiko Paulheim 4
What if…?
• What if we applied the DBpedia EF to every MediaWiki?
• According to WikiApiary, there are thousands...
7. 5/23/19 Heiko Paulheim 7
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
8. 5/23/19 Heiko Paulheim 8
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
9. 5/23/19 Heiko Paulheim 9
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
10. 5/23/19 Heiko Paulheim 10
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
11. 5/23/19 Heiko Paulheim 11
Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
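The raw extraction described above can be sketched roughly as follows. This is a minimal illustration, not the DBpedia Extraction Framework's actual code: the function name `infobox_to_triples`, the page name `Lily`, and the regular expression are assumptions, and the parsing ignores nested templates and links containing `|`.

```python
import re

def infobox_to_triples(wikitext, page, ns="mywiki"):
    """Turn the first infobox on a page into raw triples (illustrative only)."""
    # Hypothetical pattern: "{{infobox <name>" followed by "|key = value" parts.
    match = re.search(r"\{\{\s*infobox\s+([^|}\n]+)(.*?)\}\}", wikitext, re.S | re.I)
    if match is None:
        return []
    cls = match.group(1).strip().replace(" ", "_")
    triples = [
        (f"{ns}:{cls}", "rdf:type", "owl:Class"),    # every infobox becomes a class
        (f"{ns}:{page}", "rdf:type", f"{ns}:{cls}"),
    ]
    for part in match.group(2).split("|"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        prop = key.strip().replace(" ", "_")
        # every infobox key becomes a property, with no domain, range, or kind
        triples.append((f"{ns}:{prop}", "rdf:type", "rdf:Property"))
        triples.append((f"{ns}:{page}", f"{ns}:{prop}", value.strip()))
    return triples

example = "{{infobox actor\n|role = Harry's mother\n}}"
triples = infobox_to_triples(example, "Lily")
```

Note that the value stays a plain string, which reflects the missing distinction between object and data properties.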
15. 5/23/19 Heiko Paulheim 15
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score...   Internal Linking   Linking to DBpedia
Classes       .979               .898
Properties    .836               .865
Instances     .879               .657
maybe...
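A string-similarity baseline for schema matching could look like the sketch below. The talk does not say which similarity measure or threshold was used, so `difflib.SequenceMatcher`, the 0.9 threshold, and the function name `match_labels` are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def match_labels(source_labels, target_labels, threshold=0.9):
    """Match each source label to its most similar target label (illustrative)."""
    matches = []
    for s in source_labels:
        best, best_score = None, 0.0
        for t in target_labels:
            # Case-insensitive similarity in [0, 1]
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score > best_score:
                best, best_score = t, score
        # Keep only matches above the (assumed) threshold
        if best is not None and best_score >= threshold:
            matches.append((s, best, best_score))
    return matches

pairs = match_labels(["Actor", "birth place"], ["actor", "birthplace", "location"])
```

A label-only matcher like this needs no class hierarchy, which fits the shallow schemas extracted from the Wikis.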
16. 5/23/19 Heiko Paulheim 16
Gold Standard DBkWik 1.1
• Schema alignment: manual
• Instance alignment: crowd-sourced
– Using 3x3 Wikis from 3 different topics
– Asking crowdworkers to identify similar pages
– Search was allowed and encouraged
17. 5/23/19 Heiko Paulheim 17
Gold Standard DBkWik 1.1
• Crowdsourcing results
– High inter-rater agreement (Fleiss’ Kappa: 0.8762)
– Most mappings are trivial, though
• Possible bias in gold standard
– We pre-selected matching Wikis!
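For readers unfamiliar with the agreement measure, Fleiss' Kappa can be computed as in this sketch. The input layout (one row per item, one count per category) and the toy ratings are assumptions for illustration, not the crowdsourcing data from the talk.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i into category j.

    Assumes every item was rated by the same number of raters (at least 2)
    and that more than one category is actually used.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_items          # mean observed agreement
    P_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Toy data: 3 raters, categories (match, no match)
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.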
18. 5/23/19 Heiko Paulheim 18
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
19. 5/23/19 Heiko Paulheim 19
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
● e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations of actual distributions
• Result: ~100k new instance types
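The idea of inducing class subsumptions with association rules can be sketched as follows. This is a simplified illustration, assuming each fused instance carries a set of types; the thresholds, the function name `mine_subsumptions`, and the toy data are hypothetical and not taken from DBkWik.

```python
from collections import Counter
from itertools import permutations

def mine_subsumptions(instance_types, min_support=2, min_confidence=0.95):
    """Return (A, B, confidence) for rules A(x) -> B(x) above both thresholds."""
    single = Counter()   # how many instances have type A
    pair = Counter()     # how many instances have both A and B
    for types in instance_types.values():
        single.update(types)
        pair.update(permutations(sorted(types), 2))
    axioms = []
    for (a, b), both in pair.items():
        confidence = both / single[a]   # share of A instances that are also B
        if both >= min_support and confidence >= min_confidence:
            axioms.append((a, b, confidence))
    return sorted(axioms)

toy = {
    "e1": {"Artist", "Person"},
    "e2": {"Artist", "Person"},
    "e3": {"Person"},
    "e4": {"Person"},
    "e5": {"Person"},
}
axioms = mine_subsumptions(toy)
```

In the toy data every Artist is also a Person, so only the rule Artist(x) → Person(x) survives. The reverse rule has a confidence of 0.4 and is dropped.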
[Figure labels: Person?, Artist, Person]
23. 5/23/19 Heiko Paulheim 23
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
24. 5/23/19 Heiko Paulheim 24
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and only partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M| → |T| = P * |M|
– R = |T| / |O| → |T| = R * |O|
– Hence P * |M| = R * |O| → |O| = |M| * P / R
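The estimate above amounts to a single line of arithmetic, sketched here. The function name `estimate_overlap` and the example numbers are purely illustrative and are not the values measured for DBkWik.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from |M|, P and R via |O| = |M| * P / R."""
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    true_links = mapping_size * precision   # |T| = P * |M|
    return true_links / recall              # |O| = |T| / R

# Hypothetical numbers: 1,000 found links, 90% correct, covering 60% of all true links
overlap = estimate_overlap(1000, 0.9, 0.6)
```

With these made-up inputs, 900 of the found links are correct, and since they cover 60% of the true overlap, the overlap is estimated at about 1,500.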
25. 5/23/19 Heiko Paulheim 25
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik are not in DBpedia
– 90% of all entities in DBpedia are not in DBkWik
26. 5/23/19 Heiko Paulheim 26
Towards Improving Interlinking
• Strategy: ask the experts
– new Knowledge Graph track at OAEI 2018
– seven systems provided results
• Results:
– it is hard to beat the string baseline
– many matching systems rely on explicit, deep ontologies
● but we have just shallow schemas
• Possible reasons:
– the problem is too difficult?
– the gold standard is too trivial?
– the ontology lacks formality
27. 5/23/19 Heiko Paulheim 27
Towards Improving Interlinking
• Currently, embedding-based methods are on the rise
– e.g., Azmy et al.: “Matching Entities Across Different Knowledge Graphs with Graph Embeddings”, 2019
– require large-scale training data
28. 5/23/19 Heiko Paulheim 28
Towards Improving Interlinking
• Overcoming issues of first gold standard
– include non-trivial matches
– include non-matches
29. 5/23/19 Heiko Paulheim 29
Towards Improving Interlinking
• Includes trivial and non-trivial matches
– i.e., task gets more demanding
• Low inter-rater agreement: Fleiss’ Kappa 0.02
30. 5/23/19 Heiko Paulheim 30
Towards Improving Interlinking
• Exploiting Wiki Interlinks
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
[Figure labels: wiki 1, wiki 2, Kai, Meressa, Star Trek]
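Harvesting such interlinks from wikitext might look like the sketch below. The regular expressions and the function name `extract_interlinks` are assumptions. In particular, reading the first parameter of the `{{Wikipedia|...}}` template as the target page is a guess from the example above, and other two- or three-letter prefixes would also be picked up as language codes.

```python
import re

# Interlanguage links such as [[de:Kai]] (assumed: 2-3 lowercase letters as prefix)
INTERLANG = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")
# Template such as {{Wikipedia|Bajoran#Kai|Kai}} (assumed: first parameter is the target)
WIKIPEDIA_TEMPLATE = re.compile(r"\{\{Wikipedia\|([^}|]+)")

def extract_interlinks(wikitext):
    """Return ({language: title}, wikipedia_target or None) for one page."""
    languages = {lang: title.strip() for lang, title in INTERLANG.findall(wikitext)}
    match = WIKIPEDIA_TEMPLATE.search(wikitext)
    wikipedia_target = match.group(1).strip() if match else None
    return languages, wikipedia_target

page = """== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]"""
languages, wikipedia_target = extract_interlinks(page)
```

Links obtained this way could serve as candidate matches or as labeled examples, without relying on crowdworkers.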
31. 5/23/19 Heiko Paulheim 31
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Figure: extraction pipeline. Components: Source, SimpleWikiParser, LinkExtractor, NifExtractor, NewNifExtractor, Destination. Data passed between them: Page, AST, Graph, HTML]
32. 5/23/19 Heiko Paulheim 32
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
35. 5/23/19 Heiko Paulheim 35
Further Open Challenges
• More detailed profiling
– e.g., do we reduce or increase bias?
• Task-based evaluation
– Does it improve, e.g., recommender systems?
• Fusion policies
– Identify outdated Wikis
36. 5/23/19 Heiko Paulheim 36
Contributors
• DBkWik contributors (past, present, and future)
Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch