5/23/19 Heiko Paulheim 1
From Wikipedia to Thousands of Wikis
–
The DBkWik Knowledge Graph
Heiko Paulheim
5/23/19 Heiko Paulheim 2
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
DBpedia
Extraction
Framework
5/23/19 Heiko Paulheim 3
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
DBpedia
Extraction
Framework
5/23/19 Heiko Paulheim 4
What if…?
• What if we applied the DBpedia EF to every MediaWiki?
• According to WikiApiary, there are thousands...
5/23/19 Heiko Paulheim 5
Why?
• More is better (maybe)
5/23/19 Heiko Paulheim 6
Why?
• Overcoming Wikipedia’s coverage bias
5/23/19 Heiko Paulheim 7
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
5/23/19 Heiko Paulheim 8
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
5/23/19 Heiko Paulheim 9
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
5/23/19 Heiko Paulheim 10
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
5/23/19 Heiko Paulheim 11
Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
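The raw extraction described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the DBpedia Extraction Framework's actual code: the function name `infobox_to_triples`, the subject name `mywiki:Example_Page`, and the naive split on `|` (which would break on nested templates or piped links) are all hypothetical.

```python
import re

def infobox_to_triples(wikitext, wiki_prefix="mywiki", subject="mywiki:Example_Page"):
    """Turn one {{infobox ...}} template into schema and instance triples.

    Hypothetical helper for illustration; real wikitext needs a proper parser.
    """
    match = re.search(r"\{\{\s*infobox\s+([^|}]+?)\s*\|(.*)\}\}", wikitext, re.S | re.I)
    if match is None:
        return []
    # The infobox name becomes a class in the wiki's own namespace.
    cls = f"{wiki_prefix}:{match.group(1).strip().replace(' ', '_')}"
    triples = [(cls, "a", "owl:Class"), (subject, "a", cls)]
    for part in match.group(2).split("|"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        # Every infobox key becomes a plain rdf:Property (no domain, no range).
        prop = f"{wiki_prefix}:{key.strip().replace(' ', '_')}"
        triples.append((prop, "a", "rdf:Property"))
        triples.append((subject, prop, value.strip()))
    return triples

example = "{{infobox actor\n|role = Harry's mother\n}}"
triples = infobox_to_triples(example)
for t in triples:
    print(t)
```

Note that the value stays an untyped string: nothing in the infobox says whether `role` points to an entity or a literal, which is why the resulting schema is so shallow.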
5/23/19 Heiko Paulheim 12
Duplicates
• Collecting Data from a Multitude of Wikis
5/23/19 Heiko Paulheim 13
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption = Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}}
5/23/19 Heiko Paulheim 14
Data Fusion
5/23/19 Heiko Paulheim 15
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score      Internal Linking   Linking to DBpedia
Classes       .979               .898
Properties    .836               .865
Instances     .879               .657
maybe...
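A minimal sketch of string-similarity schema matching, as named on this slide. The slide does not say which similarity measure or threshold was used, so the choice of `difflib.SequenceMatcher`, the label normalization, and the `threshold=0.9` cutoff are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher

def normalize(label):
    """Lower-case a label and treat underscores as spaces."""
    return label.lower().replace("_", " ").strip()

def match_labels(source, target, threshold=0.9):
    """For each source label, keep the most similar target label above the threshold."""
    matches = []
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalize(s), normalize(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            matches.append((s, best, best_score))
    return matches

# Property labels loosely based on the infobox keys shown earlier in the deck.
wiki_a = ["Birth_name", "Origin", "wins"]
wiki_b = ["birth name", "origin", "nominations"]
matches = match_labels(wiki_a, wiki_b)
for source_label, target_label, score in matches:
    print(source_label, "->", target_label, round(score, 3))
```

A baseline like this only finds labels that are spelled almost identically; synonyms such as `birthloc` and `Origin` would be missed, and generic labels such as `next` or `location` can be matched even when they mean different things.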
5/23/19 Heiko Paulheim 16
Gold Standard DBkWik 1.1
• Schema alignment: manual
• Instance alignment: crowd-sourced
– Using 3x3 Wikis from 3 different topics
– Asking crowdworkers to identify similar pages
– Search was allowed and encouraged
5/23/19 Heiko Paulheim 17
Gold Standard DBkWik 1.1
• Crowdsourcing results
– High inter-rater agreement (Fleiss’ Kappa: 0.8762)
– Most mappings are trivial, though
• Possible bias in gold standard
– We pre-selected matching Wikis!
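For readers unfamiliar with the agreement measure quoted here, the following is a small sketch of how Fleiss' kappa is computed from a table of rating counts. The rating data is invented for illustration and is not the DBkWik crowdsourcing data; the function assumes every item was rated by the same number of raters and that not all ratings fall into a single category.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that fell into each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: 4 candidate pairs, 3 raters, categories (match, no match).
toy_ratings = [[3, 0], [3, 0], [0, 3], [2, 1]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

Values near 1 indicate near-perfect agreement, and values near 0 indicate agreement no better than chance.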
5/23/19 Heiko Paulheim 18
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
5/23/19 Heiko Paulheim 19
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
● e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations
of actual distributions
• Result:
– ~100k new instance types
[Figure: an instance of uncertain type ("Person?") with the class labels Artist and Person]
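The subsumption part of this slide can be illustrated with a toy association rule miner. This is a sketch under assumptions, not the DBkWik implementation: the function name, the `min_support` and `min_confidence` thresholds, and the instance data are hypothetical. A rule A(x) → B(x) is kept when enough instances carry both types and almost every instance of A is also an instance of B.

```python
from collections import Counter
from itertools import permutations

def mine_subclass_rules(instance_types, min_support=2, min_confidence=0.9):
    """Mine rules of the form sub(x) -> sup(x) from co-occurring instance types."""
    type_count = Counter()
    pair_count = Counter()
    for types in instance_types.values():
        type_count.update(types)
        # Count every ordered pair of distinct types seen on the same instance.
        pair_count.update(permutations(sorted(types), 2))
    rules = []
    for (sub, sup), both in pair_count.items():
        confidence = both / type_count[sub]
        if both >= min_support and confidence >= min_confidence:
            rules.append((sub, sup, confidence))
    return rules

# Hypothetical typed instances.
toy_types = {
    "e1": {"Artist", "Person"},
    "e2": {"Artist", "Person"},
    "e3": {"Person"},
    "e4": {"Person"},
}
rules = mine_subclass_rules(toy_types)
print(rules)
```

On this toy data only Artist → Person survives, because every Artist is a Person but only half of the Persons are Artists. Two classes that always co-occur would produce rules in both directions, which a real pipeline would have to treat as equivalence rather than subsumption.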
5/23/19 Heiko Paulheim 20
Big Picture
[Figure: DBkWik pipeline diagram. Labels: Dump Downloader; MediaWiki Dumps; DBpedia Extraction Framework; Extracted RDF; Interlinking (Instance Matcher, Schema Matcher); Internal Linking (Instance Matcher, Schema Matcher); Fusion (Ontology, Knowledge Graph, Instance Matcher, Domain/Range, Type, SDType Light, Subclass, Materialization); Consolidated Knowledge Graph; DBkWik Linked Data Endpoint]
5/23/19 Heiko Paulheim 21
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
                  Raw            Final
Instances         14,212,535     11,163,719
Typed instances   1,880,189      1,372,971
Triples           107,833,322    91,526,001
Avg. indegree     0.624          0.703
Avg. outdegree    7.506          8.169
Classes           71,580         12,029
Properties        506,487        128,566
5/23/19 Heiko Paulheim 22
DBkWik 1.1
• Fused graphs from 15k Wikis
http://dbkwik.webdatacommons.org/
5/23/19 Heiko Paulheim 23
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
5/23/19 Heiko Paulheim 24
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown),
T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M| → |T| = P * |M|
– R = |T| / |O| → |T| = R * |O|
• Equating both expressions for |T|:
– P * |M| = R * |O| → |O| = |M| * P / R
[Figure: overlap of DBkWik and DBpedia]
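The estimate |O| = |M| * P / R is simple enough to check with a worked example. The numbers below are hypothetical, since the slide does not give |M|, P, or R; the function name `estimate_overlap` is likewise only illustrative.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from an imperfect mapping M.

    |T| = P * |M| (correct links in M) and |T| = R * |O|, hence |O| = |M| * P / R.
    """
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    return mapping_size * precision / recall

# Hypothetical mapping: 1,000 links found, 90% of them correct,
# covering 60% of all links that actually exist.
estimated = estimate_overlap(mapping_size=1000, precision=0.9, recall=0.6)
print(round(estimated))
```

In this example 900 of the 1,000 links are correct, and those 900 are 60% of the true overlap, so the overlap is estimated at about 1,500. The estimate is only as good as the measured precision and recall, which themselves depend on the gold standard.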
5/23/19 Heiko Paulheim 25
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik
are not in DBpedia
– 90% of all entities in DBpedia
are not in DBkWik
[Figure: overlap of DBkWik and DBpedia]
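The 95% figure can be roughly reproduced from numbers given elsewhere in the deck. This sketch assumes that "entities in DBkWik" corresponds to the final instance count reported for DBkWik 1.1 (11,163,719) and takes the overlap as exactly 500,000; both are simplifications.

```python
def share_not_covered(total_entities, overlap):
    """Fraction of entities in one graph that have no counterpart in the other."""
    return 1 - overlap / total_entities

dbkwik_final_instances = 11_163_719  # "Final" instance count from the DBkWik 1.1 slide
estimated_overlap = 500_000          # "~500k instances", treated as exact here

share = share_not_covered(dbkwik_final_instances, estimated_overlap)
print(f"{share:.1%}")  # prints 95.5%
```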
5/23/19 Heiko Paulheim 26
Towards Improving Interlinking
• Strategy: ask the experts
– new Knowledge Graph track at OAEI 2018
– seven systems provided results
• Results:
– it is hard to beat the string baseline
– many matching systems rely
on explicit, deep ontologies
● but we have just shallow schemas
• Possible reasons:
– the problem is too difficult?
– the gold standard is too trivial?
– the ontology lacks formality
5/23/19 Heiko Paulheim 27
Towards Improving Interlinking
• Currently, embedding-based methods are on the rise
– e.g., Azmy et al.: “Matching Entities Across Different Knowledge
Graphs with Graph Embeddings”, 2019
– require large-scale training data
5/23/19 Heiko Paulheim 28
Towards Improving Interlinking
• Overcoming issues of first gold standard
– include non-trivial matches
– include non-matches
5/23/19 Heiko Paulheim 29
Towards Improving Interlinking
• Includes trivial and non-trivial matches
– i.e., task gets more demanding
• Low inter-rater agreement: Fleiss’ Kappa 0.02
5/23/19 Heiko Paulheim 30
Towards Improving Interlinking
• Exploiting Wiki Interlinks
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
[Figure labels: wiki 1, wiki 2; Kai, Meressa; Star Trek]
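Links like the ones in the wikitext snippet on this slide can be harvested with simple pattern matching. The sketch below is an assumption-laden illustration, not the DBkWik extractor: the function names are hypothetical, the two- or three-letter prefix test may also catch non-language prefixes, and reading the first parameter of `{{Wikipedia|...}}` as the Wikipedia page title is a guess based on the example.

```python
import re

# [[de:Kai]] style links: a short lower-case prefix followed by a page title.
INTERLANGUAGE_LINK = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")
# {{Wikipedia|Bajoran#Kai|Kai}}: first parameter assumed to be the Wikipedia target.
WIKIPEDIA_TEMPLATE = re.compile(r"\{\{\s*Wikipedia\s*\|([^|}]+)")

def extract_interlanguage_links(wikitext):
    """Return (language_prefix, page_title) pairs found in the wikitext."""
    return [(lang, title.strip()) for lang, title in INTERLANGUAGE_LINK.findall(wikitext)]

def extract_wikipedia_targets(wikitext):
    """Return the first parameter of every {{Wikipedia|...}} template."""
    return [target.strip() for target in WIKIPEDIA_TEMPLATE.findall(wikitext)]

snippet = """== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
"""
language_links = extract_interlanguage_links(snippet)
wikipedia_targets = extract_wikipedia_targets(snippet)
print(language_links)
print(wikipedia_targets)
```

Two pages in different wikis that point to the same target through such links are candidate matches, which could provide match evidence without manual labelling.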
5/23/19 Heiko Paulheim 31
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Figure: extraction pipeline diagram. Labels: Source, SimpleWikiParser, LinkExtractor, Page, NifExtractor, NewNif Extractor, AST, Destination, Graph, HTML]
5/23/19 Heiko Paulheim 32
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
5/23/19 Heiko Paulheim 34
Towards DBkWik 1.2
5/23/19 Heiko Paulheim 35
Further Open Challenges
• More detailed profiling
– e.g., do we reduce or increase bias?
• Task-based evaluation
– Does it improve, e.g., recommender systems?
• Fusion policies
– Identify outdated Wikis
5/23/19 Heiko Paulheim 36
Contributors
• DBkWik contributors (past, present, and future)
Sven Hertling
Alexandra Hofmann
Samresh Perchani
Jan Portisch
5/23/19 Heiko Paulheim 37
From Wikipedia to Thousands of Wikis
–
The DBkWik Knowledge Graph
Heiko Paulheim
