From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph

From a bird's-eye view, the DBpedia Extraction Framework takes a MediaWiki dump as input and turns it into a knowledge graph. In this talk, I discuss the creation of the DBkWik knowledge graph by applying the DBpedia Extraction Framework to thousands of Wikis.

Published in: Data & Analytics

  1. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph (Heiko Paulheim, 5/23/19)
  2. A Bird’s Eye View on DBpedia EF
     • DBpedia Extraction Framework
     • Input:
       – A Wikipedia dump (+ mappings)
     • Output:
       – DBpedia
  3. An Even Higher Bird’s Eye View on DBpedia EF
     • DBpedia Extraction Framework
     • Input:
       – A MediaWiki dump (+ mappings)
     • Output:
       – A knowledge graph
  4. What if…?
     • What if we applied the DBpedia EF to every MediaWiki?
     • According to WikiApiary, there are thousands...
  5. Why?
     • More is better (maybe)
  6. Why?
     • Overcoming Wikipedia’s coverage bias
  7. A Brief History of DBkWik
     • Started as a student project in 2017
     • Task: run DBpedia EF on a large Wiki Farm
       – ...and see what happens
  8. DBkWik vs. DBpedia
     • Challenges
       – Getting dumps: only a fraction of Fandom Wikis have dumps
       – Downloadable from Fandom: 12,840 dumps
       – Tried: auto-requesting dumps
  9. Obtaining Dumps
     • We had to change our strategy: WikiTeam software
       – Produces dumps by crawling Wikis
       – Fandom has not blocked us so far :-)
       – Current collection: 307,466 Wikis → will go into the DBkWik 1.2 release
  10. DBkWik vs. DBpedia
      • Mappings do not exist
        – no central ontology
        – i.e., only raw extraction possible
      • Duplicates exist
        – origin: pages about the same entity in different Wikis
        – unlike Wikipedia: often not explicitly linked
      • Different configurations of MediaWiki
  11. Absence of Mappings and Ontology
      • Every infobox becomes a class:
        {{infobox actor → mywiki:actor a owl:Class
      • Every infobox key becomes a property:
        |role = Harry’s mother → mywiki:role a rdf:Property
      • The resulting ontology is very shallow
        – No class hierarchy
        – No distinction of object and data properties
        – No domains and ranges
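
The raw mapping on slide 11 (infobox name becomes a class, each key becomes a property) can be illustrated with a small Python sketch. This is not the DBpedia Extraction Framework's code; the function name `infobox_to_triples`, the regular expressions, and the tuple representation of triples are assumptions made for illustration, and nested templates are ignored.

```python
import re

def infobox_to_triples(wikitext, wiki_prefix="mywiki"):
    """Derive shallow schema triples from one infobox (illustrative only)."""
    # The word(s) after "infobox" name the class, e.g. "{{infobox actor".
    head = re.search(r"\{\{\s*infobox\s+([^|}\n]+)", wikitext, flags=re.IGNORECASE)
    if head is None:
        return []
    class_name = head.group(1).strip().replace(" ", "_")
    triples = [(f"{wiki_prefix}:{class_name}", "a", "owl:Class")]
    # Every "|key =" becomes an untyped rdf:Property (no domain, no range).
    for key in re.findall(r"\|\s*([^=|{}\n]+?)\s*=", wikitext):
        triples.append((f"{wiki_prefix}:{key.replace(' ', '_')}", "a", "rdf:Property"))
    return triples

# Example taken from the slide.
example = "{{infobox actor\n|role = Harry's mother\n}}"
triples = infobox_to_triples(example)
for triple in triples:
    print(*triple)
```

Because nothing beyond the template name and its keys is used, the output has no class hierarchy and no domains or ranges, which is the shallowness the slide points out.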
  12. Duplicates
      • Collecting Data from a Multitude of Wikis
  13. Representational Variety
      • No conventions across Wikis (besides using MediaWiki syntax)
        {{Person
        |name = Trent Reznor
        |image = TrentReznor.jpg
        |caption = Reznor at the [[83rd Academy Awards]]
        |nominations = 1
        |wins = 1
        |role = Composer
        |birthdate = May 17, 1965
        |birthloc = Mercer, Pennsylvania, USA}}

        {{Infobox musician
        | Name = Trent Reznor
        | Birth_name = Michael Trent Reznor
        | Born = May 17, [[1965]] (age 53)
        | Origin = [[Mercer]], [[Pennsylvania]], [[United States]]
        ... }}

        {{Infobox cast
        |Name=Trent Reznor
        |Image=
        |ImageCaption=
        |character=
        |crew=
        |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States
        ... }}
  14. Data Fusion
  15. Naive Data Fusion and Linking to DBpedia
      • String similarity for schema matching (classes/properties)
      • doc2vec similarity on original pages for instance matching
      • Results
        – Classes and properties work OK
        – Instances are trickier
        – Internal linking seems easier (maybe...)

      F1 score      Internal Linking   Linking to DBpedia
      Classes       .979               .898
      Properties    .836               .865
      Instances     .879               .657
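
As a rough illustration of string-similarity schema matching (slide 15), the sketch below pairs property labels from two Wikis using Python's standard `difflib.SequenceMatcher`. The slide does not say which similarity measure or threshold DBkWik uses, so the measure, the `normalize` step, the function `match_labels`, and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher

def normalize(label):
    """Lower-case a label and treat underscores as spaces."""
    return label.lower().replace("_", " ").strip()

def match_labels(source_labels, target_labels, threshold=0.9):
    """For each source label, keep the most similar target label above threshold."""
    matches = []
    for source in source_labels:
        best, best_score = None, 0.0
        for target in target_labels:
            score = SequenceMatcher(None, normalize(source), normalize(target)).ratio()
            if score > best_score:
                best, best_score = target, score
        if best is not None and best_score >= threshold:
            matches.append((source, best, best_score))
    return matches

# Labels borrowed from the infobox examples on slide 13.
matches = match_labels(["Birth_name", "role", "wins"], ["birth name", "Role", "location"])
for source, target, score in matches:
    print(f"{source} -> {target} ({score:.2f})")
```

A purely lexical matcher of this kind cannot tell apart homonymous keys from unrelated Wikis, which is one way overloaded fusion (false positives) can arise.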
  16. Gold Standard DBkWik 1.1
      • Schema alignment: manual
      • Instance alignment: crowd-sourced
        – Using 3x3 Wikis from 3 different topics
        – Asking crowdworkers to identify similar pages
        – Search was allowed and encouraged
  17. Gold Standard DBkWik 1.1
      • Crowdsourcing results
        – High inter-rater agreement (Fleiss’ Kappa: 0.8762)
        – Most mappings are trivial, though
      • Possible bias in gold standard
        – We pre-selected matching Wikis!
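
For readers unfamiliar with the agreement measure quoted on slide 17, here is a minimal sketch of the standard Fleiss' Kappa computation. The rating matrix is toy data invented for the example and has nothing to do with the DBkWik crowdsourcing results; the sketch assumes every item was rated by the same number of raters.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i into category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumed identical for every item
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_categories)]
    # Agreement among raters on each single item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 4 page pairs, 3 raters, categories (match, no match).
toy_ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 4))
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.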
  18. Results Data Fusion
      • Uneven distribution
        – e.g., character appears 5k times
      • Currently: no multi-linguality
        – e.g., Main Page, Hauptseite
      • Probably overloaded fusion (false positives)
        – e.g., next, location
  19. Light-weight Schema Induction
      • Class hierarchy and domain/range induction
        – Using association rule mining
          ● e.g., Artist(x) → Person(x)
        – 5k class subsumption axioms
        – 59k domain restrictions
        – 114k range restrictions
      • Instance typing
        – With a light-weight version of SDType
        – Using the learned ranges as approximations of actual distributions
      • Result: ~100k new instance types
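
The rule Artist(x) → Person(x) on slide 19 can be read as an association rule over instance types. The sketch below mines such rules from co-occurring types with support and confidence thresholds; it is an assumption-laden toy, not the DBkWik implementation. The function name, the thresholds, and the example instances (which are assumed to carry several types each, for instance after fusion) are hypothetical.

```python
from collections import Counter
from itertools import permutations

def mine_subclass_rules(instance_types, min_support=2, min_confidence=0.95):
    """Return (sub, super, confidence) for rules sub(x) -> super(x)."""
    single = Counter()   # how many instances carry each type
    pair = Counter()     # how many instances carry both types of an ordered pair
    for types in instance_types.values():
        single.update(types)
        pair.update(permutations(sorted(types), 2))
    rules = []
    for (sub, sup), both in pair.items():
        confidence = both / single[sub]
        if both >= min_support and confidence >= min_confidence:
            rules.append((sub, sup, confidence))
    return sorted(rules)

# Hypothetical typed instances.
toy_types = {
    "e1": {"Artist", "Person"},
    "e2": {"Artist", "Person"},
    "e3": {"Person"},
    "e4": {"Artist", "Person"},
    "e5": {"Location"},
}
rules = mine_subclass_rules(toy_types)
print(rules)
```

Here every Artist is also a Person, so the rule Artist → Person passes, while the reverse direction reaches only 75% confidence and is dropped.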
  20. Big Picture
      [Pipeline diagram; labels: Dump Downloader, MediaWiki Dumps, DBpedia Extraction Framework, Extracted RDF, Interlinking (Instance Matcher, Schema Matcher), Internal Linking (Instance Matcher, Schema Matcher), Knowledge Graph Fusion, Instance Matcher, Domain/Range, Type, SDType Light, Subclass, Materialization, Ontology, Consolidated Knowledge Graph, DBkWik Linked Data Endpoint]
  21. DBkWik 1.1
      • Source: ~15k Wiki dumps from Fandom
        – 52.4GB of data (roughly the size of the English Wikipedia)

                        Raw            Final
      Instances         14,212,535     11,163,719
      Typed instances   1,880,189      1,372,971
      Triples           107,833,322    91,526,001
      Avg. indegree     0.624          0.703
      Avg. outdegree    7.506          8.169
      Classes           71,580         12,029
      Properties        506,487        128,566
  22. DBkWik 1.1
      • Fused graphs from 15k Wikis
      http://dbkwik.webdatacommons.org/
  23. DBkWik 1.1 vs. other Knowledge Graphs
      • Caveat:
        – Minus non-recognized duplicates!
  24. DBkWik 1.1 vs. DBpedia
      • How complementary are DBkWik and DBpedia?
      • Challenge:
        – We only have an incomplete and partly correct mapping M
        – But: we know its precision P and recall R
      • Trick (see KI paper 2017):
        – O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
      • By definition:
        – P = |T| / |M| → |T| = P * |M|
        – R = |T| / |O| → |T| = R * |O| → |O| = |M| * P / R
  25. DBkWik 1.1 vs. DBpedia
      • How complementary are DBkWik and DBpedia?
        – |O| = |M| * P / R
        – Overlap: ~500k instances
      • In other words:
        – 95% of all entities in DBkWik are not in DBpedia
        – 90% of all entities in DBpedia are not in DBkWik
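
The overlap estimate from slides 24 and 25 is a one-line computation once |M|, P, and R are known. The sketch below applies |O| = |M| * P / R; the function name and the input numbers are invented for illustration and are not the values behind the ~500k figure in the talk.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from a noisy mapping M.

    |T| = P * |M| (true links in M) and |T| = R * |O|, hence |O| = |M| * P / R.
    """
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    return mapping_size * precision / recall

# Hypothetical inputs: 1,000 links found, 80% of them correct,
# and only half of all true links recovered.
overlap = estimate_overlap(mapping_size=1000, precision=0.8, recall=0.5)
print(round(overlap))
```

With these toy inputs, 800 of the found links are correct, and since they represent half of the true overlap, the estimate is about 1,600 shared entities.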
  26. Towards Improving Interlinking
      • Strategy: ask the experts
        – new Knowledge Graph track at OAEI 2018
        – seven systems provided results
      • Results:
        – it is hard to beat the string baseline
        – many matching systems rely on explicit, deep ontologies
          ● but we have just shallow schemas
      • Possible reasons:
        – the problem is too difficult?
        – the gold standard is too trivial?
        – the ontology lacks formality
  27. Towards Improving Interlinking
      • Currently, embedding-based methods are on the rise
        – e.g., Azmy et al.: “Matching Entities Across Different Knowledge Graphs with Graph Embeddings”, 2019
        – require large-scale training data
  28. Towards Improving Interlinking
      • Overcoming issues of first gold standard
        – include non-trivial matches
        – include non-matches
  29. Towards Improving Interlinking
      • Includes trivial and non-trivial matches
        – i.e., task gets more demanding
      • Low inter-rater agreement: Fleiss’ Kappa 0.02
  30. Towards Improving Interlinking
      • Exploiting Wiki Interlinks
        == External links ==
        * {{mbeta}}
        * {{Wikipedia|Bajoran#Kai|Kai}}
        [[de:Kai]] [[nl:Kai]] [[pl:Kai]]
      [Figure labels: wiki 1, wiki 2, Kai, Meressa, Star Trek]
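
One way to exploit the interlinks shown on slide 30 is to pull them out of the page source with regular expressions. The sketch below is a rough heuristic, not the DBkWik extractor: the patterns, the function name, and the reading of `{{Wikipedia|target|label}}` as "target page, then display label" are assumptions, and the two- or three-letter prefix test can also match non-language interwiki prefixes.

```python
import re

# [[de:Kai]] style interlanguage links: short lowercase prefix, then a title.
INTERLANGUAGE = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")
# {{Wikipedia|Target|Label}}: first parameter assumed to be the target page.
WIKIPEDIA_TEMPLATE = re.compile(r"\{\{Wikipedia\|([^}|]+)(?:\|([^}]*))?\}\}")

def extract_interlinks(wikitext):
    """Return (kind, target) pairs for interlanguage and Wikipedia links."""
    links = [("lang:" + lang, title) for lang, title in INTERLANGUAGE.findall(wikitext)]
    links += [("wikipedia", m.group(1)) for m in WIKIPEDIA_TEMPLATE.finditer(wikitext)]
    return links

# Wikitext from the slide.
snippet = (
    "== External links ==\n"
    "* {{mbeta}}\n"
    "* {{Wikipedia|Bajoran#Kai|Kai}}\n"
    "[[de:Kai]] [[nl:Kai]] [[pl:Kai]]"
)
interlinks = extract_interlinks(snippet)
for kind, target in interlinks:
    print(kind, target)
```

Two pages in different Wikis that point to the same external target become candidates for a match without any string or embedding comparison.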
  31. Towards DBkWik 1.2
      • Current crawl: 307,466 Wikis
      • Extraction: more robust for non-infobox templates
        – e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
      • Robust abstract extraction
        – using SWEBLE parser
        – no local MediaWiki instance
      • Better matching
      • New gold standard
      [Figure labels: Source, Simple WikiParser, LinkExtractor, Page, NifExtractor, New Nif Extractor, AST, Destination, Graph, HTML]
  32. Towards DBkWik 1.2
      • What to expect?
        – data from 307,466 wikis
        – 38,985,266 articles
  34. Towards DBkWik 1.2
  35. Further Open Challenges
      • More detailed profiling
        – e.g., do we reduce or increase bias?
      • Task-based evaluation
        – Does it improve, e.g., recommender systems?
      • Fusion policies
        – Identify outdated Wikis
  36. Contributors
      • DBkWik contributors (past, present, and future): Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch
  37. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph (Heiko Paulheim)
