12/10/2019 Heiko Paulheim 1
Beyond DBpedia and YAGO
–
The New Kids
on the Knowledge Graph Block
Heiko Paulheim
The New Kids on the (Knowledge Graph) Block
Subjective age: measured by the fraction of the audience that understands a reference to your young days’ pop culture...
Knowledge Graphs: Out of the Dark
Google’s
Announcement
DBpedia
YAGO
ResearchCyc
Wikidata
Freebase
NELL
A Brief History of Knowledge Graphs
• The 1980s: Cyc
– Cyc: Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
A Brief History of Knowledge Graphs
• The 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
Getting the Most out of Wikipedia
• Study for KG-based
Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n
Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
Combining the Best of Three Worlds
• DBpedia: detailed instance / relation extraction
• YAGO: detailed classes
• Cyc: rich axiomatization
• Goal: get the best of all those worlds
[Figure labels: public, private]
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic Web 8:3 (2017), pp. 489-508
Towards CaLiGraph
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
:hometown :United_States ;
:genre :Industrial .
Cat2Ax: Axiomatizing Wikipedia Categories
Albums → dbo:Album
  Albums by genre
    Rock albums → dbo:genre.{dbr:Rock_Music}
    Pop albums
    Reggae albums
    ...
  Albums by artist
    Nine Inch Nails albums → dbo:artist.{dbr:Nine_Inch_Nails}
    The Doors albums
    The Beatles albums
    ...
Cat2Ax: Axiomatizing Wikipedia Categories
Rock albums
→ dbo:genre.{dbr:Rock_Music} ?
→ dbo:artist.{dbr:Rock_(Rapper)} ?
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• e.g., what share of instances have dbo:genre.{dbr:Rock_Music}?
– Lexical score: likelihood of the term as a surface form of the object
• e.g., how often is "Rock" used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories to share similar patterns?
• e.g., are there sibling categories with a high score for dbo:genre?
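The three signals above can be illustrated with a small sketch. This is not the Cat2Ax implementation: the toy facts, the surface-form counts, and the choice to multiply the three scores are assumptions made purely for illustration.

```python
from collections import Counter

def frequency(members, facts, prop, obj):
    """Share of category members that carry the fact (prop, obj)."""
    if not members:
        return 0.0
    hits = sum(1 for m in members if (prop, obj) in facts.get(m, set()))
    return hits / len(members)

def lexical_score(surface_forms, term, obj):
    """Share of uses of `term` that refer to `obj`."""
    counts = surface_forms.get(term, Counter())
    total = sum(counts.values())
    return counts[obj] / total if total else 0.0

def sibling_score(siblings, facts, prop):
    """Mean, over sibling categories, of the best frequency any object of `prop` reaches."""
    if not siblings:
        return 0.0
    best = []
    for members in siblings:
        objects = {o for m in members for (p, o) in facts.get(m, set()) if p == prop}
        best.append(max((frequency(members, facts, prop, o) for o in objects), default=0.0))
    return sum(best) / len(best)

# Hypothetical toy data: album names and counts are made up.
facts = {
    "Album_A": {("dbo:genre", "dbr:Rock_Music")},
    "Album_B": {("dbo:genre", "dbr:Rock_Music")},
    "Album_C": {("dbo:genre", "dbr:Rock_Music")},
    "Album_D": set(),
    "Album_E": {("dbo:genre", "dbr:Pop_Music")},
    "Album_F": {("dbo:genre", "dbr:Reggae")},
}
rock_albums = ["Album_A", "Album_B", "Album_C", "Album_D"]
siblings = [["Album_E"], ["Album_F"]]  # e.g., "Pop albums", "Reggae albums"
surface_forms = {"Rock": Counter({"dbr:Rock_Music": 90, "dbr:Rock_(Rapper)": 10})}

def pattern_score(prop, obj):
    # Assumption: the three signals are combined by multiplication.
    return (frequency(rock_albums, facts, prop, obj)
            * lexical_score(surface_forms, "Rock", obj)
            * sibling_score(siblings, facts, prop))

genre_score = pattern_score("dbo:genre", "dbr:Rock_Music")
artist_score = pattern_score("dbo:artist", "dbr:Rock_(Rapper)")
```

With this toy data the genre reading of "Rock albums" scores clearly higher than the artist reading, which is the disambiguation the scores are meant to provide.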
Cat2Ax: Axiomatizing Wikipedia Categories
• Results
Improving Instance Coverage:
Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages
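A minimal sketch of the idea, assuming the wikitext of a list page and the set of existing page titles are already available: every link target that is not an existing page is a candidate new instance. The function name, the regular expression, and the two made-up band names are illustrative assumptions, not part of CaLiGraph.

```python
import re

# Matches the target of a wiki link such as [[Target]] or [[Target|label]].
LINK = re.compile(r"\[\[([^\]|#]+)")

def red_links(list_wikitext, existing_pages):
    """Return link targets on a list page that have no page of their own."""
    seen = set()
    result = []
    for target in (t.strip() for t in LINK.findall(list_wikitext)):
        if target not in existing_pages and target not in seen:
            seen.add(target)
            result.append(target)
    return result

# Hypothetical list page: the two "Example" bands are made-up names.
wikitext = "* [[Nightwish]]\n* [[Example Band|Example]]\n* [[Another Example Band]]"
candidates = red_links(wikitext, existing_pages={"Nightwish"})
```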
CaLiGraph Example
Category: Musical Groups established in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
CaLiGraph Statistics
• Class hierarchy from categories and list pages
– 750k classes
• Axioms from categories and list pages
– 200k definitions
• Instances from list pages
– 870k new instances
CaLiGraph Glitches
The Future of CaLiGraph
• Going beyond red links
• Going beyond explicit lists
Knowledge Graph Creation Beyond Wikipedia
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia dump (+ mappings)
• Output:
– DBpedia
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki dump (+ mappings)
• Output:
– A knowledge graph
What if…?
• What if we went from Wikipedia to every MediaWiki?
• According to WikiApiary, there are thousands...
Why?
• More is better (maybe)
Why?
• Overcoming Wikipedia’s coverage bias
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
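The raw extraction described above can be sketched in a few lines. This is an illustrative stand-in, not the DBpedia Extraction Framework: the function name, the regular expressions, and the tuple-based triple format are assumptions, and the `mywiki` prefix is taken from the slide.

```python
import re

def extract_shallow_schema(wikitext, prefix="mywiki"):
    """Derive one class per infobox and one property per infobox key."""
    match = re.search(r"\{\{\s*infobox\s+([^|}\n]+)", wikitext, re.IGNORECASE)
    if not match:
        return []
    cls = match.group(1).strip().replace(" ", "_")
    triples = [(f"{prefix}:{cls}", "rdf:type", "owl:Class")]
    # Each "|key = value" line contributes an untyped rdf:Property.
    for key in re.findall(r"^\s*\|\s*([^=|\n]+?)\s*=", wikitext, re.MULTILINE):
        triples.append((f"{prefix}:{key.replace(' ', '_')}", "rdf:type", "rdf:Property"))
    return triples

example = "{{infobox actor\n|role = Harry's mother\n}}"
schema = extract_shallow_schema(example)
```

Nothing in this output says whether `mywiki:role` points to an entity or a literal, or which classes it applies to, which is exactly the shallowness listed above.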
Duplicates
• Collecting Data from a Multitude of Wikis
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption = Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}}
Data Fusion
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score | Internal linking | Linking to DBpedia
Classes | .979 | .898
Properties | .836 | .865
Instances | .879 | .657
maybe...
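As a rough illustration of string-similarity schema matching, the sketch below normalizes property names and compares them with `difflib.SequenceMatcher` from the Python standard library. The slides do not say which similarity measure or threshold was used, so both are assumptions here, and `Birth_date` is a hypothetical key added for the example.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if none reaches the threshold."""
    scored = [
        (SequenceMatcher(None, normalize(name), normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None

# Property keys from two differently configured wikis (partly hypothetical).
keys_wiki_b = ["Name", "Image", "Birth_date", "Origin"]
matches = {k: best_match(k, keys_wiki_b) for k in ["name", "image", "birthdate", "role"]}
```

Normalization alone already aligns keys that differ only in case or underscores, while `role` finds no counterpart and stays unmatched.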
Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• In 2019, some tools beat the baseline
– albeit by a small margin only
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
• e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations of actual distributions
• Result: ~100k new instance types
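A minimal sketch of how association rule mining can yield subsumption candidates such as Artist(x) → Person(x): count how often two types co-occur on the same instance and keep rules with enough support and confidence. The thresholds, the function name, and the toy instances are assumptions for illustration, not the settings used for DBkWik.

```python
from itertools import permutations

def mine_subclass_rules(types_by_instance, min_support=2, min_confidence=0.9):
    """Return (A, B, confidence) for rules A(x) -> B(x) that pass both thresholds."""
    type_counts = {}
    pair_counts = {}
    for types in types_by_instance.values():
        for t in types:
            type_counts[t] = type_counts.get(t, 0) + 1
        for a, b in permutations(types, 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    rules = []
    for (a, b), both in pair_counts.items():
        confidence = both / type_counts[a]  # P(B | A)
        if both >= min_support and confidence >= min_confidence:
            rules.append((a, b, confidence))
    return sorted(rules)

# Hypothetical typed instances.
types_by_instance = {
    "x1": {"Artist", "Person"},
    "x2": {"Artist", "Person"},
    "x3": {"Person"},
    "x4": {"Artist", "Person"},
    "x5": {"Band"},
}
rules = mine_subclass_rules(types_by_instance)
```

Here every Artist is also a Person, so Artist → Person is kept, while the reverse rule only reaches a confidence of 0.75 and is dropped.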
Big Picture
[Figure: DBkWik pipeline. Components: Dump Downloader, DBpedia Extraction Framework, Interlinking (Instance Matcher, Schema Matcher), Internal Linking (Instance Matcher, Schema Matcher), Knowledge Graph Fusion, Domain/Range, Type (SDType Light), Subclass, Materialization. Artifacts: MediaWiki dumps, extracted RDF, consolidated knowledge graph, ontology, DBkWik Linked Data endpoint.]
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
Metric | Raw | Final
Instances | 14,212,535 | 11,163,719
Typed instances | 1,880,189 | 1,372,971
Triples | 107,833,322 | 91,526,001
Avg. indegree | 0.624 | 0.703
Avg. outdegree | 7.506 | 8.169
Classes | 71,580 | 12,029
Properties | 506,487 | 128,566
DBkWik 1.1
• Fused graphs from 15k Wikis
http://dbkwik.webdatacommons.org/
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown),
T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M|
→ |T| = P * |M|
– R = |T| / |O|
→ |T| = R * |O|
→ |O| = |M| * P / R
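The estimate |O| = |M| * P / R is easy to try out. The numbers below are hypothetical, since the slides do not state |M|, P, or R; they are chosen only to show the arithmetic.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from a noisy mapping of size |M|.

    |T| = P * |M| (correct links found) and |T| = R * |O|,
    hence |O| = |M| * P / R.
    """
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    return mapping_size * precision / recall

# Hypothetical values: 400,000 links found, 90% of them correct,
# covering 72% of the true overlap.
estimated = estimate_overlap(400_000, precision=0.9, recall=0.72)
```

A low recall pushes the estimate up (many true links were missed), and a low precision pulls it down (many found links are wrong).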
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik
are not in DBpedia
– 90% of all entities in DBpedia
are not in DBkWik
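The percentages follow from the overlap and the graph sizes. The sketch below uses the DBkWik 1.1 "Final" instance count reported in the statistics of this deck. The DBpedia size is not given in the slides, so the code only derives the size that the stated 90% would imply, which is an inference rather than a reported figure.

```python
def share_not_covered(graph_size, overlap):
    """Fraction of a graph's entities that lie outside the overlap."""
    return 1 - overlap / graph_size

overlap = 500_000               # estimated overlap from the slide
dbkwik_instances = 11_163_719   # DBkWik 1.1, final instance count

dbkwik_only = share_not_covered(dbkwik_instances, overlap)

# Inference, not a reported figure: if 90% of DBpedia is outside the
# overlap, the overlap is 10% of DBpedia.
implied_dbpedia_size = overlap / 0.10
```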
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Figure: abstract extraction pipeline. Labels: Source, Simple WikiParser, Page, AST, LinkExtractor, NifExtractor, NewNifExtractor, HTML, Destination, Graph]
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– e.g., identify outdated information
Contributors
• Contributors (past & present): Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch, Nicolas Heist

 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理pyhepag
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 

Recently uploaded (20)

Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 

Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

  • 1. 12/10/2019 Heiko Paulheim 1 Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim
  • 2. The New Kids on the (Knowledge Graph) Block Subjective age: Measured by the fraction of the audience that understands a reference to your young days’ pop culture...
  • 3. Knowledge Graphs: Out of the Dark (figure labels: Google’s Announcement; DBpedia; YAGO; ResearchCyc; Wikidata; Freebase; NELL)
  • 4. A Brief History of Knowledge Graphs • The 1980s: Cyc – Cyc: Encyclopedic collection of knowledge – Started by Douglas Lenat in 1984 – Estimate: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge • The present (as of June 2017) – ~1,000 person years, $120M total development cost – 21M axioms and rules
  • 5. A Brief History of Knowledge Graphs • The 2010s – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  • 6. Getting the Most out of Wikipedia • Study for KG-based Recommender Systems* – DBpedia has a coverage of • 85% for movies • 63% for music artists • 31% for books *) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016 https://grouplens.org/datasets/
  • 7. Combining the Best of Three Worlds • DBpedia: detailed instance / relation extraction • YAGO: detailed classes • Cyc: rich axiomatization • Goal: get the best of all those worlds (figure labels: public; private) Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
  • 8. Towards CaLiGraph • YAGO uses categories for types – e.g., Category:American Industrial Groups – but does not analyze them further • :NineInchNails a :AmericanIndustrialGroup – “Things, not Strings”? • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  • 9. Cat2Ax: Axiomatizing Wikipedia Categories (figure: category tree with Albums at the root; Albums by genre contains Rock albums, Pop albums, Reggae albums, ...; Albums by artist contains Nine Inch Nails albums, The Doors albums, The Beatles albums, ...) Axioms shown: dbo:Album; dbo:artist.{dbr:Nine_Inch_Nails}; dbo:genre.{dbr:Rock_Music}
  • 10. Cat2Ax: Axiomatizing Wikipedia Categories (figure: same category tree as on slide 9) Candidates shown: dbo:genre.{dbr:Rock_Music} ? dbo:artist.{dbr:Rock_(Rapper)} ?
  • 11. Cat2Ax: Axiomatizing Wikipedia Categories – Frequency: how often does the pattern occur in a category? • i.e.: share of instances that have dbo:genre.{dbr:Rock_Music}? – Lexical score: likelihood of term as a surface form of object • i.e.: how often is Rock used to refer to dbr:Rock_Music? – Sibling score: how likely are sibling categories to share similar patterns? • i.e., are there sibling categories with a high score for dbo:genre? (figure: same category tree as on slide 9)
  • 12. Cat2Ax: Axiomatizing Wikipedia Categories • Results
  • 13. Improving Instance Coverage: Lists in Wikipedia • Only existing pages have categories – Lists may also link to non-existing pages
  • 14. CaLiGraph Example (figure labels: Category: Musical Groups established in 1987; List of symphonic metal bands; Category: Swedish death metal bands; List of Swedes in Music)
  • 15. CaLiGraph Statistics • Class hierarchy from categories and list pages – 750k classes • Axioms from categories and list pages – 200k definitions • Instances from list pages – 870k new instances
  • 16. CaLiGraph Glitches
  • 17. The Future of CaLiGraph • Going beyond red links • Going beyond explicit lists
  • 18. Knowledge Graph Creation Beyond Wikipedia
  • 19. A Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Wikipedia Dump (+ mappings) • Output: – DBpedia
  • 20. An Even Higher Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A MediaWiki Dump (+ mappings) • Output: – A Knowledge Graph
  • 21. What if…? • What if we went from Wikipedia to every MediaWiki? • According to WikiApiary, there are thousands...
  • 22. Why? • More is better (maybe)
  • 23. Why? • Overcoming Wikipedia’s coverage bias
  • 24. A Brief History of DBkWik • Started as a student project in 2017 • Task: run DBpedia EF on a large Wiki Farm – ...and see what happens
  • 25. DBkWik vs. DBpedia • Challenges – Getting dumps: only a fraction of Fandom Wikis has dumps – Downloadable from Fandom: 12,840 dumps – Tried: auto-requesting dumps
  • 26. Obtaining Dumps • We had to change our strategy: WikiTeam software – Produces dumps by crawling Wikis – Fandom has not blocked us so far :-) – Current collection: 307,466 Wikis → will go into DBkWik 1.2 release
  • 27. DBkWik vs. DBpedia • Mappings do not exist – no central ontology – i.e., only raw extraction possible • Duplicates exist – origin: pages about the same entity in different Wikis – unlike Wikipedia: often not explicitly linked • Different configurations of MediaWiki
  • 28. Absence of Mappings and Ontology • Every infobox becomes a class: {{infobox actor → mywiki:actor a owl:Class • Every infobox key becomes a property: |role = Harry’s mother → mywiki:role a rdf:Property • The resulting ontology is very shallow – No class hierarchy – No distinction of object and data properties – No domains and ranges
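The raw extraction described on slide 28 can be sketched roughly as follows. This is a minimal illustration, not the DBpedia Extraction Framework's actual code: the function name `extract_raw_schema`, the regular expressions, and the tuple representation of triples are all assumptions made for the example. Only the `mywiki:` prefix and the infobox-to-class, key-to-property idea come from the slide.

```python
import re

def extract_raw_schema(wikitext, prefix="mywiki"):
    """Turn one infobox into schema triples (hypothetical helper).

    The infobox name becomes an owl:Class and every infobox key becomes
    an rdf:Property, mirroring the mapping-free extraction on the slide.
    """
    # Find the template name after "{{infobox", up to "|", "}" or a line break.
    match = re.search(r"\{\{\s*infobox\s+([^|}\n]+)", wikitext, flags=re.IGNORECASE)
    if match is None:
        return []
    class_name = match.group(1).strip().replace(" ", "_")
    triples = [(f"{prefix}:{class_name}", "a", "owl:Class")]
    # Every "|key = value" line contributes one property declaration.
    for key in re.findall(r"^\s*\|\s*([^=|\n]+?)\s*=", wikitext, flags=re.MULTILINE):
        triples.append((f"{prefix}:{key.replace(' ', '_')}", "a", "rdf:Property"))
    return triples

if __name__ == "__main__":
    sample = "{{infobox actor\n|role = Harry's mother\n}}"
    for triple in extract_raw_schema(sample):
        print(" ".join(triple), ".")
```

Because nothing here consults a shared ontology, the output has no class hierarchy, no domains or ranges, and no split between object and data properties, which is the shallowness the slide points out.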
  • 29. Duplicates • Collecting Data from a Multitude of Wikis
  • 30. Representational Variety • No conventions across Wikis (besides using MediaWiki syntax) {{Person |name = Trent Reznor |image = TrentReznor.jpg |caption = Reznor at the [[83rd Academy Awards]] |nominations = 1 |wins = 1 |role = Composer |birthdate = May 17, 1965 |birthloc = Mercer, Pennsylvania, USA}} {{Infobox musician | Name = Trent Reznor | Birth_name = Michael Trent Reznor | Born = May 17, [[1965]] (age 53) | Origin = [[Mercer]], [[Pennsylvania]], [[United States]] ... }} {{Infobox cast |Name=Trent Reznor |Image= |ImageCaption= |character= |crew= |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States ... }}
  • 31. Data Fusion
  • 32. Naive Data Fusion and Linking to DBpedia • String similarity for schema matching (classes/properties) • doc2vec similarity on original pages for instance matching • Results – Classes and properties work OK – Instances are trickier – Internal linking seems easier • F1 scores (internal linking / linking to DBpedia): classes .979 / .898; properties .836 / .865; instances .879 / .657 (callout: maybe...)
  • 33. Improving Linking and Fusion • Started a new track at OAEI in 2018 – annual benchmark for matching tools • In 2019, some tools beat the baseline – albeit by a small margin only
  • 34. Results Data Fusion • Uneven distribution – e.g., character appears 5k times • Currently: no multi-linguality – e.g., Main Page, Hauptseite • Probably overloaded fusion (false positives) – e.g., next, location
  • 35. Light-weight Schema Induction • Class hierarchy and domain/range induction – Using association rule mining • e.g., Artist(x) → Person(x) – 5k class subsumption axioms – 59k domain restrictions – 114k range restrictions • Instance typing – With a light-weight version of SDType – Using the learned ranges as approximations of actual distributions • Result: ~100k new instance types (figure labels: Person?; Artist; Person)
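The rule Artist(x) → Person(x) on slide 35 can be read as an association rule over the types that instances carry. The sketch below shows one simple way to mine such rules; it is an illustration under stated assumptions, not the DBkWik implementation. The function name `mine_subsumptions`, the support and confidence thresholds, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import permutations

def mine_subsumptions(instance_types, min_support=2, min_confidence=0.9):
    """Propose (subclass, superclass, confidence) candidates.

    instance_types maps an instance ID to the set of classes it is typed
    with. A rule A(x) -> B(x) has confidence |A and B| / |A|. Both
    thresholds are illustrative assumptions, not values from the slides.
    """
    type_counts = Counter()
    pair_counts = Counter()
    for types in instance_types.values():
        type_counts.update(types)
        # Count every ordered pair of classes co-occurring on one instance.
        pair_counts.update(permutations(sorted(types), 2))
    axioms = []
    for (sub, sup), both in pair_counts.items():
        confidence = both / type_counts[sub]
        if both >= min_support and confidence >= min_confidence:
            axioms.append((sub, sup, confidence))
    return sorted(axioms)

if __name__ == "__main__":
    toy = {
        "a": {"Artist", "Person"},
        "b": {"Artist", "Person"},
        "c": {"Person"},
        "d": {"Person"},
    }
    print(mine_subsumptions(toy))
```

In the toy data every Artist is also a Person, but only half of the Persons are Artists, so only the Artist-to-Person direction passes the confidence threshold. Two classes that always co-occur would be proposed in both directions, which a real pipeline would have to treat as equivalence or resolve separately.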
  • 36. Big Picture (pipeline figure, labels: Dump Downloader; DBpedia Extraction Framework; Interlinking with Instance Matcher and Schema Matcher; MediaWiki Dumps; Extracted RDF; Internal Linking with Instance Matcher and Schema Matcher; Consolidated Knowledge Graph; DBkWik Linked Data Endpoint; Ontology; Knowledge Graph Fusion; Instance Matcher; Domain/Range; Type; SDType Light; Subclass; Materialization)
  • 37. DBkWik 1.1 • Source: ~15k Wiki dumps from Fandom – 52.4GB of data (roughly the size of the English Wikipedia) • Statistics (raw / final): instances 14,212,535 / 11,163,719; typed instances 1,880,189 / 1,372,971; triples 107,833,322 / 91,526,001; avg. indegree 0.624 / 0.703; avg. outdegree 7.506 / 8.169; classes 71,580 / 12,029; properties 506,487 / 128,566
  • 38. DBkWik 1.1 • Fused graphs from 15k Wikis http://dbkwik.webdatacommons.org/
  • 39. DBkWik 1.1 vs. other Knowledge Graphs • Caveat: – Minus non-recognized duplicates!
  • 40. DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and DBpedia? • Challenge: – We only have an incomplete and partly correct mapping M – But: we know its precision P and recall R • Trick (see KI paper 2017): – O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown) • By definition: – P = |T| / |M| → |T| = P * |M| – R = |T| / |O| → |T| = R * |O| → |O| = |M| * P / R (figure labels: DBkWik; DBpedia)
  • 41. DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and DBpedia? – |O| = |M| * P / R – Overlap: ~500k instances • In other words: – 95% of all entities in DBkWik are not in DBpedia – 90% of all entities in DBpedia are not in DBkWik (figure labels: DBkWik; DBpedia)
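The estimate |O| = |M| * P / R from slides 40 and 41 is simple enough to check numerically. The sketch below is an illustration only: the function names are hypothetical, and the mapping size, precision, and recall in the usage example are made-up values, since the slides report only the resulting overlap of roughly 500k instances.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from a noisy mapping M.

    Uses |T| = P * |M| and |T| = R * |O|, hence |O| = |M| * P / R.
    """
    if not 0 <= precision <= 1:
        raise ValueError("precision must lie in [0, 1]")
    if not 0 < recall <= 1:
        raise ValueError("recall must lie in (0, 1]")
    return mapping_size * precision / recall

def share_outside_overlap(overlap, graph_size):
    """Fraction of a graph's entities that are not in the overlap."""
    return 1 - overlap / graph_size

if __name__ == "__main__":
    # Made-up inputs for illustration; not the values used in the talk.
    overlap = estimate_overlap(mapping_size=400_000, precision=0.75, recall=0.6)
    print(f"estimated overlap: {overlap:,.0f}")
    # 11,163,719 is the final DBkWik 1.1 instance count from slide 37.
    print(f"share of DBkWik outside the overlap: {share_outside_overlap(overlap, 11_163_719):.1%}")
```

The point of the trick is that neither the true matches T nor the true overlap O has to be known: a sample-based estimate of P and R for the mapping is enough to scale |M| up or down.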
  • 42. Towards DBkWik 1.2 • Current crawl: 307,466 Wikis • Extraction: more robust for non-infobox templates – e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists • Robust abstract extraction – using SWEBLE parser – no local MediaWiki instance • Better matching • New gold standard (figure labels: NewNif Extractor; Source; Simple WikiParser; LinkExtractor; Page; NifExtractor; AST; Destination; Graph; HTML)
  • 43. Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  • 44. Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  • 45. Further Open Challenges • More detailed profiling of knowledge graphs – e.g., do we reduce or increase bias? • Task-based downstream evaluations – Does it improve, e.g., recommender systems? • Fusion policies – e.g., identify outdated information
  • 46. Contributors • Contributors (past & present): Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch, Nicolas Heist
  • 47. Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim