9/30/21 Heiko Paulheim 1
From Wikis to Knowledge Graphs:
Approaches and Challenges
beyond DBpedia and YAGO
Heiko Paulheim
A Brief History of Knowledge Graphs
[Timeline figure. Labels: Google’s Announcement, DBpedia, YAGO, ResearchCyc, Wikidata, Freebase, NELL]
Wikipedia as a Knowledge Graph
• Wikipedia based Knowledge Graphs
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
Wikipedia as a Knowledge Graph
[Figure. Labels: city, campus, state]
Wikipedia as a Knowledge Graph
• Mapping to a central schema/ontology
[Figure: ontology fragment]
– University subclass of Organisation; Organisation subclass of Agent; Person subclass of Agent
– chancellor: domain University, range Person
– campus: domain University, range Place
Wikipedia as a Knowledge Graph
• General characteristics of DBpedia and YAGO:
– Central schema/ontology
• DBpedia: crowdsourcing
• YAGO: WordNet + categories
– Mapping of infobox keys
• DBpedia: crowdsourcing
• YAGO: engineering
– One page per entity
• i.e.: set of entities = set of Wikipedia pages
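A minimal sketch of what mapping infobox keys to a central ontology could look like. The mapping table, the helper name `map_infobox`, and the example values are illustrative assumptions, not the actual DBpedia mappings or the extraction framework's API.

```python
# Hypothetical mapping tables (assumptions for illustration only).
TEMPLATE_CLASSES = {"university": "dbo:University"}
KEY_MAPPINGS = {
    ("university", "chancellor"): "dbo:chancellor",
    ("university", "campus"): "dbo:campus",
    ("university", "city"): "dbo:city",
}


def map_infobox(page, template, fields):
    """Turn one infobox into triples, one entity per page."""
    subject = "dbr:" + page.replace(" ", "_")
    template = template.lower()
    triples = []
    cls = TEMPLATE_CLASSES.get(template)
    if cls is not None:
        triples.append((subject, "rdf:type", cls))
    for key, value in fields.items():
        prop = KEY_MAPPINGS.get((template, key.lower()))
        if prop is None:
            continue  # keys without a mapping are dropped in this sketch
        triples.append((subject, prop, value))
    return triples


example = map_infobox(
    "University of Mannheim",
    "University",
    {"city": "Mannheim", "motto": "In Omnibus Veritas"},
)
```

Here the unmapped key `motto` yields no triple, which is one way coverage depends on the mappings.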
Getting the Most out of Wikipedia
• Study for KG-based Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
Why bother?
• Experiment w/ recommender systems (LDK 2021)
– Trained on five versions of DBpedia
– Biases become evident
• examined: genre, production country
Voit & Paulheim (2021): Bias in Knowledge Graphs.
Why bother?
• One key takeaway of that paper:
• Rethink parameter tuning and ablation studies!
– We see ablation studies on methods, parameters, etc.
– But rarely on knowledge graphs
– However, there are considerable differences
• observed in this work: factor of 2-3
• Especially:
– entity coverage and level of detail
Voit & Paulheim (2021): Bias in Knowledge Graphs.
Increasing Level of Detail
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
hometown :United_States ;
genre :Industrial .
Cat2Ax: Axiomatizing Wikipedia Categories
– dbo:Album
– dbo:artist.{dbr:Nine_Inch_Nails}
– dbo:genre.{dbr:Rock_Music}
See: ISWC 2019 Paper on Uncovering the Semantics of Wikipedia Categories
Cat2Ax: Axiomatizing Wikipedia Categories
– dbo:genre.{dbr:Rock_Music} ?
– dbo:artist.{dbr:Rock_(Rapper)} ?
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• i.e.: share of instances that have dbo:genre.{dbr:Rock_Music}?
– Lexical score: likelihood of term as a surface form of object
• i.e.: how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories sharing similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
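The first two scores above can be sketched as follows. This is an illustration under assumed data structures (`facts`, `surface_forms`) and invented counts, not the Cat2Ax implementation; the sibling score is left out.

```python
def frequency(members, facts, prop, obj):
    """Share of category members that have the fact (prop, obj)."""
    if not members:
        return 0.0
    hits = sum(1 for e in members if (prop, obj) in facts.get(e, set()))
    return hits / len(members)


def lexical_score(surface_forms, term, obj):
    """How often `term` refers to `obj`, relative to all uses of `term`."""
    counts = surface_forms.get(term, {})
    total = sum(counts.values())
    return counts.get(obj, 0) / total if total else 0.0


# Toy data (made up): four members of a hypothetical "Rock albums" category.
members = ["album_1", "album_2", "album_3", "album_4"]
facts = {
    "album_1": {("dbo:genre", "dbr:Rock_Music")},
    "album_2": {("dbo:genre", "dbr:Rock_Music")},
    "album_3": {("dbo:genre", "dbr:Rock_Music")},
    "album_4": set(),
}
# Toy surface form statistics (made up): what "Rock" refers to.
surface_forms = {"Rock": {"dbr:Rock_Music": 90, "dbr:Rock_(Rapper)": 10}}

freq_genre = frequency(members, facts, "dbo:genre", "dbr:Rock_Music")
freq_artist = frequency(members, facts, "dbo:artist", "dbr:Rock_(Rapper)")
lex_genre = lexical_score(surface_forms, "Rock", "dbr:Rock_Music")
lex_artist = lexical_score(surface_forms, "Rock", "dbr:Rock_(Rapper)")
```

With these toy numbers the genre reading scores higher than the rapper reading on both measures, which is the kind of evidence used to choose between the two candidate axioms.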
Cat2Ax: Axiomatizing Wikipedia Categories
• Results
CaLiGraph Example
Category: Musical Groups established in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
Improving Entity Coverage:
Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages
Pushing Entity Coverage Further
• Beyond red links (2020)
• Beyond explicit lists (2021)
Entity Extraction from List Pages
• Red and grey links
– Red links point to entities whose pages do not exist
– “Grey links”
• are actually not links
• i.e., entities to be discovered
Entity Extraction from List Pages
• Lists form (shallow) hierarchies
Entity Extraction from List Pages
• Idea: align with category graph
• Equivalence:
– “List of Japanese Writers” ↔ Category:Japanese Writers
• Subsumption:
– “List of Japanese Speculative Fiction Writers” → Category:Japanese Writers
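As a rough illustration of the two alignment relations, here is a naive token-based heuristic. It is an assumption made for this sketch (the function `align` and its rules are hypothetical) and is not claimed to be how the alignment is actually computed.

```python
def _tokens(name):
    """Lowercased word set of a list topic or category name."""
    return frozenset(name.lower().split())


def align(list_title, category_name):
    """Return 'equivalence', 'subsumption', or None for a list/category pair."""
    prefix = "list of "
    topic = list_title
    if list_title.lower().startswith(prefix):
        topic = list_title[len(prefix):]
    list_tokens = _tokens(topic)
    cat_tokens = _tokens(category_name)
    if list_tokens == cat_tokens:
        return "equivalence"
    if cat_tokens < list_tokens:
        # The list topic adds qualifiers, so it is narrower than the category.
        return "subsumption"
    return None


eq = align("List of Japanese Writers", "Japanese Writers")
sub = align("List of Japanese Speculative Fiction Writers", "Japanese Writers")
none = align("List of Swedes in Music", "Japanese Writers")
```

A purely lexical rule like this would miss paraphrases and can be fooled by word order, so it only conveys the idea of the two relations.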
Classifying Red Links
• Not all entities on a list page belong to the same category
• Idea:
– Learn classifier to tell subject entities from non-subject entities
• Distant supervision approach
– Positive examples:
• Entities that are in the corresponding category
– Negative examples:
• Entities that are in a category which is disjoint
• e.g., Book <> Writer
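The labelling scheme above can be sketched like this. The function name, the data layout, and the example entities are assumptions for illustration; entities that are neither positive nor negative stay unlabeled and would be left to the trained classifier.

```python
def label_list_entities(entities, categories_of, target, disjoint_with):
    """Derive training labels for the entities mentioned on one list page.

    1    -> entity is in the category corresponding to the list
    0    -> entity is in a category disjoint with it
    None -> unknown (e.g., red link), to be predicted later
    """
    labels = {}
    for entity in entities:
        cats = categories_of.get(entity, set())
        if target in cats:
            labels[entity] = 1
        elif cats & disjoint_with:
            labels[entity] = 0
        else:
            labels[entity] = None
    return labels


# Hypothetical page content for a list of writers.
labels = label_list_entities(
    entities=["Haruki Murakami", "Norwegian Wood", "Some Red Link"],
    categories_of={
        "Haruki Murakami": {"Writer"},
        "Norwegian Wood": {"Book"},
    },
    target="Writer",
    disjoint_with={"Book"},
)
```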
Classifying Red Links
• Using a mix of features
– Page layout, position of entities, statistical, linguistics, …
Classifying Red Links
• Enumerations work slightly better than tables
• Unevenly balanced
– >70% place, species, and person
Beyond List Pages
• Many pages contain list-like constructs
• Usually
– small
– same type
– same relation to page entity
– more grey links
…
Beyond List Pages
• Learning descriptive rules for listings, e.g.
– topSection(“Discography”) → artist.{>PageEntity<}
– Learning across pages to mitigate small data problems
• Metrics:
– Support: no. of listings covered by rule antecedent
– Confidence: frequency of rule consequent over all covered listings
– Consistency: mean absolute deviation of overall confidence and listing confidence
• i.e., does the rule work equally well across all covered listings
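The three metrics can be written down directly from the definitions above. This is a sketch under an assumed input format (one `(hits, total)` pair per covered listing); it reports the mean absolute deviation as such, so a lower value means the rule behaves more uniformly across listings.

```python
def rule_metrics(covered_listings):
    """covered_listings: one (hits, total) pair per listing matched by the
    rule antecedent, where `hits` counts entities for which the rule
    consequent holds and `total` counts all entities in the listing."""
    support = len(covered_listings)
    all_total = sum(total for _, total in covered_listings)
    if support == 0 or all_total == 0:
        return support, 0.0, 0.0
    confidence = sum(hits for hits, _ in covered_listings) / all_total
    per_listing = [hits / total for hits, total in covered_listings if total]
    deviation = sum(abs(confidence - c) for c in per_listing) / len(per_listing)
    return support, confidence, deviation


# Toy numbers (made up): three "Discography" listings covered by one rule.
support, confidence, deviation = rule_metrics([(8, 10), (9, 10), (4, 10)])
```

For the toy input, support is 3, confidence is 21/30 = 0.7, and the per-listing confidences 0.8, 0.9, and 0.4 deviate from it by 0.2 on average.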
Beyond List Pages
• Entity detection:
– Specialize spaCy tagger with entities on Wikipedia list pages
– Use spaCy tags for filtering (e.g., PER for Person etc.)
• Based on majority vote per class
– tag fit (i.e., proportion of “fitting” tags for class axioms) used for thresholding
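One possible reading of the majority vote and the tag fit, sketched with plain tag strings instead of a real tagger. The interpretation (expected tag = most frequent tag among known entities of a class), the function names, and the threshold value are assumptions.

```python
from collections import Counter


def majority_tag(tags_of_known_entities):
    """Most frequent tag observed for known entities of one class."""
    return Counter(tags_of_known_entities).most_common(1)[0][0]


def tag_fit(candidate_tags, expected_tag):
    """Proportion of candidate entities whose tag fits the expected one."""
    if not candidate_tags:
        return 0.0
    fitting = sum(1 for tag in candidate_tags if tag == expected_tag)
    return fitting / len(candidate_tags)


# Toy example: known Person entities were mostly tagged PER.
expected = majority_tag(["PER", "PER", "ORG", "PER"])
fit = tag_fit(["PER", "PER", "LOC", "PER"], expected)
THRESHOLD = 0.5  # hypothetical cut-off, not taken from the slides
keep_axiom = fit >= THRESHOLD
```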
Beyond List Pages
• We can learn
– ~5M rules for types
– ~3k rules for relations
• Identify ~2M new entities
– incl. type and relations within KG
• Post hoc inspection of axioms:
– Accuracy >90%
[Pie chart: Person 34%, Species 24%, Place 11%, Work 6%, Other 25%]
CaLiGraph at a Glance
• Latest version 2.1
– 15M entities
• incl. 8M from listings
– Caveat:
• disambiguation!
Entity Disambiguation
• Examples: Wikipedia pages of Die Krupps and Eisbrecher
CaLiGraph Glitches
CaLiGraph Challenges & Open Issues
• Entity disambiguation
• Usage of formal ontologies (e.g., country is functional)
• Extracting information directly from the context
[Figure: Achim_Färber, instrument, Drums]
Knowledge Graph Creation Beyond Wikipedia
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump (+ mappings)
• Output:
– DBpedia
A Satellite View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump (+ mappings)
• Output:
– A Knowledge Graph
What if…?
• What if we went from Wikipedia to every MediaWiki?
• According to WikiApiary, there are thousands...
Why?
• More is better (maybe)
Why?
• Overcoming Wikipedia’s coverage bias
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: >300k Wikis
→ will go into DBkWik 1.2 release
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
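A minimal sketch of such a raw extraction without mappings. The function name `extract_raw`, the `mywiki` prefix handling, and the regular expressions are assumptions; a real extractor has to handle nested templates, links, and multi-line values, which this sketch ignores.

```python
import re

# One "|key = value" pair per line; values are kept as plain strings.
FIELD = re.compile(r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)
HEADER = re.compile(r"\{\{\s*infobox\s+([^|}\n]+)", re.IGNORECASE)


def extract_raw(wiki, page, wikitext):
    """Every infobox becomes a class, every infobox key a property."""
    subject = f"{wiki}:{page.replace(' ', '_')}"
    triples = []
    header = HEADER.search(wikitext)
    if header:
        cls = f"{wiki}:{header.group(1).strip().replace(' ', '_')}"
        triples.append((cls, "rdf:type", "owl:Class"))
        triples.append((subject, "rdf:type", cls))
    for key, value in FIELD.findall(wikitext):
        if not value:
            continue  # empty keys carry no information
        prop = f"{wiki}:{key.replace(' ', '_')}"
        triples.append((prop, "rdf:type", "rdf:Property"))
        triples.append((subject, prop, value))
    return triples


triples = extract_raw(
    "mywiki",
    "Some Page",  # hypothetical page title
    "{{infobox actor\n|role = Harry's mother\n|image =\n}}",
)
```

Note that the output contains no class hierarchy, no domains or ranges, and every value is a literal, which mirrors the shallowness described above.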
Duplicates
• Collecting Data from a Multitude of Wikis
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption = Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}}
Data Fusion
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score      Internal Linking   Linking to DBpedia
Classes       .979               .898
Properties    .836               .865
Instances     .879               .657
maybe...
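A naive string-similarity schema matcher can be sketched with the standard library. The normalization, the use of `difflib.SequenceMatcher`, and the threshold of 0.8 are assumptions for illustration, not the similarity measure or setting used for the numbers above.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase and treat underscores as spaces."""
    return label.lower().replace("_", " ").strip()


def match_schema(labels_a, labels_b, threshold=0.8):
    """For each label in A, keep the most similar label in B if it is
    similar enough. Returns (label_a, label_b, similarity) triples."""
    matches = []
    for a in labels_a:
        best, best_score = None, 0.0
        for b in labels_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None and best_score >= threshold:
            matches.append((a, best, round(best_score, 3)))
    return matches


# Property labels from two hypothetical wikis.
matches = match_schema(["Birth_name", "birthloc"], ["birth name", "Origin"])
```

In this toy run `Birth_name` is matched, while `birthloc` finds no counterpart although `Origin` may carry the same meaning, which hints at the limits of purely lexical matching.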
Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• From 2019, some tools started beating the baseline
– albeit by a small margin only
The Golden Hammer Bias
• Challenge:
– OAEI tools expect two related KGs
– but: 300k KGs can only be matched without manual pre-inspection
See: ESWC 2020 Paper on OAEI Knowledge Graph Track
Big Picture
[Figure: DBkWik pipeline. Components: Dump Downloader; MediaWiki Dumps; DBpedia Extraction Framework; Extracted RDF; Interlinking (Instance Matcher, Schema Matcher); Internal Linking (Instance Matcher, Schema Matcher); Ontology; Knowledge Graph Fusion (Instance Matcher; Domain/Range; Type: SDType Light; Subclass; Materialization); Consolidated Knowledge Graph; DBkWik Linked Data Endpoint]
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
                  Raw           Final
Instances         14,212,535    11,163,719
Typed instances   1,880,189     1,372,971
Triples           107,833,322   91,526,001
Avg. indegree     0.624         0.703
Avg. outdegree    7.506         8.169
Classes           71,580        12,029
Properties        506,487       128,566
Towards DBkWik 1.2
• Again, we have an entity resolution problem
– with entities from 300k sources
• Strategies
• (i) pairwise (O(n²))
• (ii-iv) transitive pairs
• (iii-v) incremental merge
• Ordering by
• smallest/largest first
• source similarity
[Figure: merge strategies (i) to (vi) over sources A, B, C, D, grouped into pairwise, transitive pairs, and incremental merge, with similarity-based and order-based variants]
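To give a feeling for the scale, the number of matching tasks for the pairwise strategy can be compared with a sequential incremental merge. The helper names are hypothetical, and 300,000 is the rounded source count from the slide.

```python
import math


def pairwise_tasks(n_sources):
    """Every source is matched against every other source once: n*(n-1)/2."""
    return math.comb(n_sources, 2)


def incremental_steps(n_sources):
    """Each step merges one further source into the running result."""
    return max(n_sources - 1, 0)


n = 300_000
pairs = pairwise_tasks(n)     # 44,999,850,000 source pairs
steps = incremental_steps(n)  # 299,999 match & merge steps
```

The incremental variant needs far fewer steps, but each step works against a growing merged graph, so the step counts alone do not determine the runtime.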
Towards DBkWik 1.2
• Preliminary results
– Incremental merge works best (quality close to pairwise)
• Main challenge: runtime
– One match&merge step takes ~20 minutes
– i.e., 300k steps are more than 11 years!
• Idea: parallelization
– best case: tree height of 18
→ best runtime (fully parallel): six hours
[Figure: sequential incremental merge of A to H (A+B → AB, AB+C → ABC, ..., ABCDEFG+H) vs. balanced merge tree (AB, CD, EF, GH → ABCD, EFGH)]
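The two merge orders and the runtime arithmetic can be reproduced with a small sketch. The function names are hypothetical; the 20 minutes per step and the tree height of 18 are the figures from the slide (log2 of 300,000 is about 18.2).

```python
MINUTES_PER_STEP = 20  # figure taken from the slide


def incremental_plan(sources):
    """Sequential chain: merge the running result with the next source."""
    if not sources:
        return []
    steps, merged = [], sources[0]
    for nxt in sources[1:]:
        steps.append((merged, nxt))
        merged = merged + nxt
    return steps


def balanced_plan(sources):
    """Rounds of independent pairwise merges that could run in parallel."""
    rounds, level = [], list(sources)
    while len(level) > 1:
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        nxt = [a + b for a, b in pairs]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out moves up unmerged
        rounds.append(pairs)
        level = nxt
    return rounds


def sequential_years(n_steps):
    return n_steps * MINUTES_PER_STEP / (60 * 24 * 365)


def parallel_hours(tree_height):
    return tree_height * MINUTES_PER_STEP / 60


chain = incremental_plan(list("ABCDEFGH"))  # 7 sequential steps
rounds = balanced_plan(list("ABCDEFGH"))    # 3 parallel rounds
years = sequential_years(300_000)           # roughly 11.4 years
hours = parallel_hours(18)                  # 6.0 hours
```

The best case assumes that every merge in a round takes the same 20 minutes and that enough workers are available, which is an idealization.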
Summary
• DBpedia and YAGO
– one source (Wikipedia, multiple languages)
– one entity per page paradigm
• CaLiGraph
– one source (Wikipedia, multiple languages would be possible)
– extraction from list-like constructs
• further possible extension: list-like constructs outside of Wikipedia
– current open challenge: entity resolution
• DBkWik
– extraction from thousands of Wikis
– current open challenge: entity resolution
• in particular: scalability!
Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
– and: is that good or bad?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– schema level, e.g., many shallow ontologies → one deep ontology?
– instance level, e.g., identify outdated information
Contributors
• Contributors (past & present)
Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch, Nicolas Heist

NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 

Recently uploaded (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 

From Wikis to Knowledge Graphs

  • 1. 9/30/21 Heiko Paulheim 1 From Wikis to Knowledge Graphs: Approaches and Challenges beyond DBpedia and YAGO Heiko Paulheim
  • 2. A Brief History of Knowledge Graphs
      [Timeline labels: Google’s Announcement, DBpedia, YAGO, ResearchCyc, Wikidata, Freebase, NELL]
  • 3. Wikipedia as a Knowledge Graph
      • Wikipedia based Knowledge Graphs
        – DBpedia: launched 2007
        – YAGO: launched 2008
        – Extraction from Wikipedia using mappings & heuristics
      • Present
        – Two of the most used knowledge graphs
        – ...with Wikidata catching up
  • 4. Wikipedia as a Knowledge Graph
  • 5. Wikipedia as a Knowledge Graph
      [Figure labels: city, campus, state]
  • 6. Wikipedia as a Knowledge Graph
      • Mapping to a central schema/ontology
      [Figure: ontology excerpt with the classes University, Organisation, Agent, Person, and Place; the properties chancellor and campus with their domains and ranges; and subclass-of links]
  • 7. Wikipedia as a Knowledge Graph
      • General characteristics of DBpedia and YAGO:
        – Central/schema ontology
          • DBpedia: crowdsourcing
          • YAGO: WordNet + categories
        – Mapping of infobox keys
          • DBpedia: crowdsourcing
          • YAGO: engineering
        – One page per entity
          • i.e.: set of entities = set of Wikipedia pages
  • 8. Getting the Most out of Wikipedia
      • Study for KG-based Recommender Systems*
        – DBpedia has a coverage of
          • 85% for movies
          • 63% for music artists
          • 31% for books
      *) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016
      https://grouplens.org/datasets/
  • 9. Why bother?
      • Experiment w/ recommender systems (LDK 2021)
        – Trained on five versions of DBpedia
        – Biases become evident
          • examined: genre, production country
      Voit & Paulheim (2021): Bias in Knowledge Graphs.
  • 10. Why bother?
      • One key takeaway of that paper:
      • Rethink parameter tuning and ablation studies!
        – We see ablation studies on methods, parameters, etc.
        – But rarely on knowledge graphs
        – However, there are considerable differences
          • observed in this work: factor of 2-3
      • Especially:
        – entity coverage and level of detail
      Voit & Paulheim (2021): Bias in Knowledge Graphs.
  • 11. Increasing Level of Detail
      • YAGO uses categories for types
        – e.g., Category:American Industrial Groups
        – but does not analyze them further
          • :NineInchNails a :AmericanIndustrialGroup
        – “Things, not Strings”?
          • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  • 12. Cat2Ax: Axiomatizing Wikipedia Categories
      – dbo:Album
      – dbo:artist.{dbr:Nine_Inch_Nails}
      – dbo:genre.{dbr:Rock_Music}
      See: ISWC 2019 Paper on Uncovering the Semantics of Wikipedia Categories
  • 13. Cat2Ax: Axiomatizing Wikipedia Categories
      – dbo:genre.{dbr:Rock_Music} ?
      – dbo:artist.{dbr:Rock_(Rapper)} ?
  • 14. Cat2Ax: Axiomatizing Wikipedia Categories
      – Frequency: how often does the pattern occur in a category?
        • i.e.: what share of instances has dbo:genre.{dbr:Rock_Music}?
      – Lexical score: likelihood of term as a surface form of object
        • i.e.: how often is Rock used to refer to dbr:Rock_Music?
      – Sibling score: how likely are sibling categories to share similar patterns?
        • i.e., are there sibling categories with a high score for dbo:genre?
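The first two scores on slide 14 can be illustrated with a small sketch. It is not the Cat2Ax implementation: the data structures, the function names, and the toy counts below are assumptions made up for illustration, and the slide does not say how the scores are combined, so no combination is shown.

```python
from typing import Dict, Set, Tuple

# A pattern is a (property, value) pair, e.g. ("dbo:genre", "dbr:Rock_Music").
Pattern = Tuple[str, str]


def pattern_frequency(members: Set[str],
                      facts: Dict[str, Set[Pattern]],
                      pattern: Pattern) -> float:
    """Share of a category's instances that carry the given pattern."""
    if not members:
        return 0.0
    hits = sum(1 for m in members if pattern in facts.get(m, set()))
    return hits / len(members)


def lexical_score(surface_counts: Dict[str, Dict[str, int]],
                  term: str, entity: str) -> float:
    """How often `term` refers to `entity`, relative to all uses of `term`."""
    targets = surface_counts.get(term, {})
    total = sum(targets.values())
    return targets.get(entity, 0) / total if total else 0.0


# Hypothetical toy data: four albums in one category, three tagged as rock.
rock = ("dbo:genre", "dbr:Rock_Music")
facts = {
    "album1": {rock},
    "album2": {rock},
    "album3": {rock},
    "album4": {("dbo:genre", "dbr:Pop_Music")},
}
members = set(facts)

# Hypothetical anchor-text counts for the surface form "Rock".
surface_counts = {"Rock": {"dbr:Rock_Music": 8, "dbr:Rock_(Rapper)": 2}}

freq = pattern_frequency(members, facts, rock)
lex_genre = lexical_score(surface_counts, "Rock", "dbr:Rock_Music")
lex_rapper = lexical_score(surface_counts, "Rock", "dbr:Rock_(Rapper)")
```

With these toy counts, the genre reading of "Rock" gets a higher lexical score than the rapper reading, which is the kind of evidence slide 13 asks for.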
  • 15. Cat2Ax: Axiomatizing Wikipedia Categories
      • Results
  • 16. CaLiGraph Example
      [Figure labels: Category: Musical Groups established in 1987; List of symphonic metal bands; Category: Swedish death metal bands; List of Swedes in Music]
  • 17. Improving Entity Coverage: Lists in Wikipedia
      • Only existing pages have categories
        – Lists may also link to non-existing pages
  • 18. Pushing Entity Coverage Further
      • Beyond red links (2020)
      • Beyond explicit lists (2021)
  • 19. Entity Extraction from List Pages
      • Red and grey links
        – Red links point to entities that do not exist
        – “Grey links”
          • are actually not links
          • i.e., entities to be discovered
  • 20. Entity Extraction from List Pages
      • Lists form (shallow) hierarchies
  • 21. Entity Extraction from List Pages
      • Idea: align with category graph
      • Equivalence:
        – “List of Japanese Writers” ↔ Category:Japanese Writers
      • Subsumption:
        – “List of Japanese Speculative Fiction Writers” → Category:Japanese Writers
  • 22. Classifying Red Links
      • Not all entities on a list page belong to the same category
      • Idea:
        – Learn classifier to tell subject entities from non-subject entities
      • Distant learning approach
        – Positive examples:
          • Entities that are in the corresponding category
        – Negative examples:
          • Entities that are in a category which is disjoint
          • e.g., Book <> Writer
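The labelling scheme on slide 22 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the category labels, and the entity names are hypothetical, and the actual pipeline's data model is not described on the slide.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def label_list_entities(entities: Iterable[str],
                        entity_categories: Dict[str, Set[str]],
                        target: str,
                        disjoint_with: Set[str]) -> List[Tuple[str, Optional[int]]]:
    """Derive training labels for the entities mentioned on one list page.

    1    -> entity is in the category aligned with the list (positive)
    0    -> entity is in a category disjoint with it (negative)
    None -> no evidence, e.g. a red link; left for the classifier to decide
    """
    labelled = []
    for entity in entities:
        categories = entity_categories.get(entity, set())
        if target in categories:
            label: Optional[int] = 1
        elif categories & disjoint_with:
            label = 0
        else:
            label = None
        labelled.append((entity, label))
    return labelled


# Hypothetical entities found on a list page aligned with the category "Writer".
entity_categories = {
    "Writer_A": {"Writer"},
    "Novel_B": {"Book"},
    # "Red_Link_C" has no page and therefore no categories.
}
labels = label_list_entities(
    ["Writer_A", "Novel_B", "Red_Link_C"],
    entity_categories,
    target="Writer",
    disjoint_with={"Book"},
)
```

Only the entities labelled 1 or 0 would serve as training examples; the unlabelled ones are what the trained classifier is then applied to.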
  • 23. Classifying Red Links
      • Using a mix of features
        – Page layout, position of entities, statistical, linguistics, …
  • 24. Classifying Red Links
      • Enumerations work slightly better than tables
      • Unevenly balanced
        – >70% place, species, and person
  • 25. Beyond List Pages
      • Many pages contain list-like constructs
      • Usually
        – small
        – same type
        – same relation to page entity
        – more grey links …
  • 26. Beyond List Pages
  • 27. Beyond List Pages
      • Learning descriptive rules for listings, e.g.
        – topSection(“Discography”) → artist.{>PageEntity<}
        – Learning across pages to mitigate small data problems
      • Metrics:
        – Support: no. of listings covered by rule antecedent
        – Confidence: frequency of rule consequent over all covered listings
        – Consistency: mean absolute deviation of overall confidence and listing confidence
          • i.e., does the rule work equally well across all covered listings
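One possible reading of the three rule metrics on slide 27 is sketched below. The slide does not give formulas, so two points are assumptions: confidence is computed over the pooled entities of all covered listings, and the consistency figure is returned as the raw mean absolute deviation (lower means the rule behaves more uniformly). The input format and the toy numbers are invented for illustration.

```python
from typing import List, Tuple

# One covered listing: (entities for which the consequent holds, entities in total).
Listing = Tuple[int, int]


def rule_metrics(covered: List[Listing]) -> Tuple[int, float, float]:
    """Return (support, confidence, mean absolute deviation) for one rule."""
    support = len(covered)  # number of listings matched by the antecedent
    hits = sum(h for h, _ in covered)
    total = sum(t for _, t in covered)
    confidence = hits / total if total else 0.0
    # Per-listing confidence, skipping empty listings.
    per_listing = [h / t for h, t in covered if t]
    if per_listing:
        deviation = sum(abs(c - confidence) for c in per_listing) / len(per_listing)
    else:
        deviation = 0.0
    return support, confidence, deviation


# Hypothetical rule covering three listings with 4/4, 3/4 and 1/2 correct entities.
support, confidence, deviation = rule_metrics([(4, 4), (3, 4), (1, 2)])
```

In this toy case the rule has a pooled confidence of 0.8, but the third listing pulls the deviation up, signalling that the rule does not work equally well everywhere.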
  • 28. Beyond List Pages
      • Entity detection:
        – Specialize SpaCy tagger with entities on Wikipedia list pages
        – Use SpaCy tags for filtering (e.g., PER for Person etc.)
      • Based on majority vote per class
        – tag fit (i.e., proportion of “fitting” tags for class axioms) used for thresholding
  • 29. Beyond List Pages
      • We can learn
        – ~5M rules for types
        – ~3k rules for relations
      • Identify ~2M new entities
        – incl. type and relations within KG
      • Post hoc inspection of axioms:
        – Accuracy >90%
      [Chart: Person 34%, Species 24%, Place 11%, Work 6%, Other 25%]
  • 30. CaLiGraph at a Glance
      • Latest version 2.1
        – 15M entities
          • incl. 8M from listings
        – Caveat:
          • disambiguation!
  • 31. Entity Disambiguation
      • Examples: Wikipedia pages of Die Krupps and Eisbrecher ?
  • 32. CaLiGraph Glitches
  • 33. CaLiGraph Challenges & Open Issues
      • Entity disambiguation
      • Usage of formal ontologies (e.g., country is functional)
      • Extracting information directly from the context
      [Figure labels: Achim_Färber, Drums, instrument]
  • 34. Knowledge Graph Creation Beyond Wikipedia
  • 35. A Bird’s Eye View on DBpedia EF
      • DBpedia Extraction Framework
      • Input:
        – A Wikipedia Dump (+ mappings)
      • Output:
        – DBpedia
  • 36. A Satellite View on DBpedia EF
      • DBpedia Extraction Framework
      • Input:
        – A MediaWiki Dump (+ mappings)
      • Output:
        – A Knowledge Graph
  • 37. What if…?
      • What if we went from Wikipedia to every MediaWiki?
      • According to WikiApiary, there are thousands...
  • 38. Why?
      • More is better (maybe)
  • 39. Why?
      • Overcoming Wikipedia’s coverage bias
  • 40. A Brief History of DBkWik
      • Started as a student project in 2017
      • Task: run DBpedia EF on a large Wiki Farm
        – ...and see what happens
  • 41. DBkWik vs. DBpedia
      • Challenges
        – Getting dumps: only a fraction of Fandom Wikis has dumps
        – Downloadable from Fandom: 12,840 dumps
        – Tried: auto-requesting dumps
  • 42. Obtaining Dumps
      • We had to change our strategy: WikiTeam software
        – Produces dumps by crawling Wikis
        – Fandom has not blocked us so far :-)
        – Current collection: >300k Wikis → will go into DBkWik 1.2 release
  • 43. DBkWik vs. DBpedia
      • Mappings do not exist
        – no central ontology
        – i.e., only raw extraction possible
      • Duplicates exist
        – origin: pages about the same entity in different Wikis
        – unlike Wikipedia: often not explicitly linked
      • Different configurations of MediaWiki
  • 44. Absence of Mappings and Ontology
      • Every infobox becomes a class:
        {{infobox actor → mywiki:actor a owl:Class
      • Every infobox key becomes a property:
        |role = Harry’s mother → mywiki:role a rdf:Property
      • The resulting ontology is very shallow
        – No class hierarchy
        – No distinction of object and data properties
        – No domains and ranges
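The raw extraction on slide 44 can be sketched in a few lines. This is not the DBpedia Extraction Framework: the parser below is a deliberately naive stand-in that only handles a flat template without nested templates or links, and the function name, the page name, and the triple layout are assumptions for illustration. Only the `mywiki:` prefix and the class and property typing come from the slide.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def infobox_to_triples(wikitext: str, page: str,
                       prefix: str = "mywiki:") -> List[Triple]:
    """Turn one flat infobox into class, property, and instance triples."""
    body = wikitext.strip()
    if body.startswith("{{"):
        body = body[2:]
    if body.endswith("}}"):
        body = body[:-2]
    parts = [p.strip() for p in body.split("|")]

    # The template name becomes the class, e.g. "infobox actor" -> mywiki:actor.
    name = parts[0]
    if name.lower().startswith("infobox"):
        name = name[len("infobox"):].strip()
    cls = prefix + name.replace(" ", "_")
    subject = prefix + page

    triples: List[Triple] = [(cls, "rdf:type", "owl:Class"),
                             (subject, "rdf:type", cls)]
    for part in parts[1:]:
        if "=" not in part:
            continue
        key, value = (s.strip() for s in part.split("=", 1))
        if not key or not value:
            continue  # skip empty keys and unset values
        prop = prefix + key.replace(" ", "_")
        # Every key becomes an untyped rdf:Property: no domain, range, or
        # object/data distinction, which is why the ontology stays shallow.
        triples.append((prop, "rdf:type", "rdf:Property"))
        triples.append((subject, prop, value))
    return triples


# Hypothetical page using the example key from the slide.
triples = infobox_to_triples(
    "{{infobox actor\n|role = Harry's mother\n}}", page="Example_Actor")
```

Because the value `Harry's mother` stays a plain string, nothing in this output says whether `mywiki:role` points to an entity or a literal, which matches the limitations listed on the slide.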
  • 45. Duplicates
      • Collecting Data from a Multitude of Wikis
  • 46. Representational Variety
      • No conventions across Wikis (besides using MediaWiki syntax)
        {{Person |name = Trent Reznor |image = TrentReznor.jpg |caption - Reznor at the [[83rd Academy Awards]] |nominations = 1 |wins = 1 |role = Composer |birthdate = May 17, 1965 |birthloc = Mercer, Pennsylvania, USA}}
        {{Infobox musician | Name = Trent Reznor | Birth_name = Michael Trent Reznor | Born = May 17, [[1965]] (age 53) | Origin = [[Mercer]], [[Pennsylvania]], [[United States]] ... }}
        {{Infobox cast |Name=Trent Reznor |Image= |ImageCaption= |character= |crew= |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States ... }
  • 47. Data Fusion
  • 48. Naive Data Fusion and Linking to DBpedia
      • String similarity for schema matching (classes/properties)
      • doc2vec similarity on original pages for instance matching
      • Results
        – Classes and properties work OK
        – Instances are trickier
        – Internal linking seems easier (maybe...)
      F1 score       Internal Linking   Linking to DBpedia
      Classes        .979               .898
      Properties     .836               .865
      Instances      .879               .657
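A string-similarity schema matcher in the spirit of slide 48 might look like the sketch below. The slide does not name the similarity measure or threshold used, so `difflib.SequenceMatcher` from the Python standard library and the 0.9 cut-off are stand-ins, and the label lists are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalise(label: str) -> str:
    """Lower-case a class or property label and unify separators."""
    return label.lower().replace("_", " ").strip()


def match_labels(source: List[str], target: List[str],
                 threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Link each source label to its most similar target label, if similar enough."""
    matches = []
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalise(s), normalise(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            matches.append((s, best, best_score))
    return matches


# Hypothetical property labels from two wikis.
matches = match_labels(["Birth_name", "birthloc"],
                       ["birth name", "birth place", "genre"])
```

The abbreviated key `birthloc` stays unmatched at this threshold, which hints at why purely string-based matching only works "OK" when wikis follow no shared naming conventions.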
  • 49. Improving Linking and Fusion
      • Started a new track at OAEI in 2018
        – annual benchmark for matching tools
      • From 2019, some tools started beating the baseline
        – albeit by a small margin only
  • 50. The Golden Hammer Bias
      • Challenge:
        – OAEI tools expect two related KGs
        – but: 300k KGs can only be matched without manual pre-inspection
      See: ESWC 2020 Paper on OAEI Knowledge Graph Track
  • 51. Big Picture
      [Figure: pipeline diagram with the labels Dump Downloader, MediaWiki Dumps, DBpedia Extraction Framework, Extracted RDF, Interlinking (Instance Matcher, Schema Matcher), Internal Linking (Instance Matcher, Schema Matcher), Knowledge Graph Fusion, Instance Matcher, Domain/Range, Type, SDType Light, Subclass, Materialization, Ontology, Consolidated Knowledge Graph, DBkWik Linked Data Endpoint]
  • 52. DBkWik 1.1
      • Source: ~15k Wiki dumps from Fandom
        – 52.4GB of data (roughly the size of the English Wikipedia)
                          Raw            Final
      Instances           14,212,535     11,163,719
      Typed instances     1,880,189      1,372,971
      Triples             107,833,322    91,526,001
      Avg. indegree       0.624          0.703
      Avg. outdegree      7.506          8.169
      Classes             71,580         12,029
      Properties          506,487        128,566
  • 53. Towards DBkWik 1.2
      • Again, we have an entity resolution problem
        – with entities from 300k sources
      • Strategies
        • (i) pairwise (O(n²))
        • (ii-iv) transitive pairs
        • (iii-v) incremental merge
      • Ordering by
        • smallest/largest first
        • source similarity
      [Figure: merge strategies (i) to (vi) over sources A, B, C, D, grouped into transitive pairs and incremental merge, and into order based and similarity based]
  • 54. Towards DBkWik 1.2
      • Preliminary results
        – Incremental merge works best (quality close to pairwise)
      • Main challenge: runtime
        – One match&merge step takes ~20 minutes
        – i.e., 300k steps are more than 11 years!
      • Idea: parallelization
        – best case: tree height of 18 → best runtime (fully parallel): six hours
      [Figure: sequential merge chain (A, B → AB → ABC → ... → ABCDEFG, H) next to a balanced merge tree (AB, CD, EF, GH → ABCD, EFGH) over sources A to H]
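The runtime figures on slide 54 follow from simple arithmetic, reproduced here as a check. The step duration, the number of steps, and the tree height of 18 are taken from the slide; the variable names are only illustrative, and the binary logarithm is computed for comparison rather than as a correction.

```python
import math

STEP_MINUTES = 20      # one match&merge step, as stated on the slide
NUM_STEPS = 300_000    # roughly one step per wiki in a sequential merge
TREE_HEIGHT = 18       # best-case tree height quoted on the slide

# Sequential chain: every step waits for the previous one.
sequential_minutes = NUM_STEPS * STEP_MINUTES
sequential_years = sequential_minutes / (60 * 24 * 365)

# Fully parallel balanced tree: only the levels are sequential.
parallel_hours = TREE_HEIGHT * STEP_MINUTES / 60

# For comparison: binary logarithm of the number of sources.
log2_sources = math.log2(NUM_STEPS)

print(f"sequential: {sequential_years:.1f} years")
print(f"parallel:   {parallel_hours:.0f} hours")
print(f"log2(300k): {log2_sources:.2f}")
```

The parallel estimate assumes unlimited workers and that a merge step near the root, where the merged graphs are much larger, still takes about 20 minutes, so it is a best case as the slide says.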
  • 55. Summary
      • DBpedia and YAGO
        – one source (Wikipedia, multiple languages)
        – one entity per page paradigm
      • CaLiGraph
        – one source (Wikipedia, multiple languages would be possible)
        – extraction from list-like constructs
          • further possible extension: list-like constructs outside of Wikipedia
        – current open challenge: entity resolution
      • DBkWik
        – extraction from thousands of Wikis
        – current open challenge: entity resolution
          • in particular: scalability!
  • 56. Further Open Challenges
      • More detailed profiling of knowledge graphs
        – e.g., do we reduce or increase bias?
        – and: is that good or bad?
      • Task-based downstream evaluations
        – Does it improve, e.g., recommender systems?
      • Fusion policies
        – schema level, e.g., many shallow ontologies → one deep ontology?
        – instance level, e.g., identify outdated information
  • 57. Contributors
      • Contributors (past & present): Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch, Nicolas Heist
  • 58. From Wikis to Knowledge Graphs: Approaches and Challenges beyond DBpedia and YAGO
      Heiko Paulheim