Knowledge graphs on the Web

Collaboratively building
Knowledge Graphs on the Web
Armin Haller
Associate Professor, ANU

Data deluge
Impossible to manually process even a fraction of this information …
… we need to prepare for a post-big data world.

Machine Learning/AI
ML/AI approaches are performing extremely well in dealing
with such massive amounts of data on tasks such as:
– Image Recognition
– Speech Recognition
– Product recommendations
– Question & Answering
– Spam filtering
… and for none of these applications do we need an
explanation of the learned facts.

Machine Learning/AI and its limitations
However, when it comes to:
– Self driving cars
– Medical diagnosis
– Drug design
– Robot interactions
– Military applications
– etc.
Humans need to understand the rationale of a decision.
– Facebook employs nearly 15,000 people to moderate posts
deemed inappropriate by ML/AI

eXplainable AI
XAI requires
• Encoding of context (Who, What, How,
When...)
• Encoding the semantics of inputs,
outputs and their properties
• Encoding of common sense knowledge
(e.g., one sits on a chair and eats on a
table)

Knowledge Graphs (KGs)
• Performance and explainability of ML
improve when data is given a context
– a Knowledge Graph increases the informative value
of the collected data that is given to the model
Knowledge Graphs [Paulheim 2017]
– describe real-world entities and their interrelations
– define possible classes and relations of entities in a
schema (ontology)
– allow for interrelating arbitrary entities with each
other
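
As a rough illustration of the three points above, a KG can be sketched as a set of subject-predicate-object triples, with schema statements and data statements living in the same graph. The `ex:` identifiers below are made up for this example and are not taken from any real KG.

```python
# Minimal sketch: a knowledge graph as a set of (subject, predicate, object) triples.
# All ex: identifiers are hypothetical examples.

# Schema (ontology): possible classes and relations.
schema = {
    ("ex:Composer", "rdfs:subClassOf", "ex:Person"),
    ("ex:bornIn", "rdfs:domain", "ex:Person"),
    ("ex:bornIn", "rdfs:range", "ex:City"),
}

# Data: real-world entities and their interrelations.
data = {
    ("ex:Mozart", "rdf:type", "ex:Composer"),
    ("ex:Salzburg", "rdf:type", "ex:City"),
    ("ex:Mozart", "ex:bornIn", "ex:Salzburg"),
}

kg = schema | data


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Where was ex:Mozart born?
born_in = match(kg, s="ex:Mozart", p="ex:bornIn")
```

Because every statement has the same shape, arbitrary entities can be related simply by adding another triple.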

• Knowledge graphs are (generally) created collaboratively by many
users
• Information can be added in a relatively arbitrary manner as
structural constraints are few
Closed KGs (~2019) [Noy et al., 2019]
Microsoft ~2bn entities, ~55bn facts
Google ~1bn entities, ~70bn assertions
Facebook ~50m entities, ~500m assertions
eBay ~1bn triples
IBM ~100m entities, 5bn relationships
Open KGs (April 2021)
DBpedia ~4.58m entities, ~9.25GB
Yago4 ~50m entities, ~18.4GB
Wikidata ~93m entities, ~99GB

Graphs
• Natural way of structuring and presenting knowledge
• Heterogeneous: knowledge from different sources can be integrated and/or interlinked
• Schema-later: schema often not decided until later, and does not impose integrity constraints

Schema in KGs
Ontologies as schemas in KGs
An ontology is an “explicit specification of a conceptualization consisting of a set of
objects, and the describable relationships among them”
[Gruber, 1993]
Components of an Ontology
• Classes: abstract groups (sets) of objects that are defined by properties that all their
members share (e.g., Person, Organisation, Event)
• Attributes: characteristics or parameters that objects (and classes) can have (e.g.,
date of birth, longitude, latitude, timestamp)
• Relationships: ways in which classes and individuals can be related to one another
(e.g., role, attributed to, observed by)
• Individuals: Concrete objects that are inherent to the domain of discourse, such as
specific people, organisations or abstract individuals such as numbers (e.g., g, π)
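
The four components above can be sketched with plain Python dataclasses. This is only an illustration under stated assumptions: the names `memberOf` and `alice` and the helper `undeclared_attributes` are hypothetical, and real ontologies are normally written in languages such as RDFS or OWL rather than in code.

```python
# Sketch of the ontology components: classes, attributes, relationships, individuals.
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    name: str
    attributes: set = field(default_factory=set)  # attributes members may have


@dataclass
class Relationship:
    name: str
    domain: str  # class of the subject
    range: str   # class of the object


@dataclass
class Individual:
    name: str
    class_name: str
    values: dict = field(default_factory=dict)  # attribute name -> value


person = OntologyClass("Person", {"dateOfBirth"})
organisation = OntologyClass("Organisation")
member_of = Relationship("memberOf", domain="Person", range="Organisation")  # hypothetical
alice = Individual("alice", "Person", {"dateOfBirth": "1990-01-01", "longitude": 149.1})

classes = {c.name: c for c in (person, organisation)}


def undeclared_attributes(individual, classes):
    """Attributes an individual uses that its class does not declare."""
    declared = classes[individual.class_name].attributes
    return sorted(set(individual.values) - declared)


extra = undeclared_attributes(alice, classes)
```

Here `extra` lists `longitude`, an attribute used on the individual that the Person class never declared.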

[Figure: KG modelling detail. Dimensions shown: Data vs. Schema; Generic (applies to many) vs. Specific (applies to few); Limited (many entities) vs. Comprehensive (fewer entities). Labels shown: Barack Obama, Q76 (3,947 axioms); Armin Haller (189 axioms); Q58043963; Entity, Q35120; partOf, P361; Person; Chess; Q73145133; minimum no of players, P1872.]

Types of Schemas (Ontologies)
Level of abstraction: most general (top) to most specific (bottom)
Reusability: highest (top) to lowest (bottom)
• Upper Ontologies, e.g., Cyc, SUMO, DOLCE, BFO
• Mid-Level Ontologies, e.g., PROV-O, FOAF, ORG, SOSA/SSN, AGRIF
• Domain Ontologies, e.g., GO, ChEBI, DO, BTO
• Use-Case Ontologies
[Haller & Polleres, 2020a]

KG Engineering
KG Creation
Extract data
from existing
resources
KG Usage
KG Linking
Add instance
assertions
KG Curation
Add schema
assertions

KG Creation
• Top-Down: schema first, data later
• Bottom-Up: data first, schema later
• Middle-Out: between the two, working from both data and schema

KG Creation (cont’d)
Bottom-Up KG Creation
• Schema is not defined, and data is added organically and manually using tools such as:
– OntoWiki [Frischmuth et al., 2015]
– Semantic MediaWiki [Krötzsch et al., 2006]
– Wikibase
– Schímatos [Wright et al., 2020]
Top-Down KG Creation
• Schema is created upfront, existing data mapped to schema using languages/tools such as:
– R2RML
– SPARQL Generate [Lefrançois et al., 2017]
– SHACL Rules
– TARQL
– Metadata Extractor & Loader (MEL) [Méndez et al., 2021]
– JSON to RDF Mappings (J2RM) [Méndez et al., 2020]
Middle-Out KG Creation [Sure et al., 2004]
• Schema is partly defined upfront based on use cases, with mappings added later when data
defines semantics
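
To make the top-down case concrete, here is a minimal sketch of mapping existing tabular data onto a schema that was fixed upfront. It only mimics the idea behind mapping languages such as R2RML or TARQL and does not use either of them; the column names, the base URI and the sample rows are assumptions made for the example.

```python
# Sketch of top-down KG creation: the schema exists first, rows are mapped onto it.
import csv
import io

# Assumed mapping from source columns to schema predicates.
MAPPING = {"name": "foaf:name", "employer": "org:memberOf"}
TARGET_CLASS = "foaf:Person"
BASE = "http://example.org/person/"  # hypothetical base URI

# Hypothetical source data.
CSV_TEXT = "id,name,employer\n1,Ada,http://example.org/org/acme\n2,Bob,\n"


def rows_to_triples(csv_text):
    """Turn each CSV row into triples that follow the predefined schema."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + row["id"]
        triples.append((subject, "rdf:type", TARGET_CLASS))
        for column, predicate in MAPPING.items():
            value = row.get(column)
            if value:  # skip empty cells instead of asserting empty values
                triples.append((subject, predicate, value))
    return triples


triples = rows_to_triples(CSV_TEXT)
```

In a bottom-up setting the `MAPPING` dictionary would not exist in advance; predicates would instead emerge as contributors add statements.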

Collaboratively building KGs
• The biggest KGs on the Web are built collaboratively, bottom-up:
– Schema.org Ontology and KG
• Over 10 million sites use Schema.org to mark up their web pages and email messages
– Wikidata Ontology and KG
• Wikipedia for Data, 149GB
Comparison of schema.org and Wikidata
Availability
– schema.org: ontology highly available; data availability depending on publisher
– Wikidata: ontology highly available; data highly available
Discoverability
– schema.org: ontology easy; instances very difficult
– Wikidata: ontology relatively difficult; instances very easy
Completeness & Adaptability
– schema.org: domain specific (E-Commerce); community extensions available
– Wikidata: (all of) human knowledge
Maintenance & Versioning
– schema.org: continuous curation; versions are not made explicit
– Wikidata: continuous curation; explicit entity versions + version history
Modularization
– schema.org: fully distributed, easily accessible ontology; fully distributed, difficult to access data
– Wikidata: fully distributed, relatively difficult to access ontology; fully distributed, easy to access data
Quality
– schema.org: high quality ontology; low quality data
– Wikidata: high quality ontology; high quality data

Meta-modelling issues
Without enforced (upfront designed) schemas, KGs suffer from, e.g.:
• Inconsistent modelling of classes/instances
<Q1412680> <P279> <Q28100368> | <Beef Wellington> <subclass of> <Beef Dish>
<Q6497852> <P31> <Q28100665> | <Wiener Schnitzel> <instance of> <Veal Dish>
• Subclassing of disjoint super-classes
<Q190928> <P279> <Q124282> | <shipyard> <subclass of> <dock>
<Q190928> <P279> <Q4830453> | <shipyard> <subclass of> <business>
<Q124282> <P279> <Q7184903> | <shipyard> <subclass of> <abstract object>
<Q190928> <P279> <Q223557> | <shipyard> <subclass of> <physical object>
• Instance of relations between first-order classes
<Q12156> <P31> <Q12136> | <Malaria> <instance of> <Disease>
<Q12156> <P279> <Q12136> | <Malaria> <subclass of> <Disease>
• Redundant/circular inheritances between first-order classes
<Q18557307> <P279> <Q692536> | <muscle tissue disease> <subclass of> <muscular disease>
<Q692536> <P279> <Q18557307> | <muscular disease> <subclass of> <muscle tissue disease>
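
Two of the issues above can be detected mechanically. The sketch below runs over an in-memory list of triples copied from the examples on this slide and looks for (a) an entity that is both an instance and a subclass of the same class and (b) circular subclass chains. The function names are made up for this example, and a real check would run against a Wikidata dump or query service instead.

```python
# Sketch: detect two meta-modelling issues in a small list of triples.
from collections import defaultdict

P31, P279 = "P31", "P279"  # Wikidata: instance of, subclass of

triples = [
    ("Q12156", P31, "Q12136"),        # Malaria instance of Disease
    ("Q12156", P279, "Q12136"),       # Malaria subclass of Disease
    ("Q18557307", P279, "Q692536"),   # muscle tissue disease subclass of muscular disease
    ("Q692536", P279, "Q18557307"),   # muscular disease subclass of muscle tissue disease
    ("Q1412680", P279, "Q28100368"),  # Beef Wellington subclass of Beef Dish
]


def instance_and_subclass(triples):
    """Pairs (x, c) where x is both an instance and a subclass of c."""
    inst = {(s, o) for s, p, o in triples if p == P31}
    sub = {(s, o) for s, p, o in triples if p == P279}
    return sorted(inst & sub)


def subclass_cycles(triples):
    """Classes that can reach themselves by following subclass-of edges."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == P279:
            parents[s].add(o)

    def reaches_self(start):
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return sorted(c for c in list(parents) if reaches_self(c))


both = instance_and_subclass(triples)
cycles = subclass_cycles(triples)
```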

KG Curation
Correctness
– Evaluation
Accessibility, Accuracy, Consistency, Conciseness, Trustability,
Dynamicity, Representationality [Zaveri et al., 2016]
– Correction
Evaluating data quality (SHACL, ShEx)
• Syntactic errors
• Semantic errors
Completeness
– KG Completion [Paulheim, 2017]
Using structural information observed in triples
• Classification
• Probabilistic and Statistical Methods
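
The kind of check that SHACL or ShEx expresses declaratively can be approximated in a few lines. The sketch below is not SHACL and uses no SHACL library; it only imitates a cardinality constraint (comparable to a minimum and maximum count on a property) over hypothetical `ex:` data.

```python
# Sketch: a cardinality check in the spirit of a shape constraint.
# The data and the function name are hypothetical.

triples = [
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:name", '"Ada"'),
    ("ex:b", "rdf:type", "foaf:Person"),           # no name at all
    ("ex:c", "rdf:type", "foaf:Person"),
    ("ex:c", "foaf:name", '"Carl"'),
    ("ex:c", "foaf:name", '"Karl"'),               # two names
]


def check_cardinality(triples, target_class, predicate, min_count=1, max_count=1):
    """Return {subject: count} for instances of target_class violating the bounds."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    violations = {}
    for subject in sorted(instances):
        count = sum(1 for s, p, o in triples if s == subject and p == predicate)
        if not (min_count <= count <= max_count):
            violations[subject] = count
    return violations


violations = check_cardinality(triples, "foaf:Person", "foaf:name")
```

Each entry in `violations` names an entity and how many values it actually has, which is the information a curator needs to correct the data.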

KG Linking
Internal vs. External links [Haller et al., 2020b]
– internal links: links between parts of one coherent KG, i.e., edges linking
nodes within the graph
• Link prediction techniques are used to learn those new links
– external links: links between different KGs, i.e., edges between nodes from
different graphs, or reusing edges from a different graph to link nodes in one KG
Linking Issues [Haller et al., 2020b]
• References to many inaccessible URIs (i.e., broken links) may render
a KG largely useless
• Changes in linked external KGs are out of control of the KG publisher

KG Linking
• Ontology links [Haller et al., 2020b]
– class link
t:[dbo:Person, rdfs:subClassOf, foaf:Person]
– instance typing link
t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, foaf:Person]
– property link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, "Wolfgang Amadeus Mozart"@en]
– instance role link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:knows, wd:Q51088]
(Antonio Salieri)
• Instance link
t:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
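
The link types above can be told apart with a simple rule-based classifier. This is a simplified reading of the five example triples on this slide, not the formal definitions from [Haller et al., 2020b]. It assumes DBpedia (`dbr:`, `dbo:`) is the local KG, treats `rdf:`, `rdfs:` and `owl:` as neutral vocabulary, and recognises literals by a leading double quote.

```python
# Sketch: classify a triple into one of the link types listed above.
LOCAL_PREFIXES = ("dbr:", "dbo:")          # assumption: DBpedia is the local KG
NEUTRAL_PREFIXES = ("rdf:", "rdfs:", "owl:")


def is_literal(term):
    return term.startswith('"')


def is_external(term):
    """True for names from another KG or ontology (not local, not rdf/rdfs/owl)."""
    return not is_literal(term) and not term.startswith(LOCAL_PREFIXES + NEUTRAL_PREFIXES)


def classify_link(s, p, o):
    if is_literal(o):
        return "property link" if is_external(p) else None
    if p == "rdfs:subClassOf" and is_external(o):
        return "class link"
    if p == "rdf:type" and is_external(o):
        return "instance typing link"
    if p == "owl:sameAs" and is_external(o):
        return "instance link"
    if is_external(p):
        return "instance role link"
    return None  # purely internal triple


mozart = "dbr:Wolfgang_Amadeus_Mozart"
examples = [
    ("dbo:Person", "rdfs:subClassOf", "foaf:Person"),
    (mozart, "rdf:type", "foaf:Person"),
    (mozart, "foaf:name", '"Wolfgang Amadeus Mozart"@en'),
    (mozart, "foaf:knows", "wd:Q51088"),
    (mozart, "owl:sameAs", "wd:Q254"),
]
labels = [classify_link(*t) for t in examples]
```

Counting the labels over a whole KG gives the kind of per-class and per-entity link statistics reported for Wikidata later in the deck.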

KG Linking in Wikidata
• Wikidata is by far the largest openly available KG, truly built bottom-up
in both schema (ontology) and data
• Wikidata dump (in HDT) from 3rd of March 2021, 53GB (149GB
uncompressed).
General Statistics
# Triples (Facts) 1,693,668,039
# Subjects 1,625,057,179
# Predicates (edges) 38,867
# Unique objects 2,538,585,808
# Unique entities 89,120,227
# Unique Classes 2,522,595
# Unique Properties 74,309
Links
# Class Links 3,955 (0.001 per class)
# Property Links 835 (0.01 per property)
# Instance Typing Links 0
# Instance Links 173,177,045 (1.94 per entity)
• Exact Match (P2888) 3,268,021
• Said to be the Same (P460) 2
• Inverse Property (P1696) 0

KG Linking in Wikidata
(cont’d)
• The Wikidata ontology includes links to other ontologies,
but relatively few class and property links
compared to other open KGs on the Web
• Wikidata defines an extensive ontology (schema)
that is used to define entities within its KG
• Wikidata links to other KGs, but uses relatively
fewer instance links than other KGs on the Web
– Does not (yet) include many similarity relations even
though it should not be the authoritative source for many
of its entities

KG Usage
• Knowledge Management, Knowledge
Discovery
• Training of ML models with KGs
• Conversational Agents
– Q&A
– Personal Assistants
– Chatbots
• Open Data

Conclusions
• A stronger focus on KG contributors and end users is needed
– Tools/methods needed for creating/maintaining KGs
– Tools/methods needed to support querying/analysing KG Schemas
• KGs need to be more strongly interlinked, e.g., link prediction
techniques need to be deployed between KGs rather than just
on a single KG
• Improved NLP/NER-based learning techniques (distant supervision) are needed
that build s-p-o relations from unstructured text [Mintz et al., 2009]
• Permanent Distributed querying/replication of data/schema
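
The distant supervision idea cited above can be sketched as follows: facts already in a KG are used to label sentences automatically, and those labelled sentences then serve as training data for a relation extractor. The sketch covers only the labelling step, uses plain substring matching instead of real NER, and the relation name `bornIn` is a made-up example.

```python
# Sketch of the labelling step in distant supervision.
# Known KG facts as (subject label, relation, object label).
facts = [("Wolfgang Amadeus Mozart", "bornIn", "Salzburg")]

sentences = [
    "Wolfgang Amadeus Mozart was born in Salzburg in 1756.",
    "Salzburg is a city in Austria.",
    "Wolfgang Amadeus Mozart left Salzburg for Vienna.",
]


def label_sentences(facts, sentences):
    """Label every sentence that mentions both entities of a known fact."""
    examples = []
    for sentence in sentences:
        for s, p, o in facts:
            if s in sentence and o in sentence:
                examples.append((sentence, s, p, o))
    return examples


training_examples = label_sentences(facts, sentences)
```

The third sentence is labelled `bornIn` even though it does not express that relation, which shows why such automatically generated labels are noisy and why improved techniques are called for.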

References
• Hogan, A., et al.: Knowledge Graphs. ACM Computing Surveys (to appear), 2021.
• Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2), 2019.
• Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2):199-220, 1993.
• Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S.: OntoWiki – An Authoring, Publication and Visualization Interface for the Data Web. Semantic Web 6(3): 215-240, 2015.
• Krötzsch, M., Vrandečić, D., Völkel, M.: Semantic MediaWiki. The Semantic Web – ISWC 2006.
• Wright, J., Méndez, S. J. R., Haller, A., Taylor, K., Omran, P. G.: Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing. The Semantic Web – ISWC 2020.
• Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL Extension for Generating RDF from Heterogeneous Formats. ESWC (1), 2017.
• Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7 (1), 63-93, 2016.
• Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3): 489-508, 2017.
• Berners-Lee, T.: Linked Data. W3C Design Issues. URL: http://www.w3.org/DesignIssues/LinkedData.html, 2006.
• Haller, A., Polleres, A.: Are we better off with just one ontology on the Web? Semantic Web 11(1): 87-99, 2020a.
• Sure, Y., Staab, S., Studer, R.: On-To-Knowledge Methodology (OTKM). Handbook on Ontologies, pp. 117-132, 2004.
• Haller, A., Fernández, J. D., Kamdar, M. R., Polleres, A.: What Are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web. ACM J. Data Inf. Qual. 12(2): 9:1-9:34, 2020b.
• Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A., Cyganiak, R.: Linking open data cloud diagram. URL: http://lod-cloud.net. Insight-Centre. 2017.
• Méndez, S. J. R., Haller, A., Omran, P.G., Wright, J., Taylor, K.: J2RM: An ontology-based JSON-to-RDF Mapping tool. ISWC (Demos/Industry) 2020.
• Méndez, S. J. R., Haller, A., Omran, P.G., Taylor, K.: MEL: Metadata Extractor & Loader. ISWC (Posters/Demos/Industry) 2021.
• Omran, P. G., Taylor, K., Méndez, S. J. R., Haller, A.: Towards SHACL Learning from Knowledge Graphs. ISWC (Demos/Industry) 2020.
• Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09), 2009.
