Knowledge Graphs
Dieter Fensel
with the help of Umutcan Şimşek, Kevin Angele, Elwin Huaman, Elias Kärle,
Oleksandra Panasiuk, Ioan Toma, Jürgen Umbrich, and Alexander Wahler
STI Innsbruck, University of Innsbruck, Austria
June 29, 2020
Knowledge Graphs
1. Motivation
2. KG Methodology
3. Knowledge Generation
4. Knowledge Hosting,
5. Knowledge Curation (assessment, cleaning, and enrichment)
6. Knowledge Deployment
7. The Proof Of The Pudding Is In The Eating
More info
• Dieter Fensel, Umutcan Şimşek, Kevin Angele, Elwin
Huaman, Elias Kärle, Oleksandra Panasiuk, Ioan Toma,
Jürgen Umbrich, and Alexander Wahler: Knowledge
Graphs - Methodology, Tools, and Selected Use Cases,
Springer, 2020.
• MindLab project: mindlab.ai
• https://www.slideshare.net/STI-Innsbruck/building-a-knowledge-
graph-from-schemaorg-annotations-236256670
• https://www.slideshare.net/STI-Innsbruck/how-to-build-a-
knowledge-graph-236256713
1. Motivation
Evolving Technologies for eMarketing and eCommerce:
• The Web → Search
• Semantic Web → Query Answering
• Knowledge Graph → Goal and Service Oriented Dialogue
1. Motivation
• The quality of Intelligent Assistants depends directly on the quality of the
Knowledge Graph
• Problem: “Garbage in, garbage out”
• Requirements for the Knowledge Graph:
• well structured (using an ontology: schema.org)
• accurate information (correctness)
• large and detailed coverage (completeness)
• timeliness of knowledge
==> Method- and tool-supported Knowledge Graph lifecycle
3. Knowledge Generation
• We define domain-specific extensions (that also restrict the genericity of the full schema.org vocabulary).
• Domain Specifications restrict the genericity and extend the domain specificity of schema.org.
• They are based on SHACL.
• https://schema-tourism.sti2.org/
• We use value restrictions not as an inference mechanism but as integrity constraints (see the sketch below).
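As an illustration of using a Domain Specification as an integrity constraint rather than for inference, the sketch below checks example data against a hypothetical hotel shape with pySHACL; the shape and data are made up for this example, and the real Domain Specifications at https://schema-tourism.sti2.org/ are far richer.

```python
# Minimal sketch: a Domain-Specification-style SHACL shape used as an
# integrity constraint (nothing is inferred). The shape and data below are
# illustrative, not the official STI Domain Specifications.
from rdflib import Graph
from pyshacl import validate

SHAPE = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <https://example.org/ds#> .

ex:HotelDS a sh:NodeShape ;
    sh:targetClass schema:Hotel ;
    # restrict genericity: every Hotel annotation must carry a name ...
    sh:property [ sh:path schema:name ; sh:minCount 1 ; sh:datatype xsd:string ] ;
    # ... and an address, if given, must be a schema:PostalAddress
    sh:property [ sh:path schema:address ; sh:class schema:PostalAddress ] .
"""

DATA = """
@prefix schema: <http://schema.org/> .
@prefix ex:     <https://example.org/data#> .

ex:hotel1 a schema:Hotel ;
    schema:address "Technikerstrasse 21a" .   # a literal: violates sh:class
"""

data_graph = Graph().parse(data=DATA, format="turtle")
shape_graph = Graph().parse(data=SHAPE, format="turtle")

conforms, _, report = validate(data_graph, shacl_graph=shape_graph)
print(conforms)   # False: the constraints are violated, nothing is "repaired"
print(report)
```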
3. Knowledge Generation
Vertical extensions of schema.org: The DACH-KG working group
• develops a de facto standard for the semantic annotation of touristic content, data, and services in the DACH area
• based on schema.org and its adaptation by domain specifications
• it should become the backbone of an open 5* Knowledge Graph for touristic data in the DACH area
*) The data set gets awarded one star if the data are provided under an open license.
**) Two stars, if the data are available as structured data.
***) Three stars, if the data are also available in a non-proprietary format.
****) Four stars, if URIs are used so that the data can be referenced, and
*****) five stars, if the data set is linked to other data sets that can provide context.
• It should go online in 2021.
https://www.tourismuszukunft.de/2019/05/dach-kg-neue-ergebnisse-naechste-schritte-beim-thema-open-data/
3. Knowledge Generation
Our Methodology:
• the bottom-up part,
which describes the steps of
the initial annotation process;
• the domain specification
modeling; and
• the top-down part, which
applies the constructed
models.
3. Knowledge Generation
Semantify.it¹:
A platform for creating, hosting, validating, verifying, and publishing schema.org annotated data
• annotation of static data based on schema.org templates / Domain Specifications²
• annotation of different schemata and dynamic data based on RML³ mappings with RocketRML⁴
1 https://semantify.it
2 http://ds.sti2.org
3 https://rml.io
4 https://github.com/semantifyit/RocketRML
3. Knowledge Generation
• Semi-automatic
• Annotation Editor suggests mappings/extracted information
• e.g. extract information from web pages (by HTML tags).
• Use partial NLU to find similarities between the content and the schema.org vocabulary (see the sketch below).
• Manual adaptations are needed to define and to evaluate the mappings.
• This is an instance of the general problem of wrapper generation.
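A minimal sketch of the suggestion step under simplified assumptions: plain string similarity stands in for the partial NLU mentioned above, and the schema.org property list is a tiny illustrative subset.

```python
# Sketch: suggest schema.org properties for labels extracted from a web page.
# String similarity stands in for the partial NLU step; the property list is
# a small illustrative subset of schema.org.
from difflib import SequenceMatcher

SCHEMA_ORG_PROPERTIES = ["name", "address", "telephone", "openingHours",
                         "priceRange", "checkinTime", "checkoutTime"]

def suggest(label: str, top_k: int = 3):
    """Rank candidate schema.org properties for one extracted label."""
    scored = [(SequenceMatcher(None, label.lower(), p.lower()).ratio(), p)
              for p in SCHEMA_ORG_PROPERTIES]
    return sorted(scored, reverse=True)[:top_k]

# Labels as they might be extracted from the HTML tags of a hotel page;
# a human annotator confirms or corrects the suggestions in the editor.
for label in ["Hotel Name", "Phone", "Check-in"]:
    print(label, "->", suggest(label))
```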
3. Knowledge Generation
• Mapping (more than 95% of the story)
• integrate large and fast-changing data sets
• map different formats to the ontology used in our Knowledge Graph
• Various frameworks exist: XLWrap, Mapping Master (M2), a generic XML-to-RDF tool
providing a mapping document (an XML document) that links an XML
Schema and an OWL ontology, Tripliser, GRDDL, R2RML, RML, ...
• We developed an efficient mapping engine for the RDF Mapping
Language RML, called RocketRML. It is a rule-based engine that
efficiently processes RML mappings and creates RDF data.
• The semantify.it platform features a wrapper API where these
mappings can be stored and applied to corresponding data.
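In the pipeline above such mappings are written declaratively in RML and executed by RocketRML; the hand-coded Python sketch below only illustrates what a mapping accomplishes for a hypothetical source record.

```python
# Sketch of what a mapping does: lift a source-specific record into
# schema.org terms. In the pipeline described above this is expressed
# declaratively in RML and executed by RocketRML rather than hand-coded.
import json

def map_hotel_record(src: dict) -> dict:
    """Map a hypothetical feed record to a schema.org Hotel annotation."""
    return {
        "@context": "https://schema.org/",
        "@type": "Hotel",
        "name": src["title"],
        "telephone": src.get("phone"),
        "address": {
            "@type": "PostalAddress",
            "addressLocality": src.get("city"),
            "postalCode": src.get("zip"),
        },
    }

record = {"title": "Hotel Example", "phone": "+43 512 0000",
          "city": "Innsbruck", "zip": "6020"}
print(json.dumps(map_hotel_record(record), indent=2))
```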
RML [Dimou et al., 2014]:
● Easier to learn RML than a programming language
● Easy sharing
● Mapping can be visualized
● Mapping files can be faster to write than code
● Easily change mappings
● RocketRML precompiles joins to improve performance
by several orders of magnitude.
(Tooling: RML, YARRRML, and the Matey editor)
3. Knowledge Generation
Automatic extraction of knowledge from text representations and web pages
• Tasks
• named entity recognition,
• concept mining, text mining,
• relation detection, …
• Methods
• Information Extraction
• Natural Language Processing (NLP)
• Machine Learning (ML)
• Systems:
• GATE (text analysis & language processing)
• OpenNLP (supports most common NLP tasks)
• RapidMiner (data preparation, machine learning, deep learning, text mining, predictive analytics)
• Ontotext / Sirma
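As a small, hands-on illustration of the named entity recognition task, the sketch below uses spaCy (not one of the systems listed above) and assumes its small English model is installed.

```python
# Sketch: named entity recognition as one automatic extraction task.
# spaCy is used purely for illustration (the slide lists GATE, OpenNLP,
# RapidMiner); assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Hotel Example in Innsbruck was founded in 1982 by Maria Muster.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Innsbruck GPE, 1982 DATE, Maria Muster PERSON

# The recognized entities are candidates for typed instances in the
# Knowledge Graph; relation detection would link them afterwards.
```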
3. Knowledge Generation
Evaluation of semantic annotations:
• The semantify.it validator is a web tool that offers the possibility to
validate schema.org annotations that are scraped from websites.
• Verification: the annotations are checked against plain schema.org
and against domain specifications.
• Validation: the annotations are checked for whether they accurately
describe the content of the website.
• https://semantif.it/evaluate
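A rough sketch of the scraping step that precedes verification and validation, assuming the page embeds its annotations as JSON-LD script tags; the actual semantify.it validator does considerably more than this.

```python
# Sketch: scrape embedded schema.org annotations from a web page, as the
# first step before verification (against schema.org / a domain
# specification) and validation (against the page content).
import json
import re
import requests

JSONLD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE)

def scrape_annotations(url: str):
    html = requests.get(url, timeout=10).text
    blocks = []
    for raw in JSONLD_RE.findall(html):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            pass  # malformed block: itself a finding for the validator
    return blocks

for ann in scrape_annotations("https://www.example.org/"):
    if isinstance(ann, dict):
        print(ann.get("@type"), "->", sorted(ann.keys()))
```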
3. Knowledge Generation
• Annotation of dynamic and active data with WASA (earlier called
WSMO).
• Dynamic: Actions to obtain dynamic data (e.g. weather forecast)
• Active: Actions that can be taken on entities in a Knowledge Graph (e.g. a
room offer of a hotel can have a BuyAction attached to it)
• An action is an instance of schema.org/Action type.
• Describe the invocation mechanism (e.g. endpoint, HTTP method, encoding
type).
• Describe input and output parameters with SHACL (another implementation
of domain specifications).
• Grounding and lifting for existing Web APIs.
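A minimal sketch of an active-data annotation along the schema.org potentialAction pattern: a BuyAction attached to an offer, with an explicit invocation mechanism. The endpoint is hypothetical and the SHACL-based parameter description of WASA is only hinted at.

```python
# Sketch: a BuyAction attached to a room offer, with an explicit invocation
# mechanism (schema.org EntryPoint). The API endpoint is hypothetical and
# the SHACL-described input/output parameters of WASA are only hinted at.
import json

offer_with_action = {
    "@context": "https://schema.org/",
    "@type": "Offer",
    "name": "Double room, 2 nights",
    "potentialAction": {
        "@type": "BuyAction",
        "target": {                      # how to invoke the action
            "@type": "EntryPoint",
            "urlTemplate": "https://api.example-hotel.org/book{?roomId,from,to}",
            "httpMethod": "POST",
            "encodingType": "application/ld+json",
        },
        # placeholder for the SHACL-described input parameters
        "object-input": "required roomId, from, to",
    },
}
print(json.dumps(offer_with_action, indent=2))
```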
4. Knowledge Hosting
• Semantically annotated data can be serialized to JSON-LD
• storage in document store MongoDB
• native JSON storage
• well integrated with current state-of-the-art software such as NodeJS
• performant search through indexing
• allows efficient publication of annotations on webpages
• not hardware intensive
• but: no native RDF querying with SPARQL
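A minimal sketch of the document-store option, assuming a local MongoDB instance; database, collection, and index names are illustrative.

```python
# Sketch: hosting JSON-LD annotations in MongoDB (the document-store option).
# Assumes a local MongoDB instance; database/collection names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
annotations = client["semantify"]["annotations"]

doc = {
    "@context": "https://schema.org/",
    "@type": "Hotel",
    "name": "Hotel Example",
    "address": {"@type": "PostalAddress", "addressLocality": "Innsbruck"},
}
annotations.insert_one(doc)

# Indexing gives performant search, but there is no native SPARQL here.
annotations.create_index([("@type", 1), ("name", 1)])
print(annotations.find_one({"@type": "Hotel"}, {"_id": 0, "name": 1}))
```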
4. Knowledge Hosting
• Native storage of semantically annotated data
• RDF store: GraphDB
• very powerful CRUD operations
• named graphs for versioning
• full implementation of SPARQL
• powerful reasoning over big data sets
• but: no web frameworks available
• very hardware intensive
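A minimal sketch of the RDF-store option, querying a GraphDB repository with SPARQL; the repository URL is a hypothetical local endpoint.

```python
# Sketch: querying the RDF-store option with SPARQL.
# The GraphDB repository URL below is a hypothetical local endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/knowledge-graph")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT ?hotel ?name WHERE {
        ?hotel a schema:Hotel ;
               schema:name ?name .
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["hotel"]["value"], row["name"]["value"])
```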
5. Knowledge Curation
• We defined a simple KR formalism formalizing the
essentials of schema.org (sketched below)
• TBox: isA statements over types, and domain and range definitions for properties
(used globally or locally)
• ABox: isElementOf(i,t) statements, property-value statements p(i1,i2), and
sameAs(i1,i2) statements
• This enables a formal definition of the knowledge curation task (assessment,
cleaning, and enrichment).
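A toy rendering of this formalism as plain data structures, with illustrative names; it only serves to make the TBox/ABox distinction and the statement kinds concrete.

```python
# Toy sketch of the simple KR formalism.
# TBox: isA statements plus (locally scoped) domain/range definitions.
# ABox: isElementOf, property-value, and sameAs statements.
TBOX = {
    "isA": {("Hotel", "LodgingBusiness"), ("LodgingBusiness", "Thing")},
    # property -> {domain type: allowed range types}, i.e. local ranges
    "properties": {"address": {"LodgingBusiness": {"PostalAddress", "Text"}}},
}

ABOX = {
    "isElementOf": {("hotel1", "Hotel"), ("addr1", "PostalAddress")},
    "propertyValues": {("address", "hotel1", "addr1")},
    "sameAs": {("hotel1", "dbpedia:Hotel_Example")},
}

def types_of(i):
    """All types of an instance, following isA transitively."""
    closed = {t for (x, t) in ABOX["isElementOf"] if x == i}
    changed = True
    while changed:
        changed = False
        for (sub, sup) in TBOX["isA"]:
            if sub in closed and sup not in closed:
                closed.add(sup)
                changed = True
    return closed

print(types_of("hotel1"))   # e.g. {'Hotel', 'LodgingBusiness', 'Thing'}
```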
5.1 Knowledge Assessment
• Knowledge Assessment describes and defines the process
of assessing the quality of a Knowledge Graph.
• The goal is to measure the usefulness of a Knowledge Graph.
• Evaluation
• Overall process to determine the quality of a
Knowledge Graph.
• Select quality dimensions, metrics, evaluation functions, and weights for
metrics and dimensions.
• Evaluate representative subsets accordingly.
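A minimal sketch of the evaluation step: choose dimensions and metrics, weight them, and aggregate into an overall score; all dimension names, weights, and scores below are made up for illustration.

```python
# Sketch: aggregate per-metric scores into dimension scores and an overall
# quality value via chosen weights. All names and numbers are illustrative.
dimensions = {
    # dimension -> (weight, {metric: (weight, score in [0, 1])})
    "correctness":  (0.5, {"syntactic validity": (0.4, 0.97),
                           "semantic accuracy":  (0.6, 0.88)}),
    "completeness": (0.3, {"property coverage":  (1.0, 0.74)}),
    "timeliness":   (0.2, {"freshness":          (1.0, 0.61)}),
}

overall = 0.0
for dim, (d_weight, metrics) in dimensions.items():
    dim_score = (sum(w * s for (w, s) in metrics.values())
                 / sum(w for (w, _) in metrics.values()))
    print(f"{dim}: {dim_score:.2f}")
    overall += d_weight * dim_score

print(f"overall quality: {overall:.2f}")   # computed on a representative subset
```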
5.2 Knowledge Cleaning
What | Verification | Validation
Semantic Annotations | check schema conformance and integrity constraints | compare with the web resource
Knowledge Graphs | check schema conformance and integrity constraints | compare with the "real" world
E. Huaman, E. Kärle, D. Fensel: Knowledge Graph Validation, Technical Report. https://arxiv.org/pdf/2005.01389.pdf
5.2 Knowledge Cleaning
Error correction of wrong instance assertions isElementOf(i,t):
• i is not a proper instance identifier: Delete assertion or correct i
• t is not an existing type name: Delete assertion or correct t
• The instance assertion is (semantically) wrong:
• Delete assertion or find a proper t
• and do NOT try to find a proper i (that would neither scale nor make sense)
5.2 Knowledge Cleaning
Error correction of wrong property value assertions p(i1,i2):
• p is not a proper property name: Delete assertion or correct p
• i1 is not a proper instance identifier: Delete assertion or correct i1
• i1 is not in any domain of p: Delete assertion or add an assertion
isElementOf(i1,t) where t is a domain of p.
• i2 is not a proper instance identifier: Delete assertion or correct i2
• i2 is not in the range of p for any domain of i1:
• Delete assertion, or
• add a proper isElementOf assertion for i1 that adds a domain for which i2 is an instance of the range of the property,
or
• add a proper isElementOf assertion for i2 that turns it into an instance of a range of the property applied to a domain
of p where i1 is an element.
• The property assertion is (semantically) wrong: Delete assertion or correct it; in this case, most likely by
defining a proper i2, or searching for a better p or a better i1 (a detection sketch follows below).
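A small sketch of the detection side of these cases, reusing the toy TBox/ABox structures sketched in the curation section above; which correction to apply (delete vs. repair) remains a separate decision.

```python
# Sketch: detect which of the cases above applies to a property-value
# assertion p(i1, i2), using the toy TBOX/ABOX structures and types_of()
# sketched earlier. The correction itself (delete vs. repair) is decided
# separately.
def check_property_assertion(p, i1, i2, tbox, abox, types_of):
    issues = []
    if p not in tbox["properties"]:
        issues.append("p is not a proper property name")
    known = {x for (x, _) in abox["isElementOf"]}
    if i1 not in known:
        issues.append("i1 is not a proper instance identifier")
    if i2 not in known:
        issues.append("i2 is not a proper instance identifier")
    if not issues:
        local = tbox["properties"][p]          # {domain type: range types}
        domains = set(local) & types_of(i1)
        if not domains:
            issues.append("i1 is not in any domain of p")
        elif not any(types_of(i2) & local[d] for d in domains):
            issues.append("i2 is not in the range of p for any domain of i1")
    return issues or ["structurally fine (may still be semantically wrong)"]

# e.g. check_property_assertion("address", "hotel1", "addr1", TBOX, ABOX, types_of)
```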
5.2 Knowledge Cleaning
Error correction of wrong equality assertions sameAs(i1,i2):
• i1 is not a proper instance identifier: Delete assertion or correct i1
• i2 is not a proper instance identifier: Delete assertion or correct i2
• The identity assertion is (semantically) wrong: Delete assertion or
replace it by a SKOS operator¹.
¹ which, however, does not come with operational semantics.
Knowledge Cleaning: System survey
• Verification:
• Quality Assessment Frameworks such as Luzzu (A Quality Assessment Framework for Linked Open Datasets) [Debattista
et al., 2016], Sieve (Linked Data Quality Assessment and Fusion) [Mendes et al., 2012], SWIQA (Semantic Web Information
Quality Assessment Framework) [Fürber & Hepp, 2011], and WIQA (Web Information Quality Assessment Framework)
[Bizer and Cyganiak, 2009].
• Approaches that check the conformance of RDF graphs against specifications: AllegroGraph, RDFUnit [Kontokostas
et al., 2014], SHACL (Shapes Constraint Language) and ShEx (Shape Expressions) [Gayo et al., 2017], Stardog ICV,
TopBraid, and Validata [Hansen et al., 2015].
• Tools that use statistical distributions to predict the types of instances (e.g., SDType [Paulheim & Bizer, 2013]) and to
detect erroneous relationships that connect two resources (e.g., HoloClean [Rekatsinas et al., 2017], SDValidate [Paulheim
& Bizer, 2014]).
• More approaches: KATARA [Chu et al., 2015], LOD Laundromat [Beek et al., 2014].
• Validation:
• Fact validation frameworks: COPAAL (Corroborative Fact Validation [Syed et al., 2019]), DeFacto (Deep Fact Validation
[Lehmann et al., 2012]), FactCheck [Syed et al., 2018], FacTify [Ercan et al., 2019], Leopard [Speck & Ngonga Ngomo,
2018], Surface [Padia et al., 2018], S3K [Metzger et al, 2011], and TISCO [Rula et al., 2019].
• More approaches based on measuring how accurate a statement is with respect to external knowledge sources [Elbassuoni et
al., 2010], [Jia et al., 2019], [Nakamura et al., 2007], [Shi & Weninger, 2016], [Shiralkar et al., 2017], [Wienand & Paulheim,
2014].
Knowledge Cleaning: Our approach
• VeriGraph: Verification framework for large Knowledge Graphs. It detects errors
by verifying a Knowledge Graph against a set of given SHACL constraints.
• Verification process: only the necessary subset of a KG is loaded into memory
per DS (i.e. a SHACL shape). The constraints are checked in memory, instead of
the one-SPARQL-query-per-constraint-component approach.
• Output: a validation report of the inconsistencies found (including a human-readable
path to the error).
• Status: evaluated over a Knowledge Graph of 1 billion triples; currently
being tested with the SHACL test cases.
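A rough sketch of the subset-per-shape idea (not VeriGraph itself): pull only the triples relevant to one domain specification from the store and check the SHACL constraints on that in-memory subgraph. The endpoint URL and shape file name are hypothetical.

```python
# Rough sketch of the subset-per-shape idea (not VeriGraph itself): load only
# the triples relevant to one domain specification into memory, then verify
# them with SHACL. Endpoint URL and shape file name are hypothetical.
from SPARQLWrapper import SPARQLWrapper, XML
from pyshacl import validate
from rdflib import Graph

ENDPOINT = "http://localhost:7200/repositories/knowledge-graph"

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(XML)
sparql.setQuery("""
    PREFIX schema: <http://schema.org/>
    CONSTRUCT { ?h ?p ?o } WHERE { ?h a schema:Hotel ; ?p ?o . }
""")
subset = sparql.query().convert()          # rdflib Graph with only the Hotel subset

shapes = Graph().parse("hotel_domain_specification.ttl", format="turtle")
conforms, _, report = validate(subset, shacl_graph=shapes)
print(conforms)
print(report)                              # human-readable path to each error
```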
5.3 Knowledge Enrichment
• The goal of knowledge enrichment is to improve the completeness of a
Knowledge Graph by adding new statements
• The process of Knowledge Enrichment has four phases:
• New Knowledge Source detection
• New Knowledge Source integration (URI normalization)
• Duplicate detection and alignment
• Property-Value-Statements correction
5.3 Knowledge Enrichment
• Knowledge Source detection: search for additional sources of assertions for the
Knowledge Graph
• Open sources
• Closed sources
• Knowledge Source integration
• TBox: define mappings
• ABox: integrate new assertions into the Knowledge Graph
• Identifying and resolving duplicates
• Invalid property statements such as domain/range violations and multiple
values for a unique property, a problem known in the data quality literature as
contradicting or uncertain attribute value resolution.
Knowledge Enrichment: System survey
• Duplicate Detection:
• Dedupe [Bilenko & Mooney, 2003]: A Python library that uses machine learning to find and link duplicates.
• DuDe [Draisbach & Naumann, 2010]: A Java framework that uses various similarity metrics to compare instances.
• Duke [Garshol & Borge, 2013]: Provides record linkage and deduplication methods, and a genetic algorithm feature to find a tuned
configuration for detecting duplicates.
• Legato [Achichi et al., 2017]: A record linkage tool that utilizes the Concise Bounded Description of resources for comparison.
• LIMES [Ngomo & Auer, 2011]: A link discovery approach that exploits properties of metric spaces (in particular the triangle inequality) to reduce the
number of comparisons between source and target datasets.
• SERIMI [Araújo et al., 2011]: A link discovery tool that utilizes string similarity functions on “label properties” without prior knowledge of the data
or schema.
• SILK [Volz et al., 2009]: A link discovery tool with declarative linkage rules applying different similarity metrics (e.g. string, taxonomic, set) that
also supports policies for the notification of datasets when one of them publishes new links to others.
• Conflict Resolution:
• FAGI [Giannopoulos et al., 2014] and the SLIPO Toolkit [Athanasiou et al., 2019] are frameworks that suggest fusion strategies for geospatial data
sources.
• KnoFuss [Nikolov et al., 2008]: A framework that allows the application of different methods on different attributes in the same dataset for the
identification of duplicates and resolves inconsistencies caused by the fusion of linked instances.
• ODCleanStore [Knap et al., 2012]: Allows users to configure conflict resolution policies based on functions (e.g. AVG, MAX).
• Sieve [Mendes et al., 2012]: Provides different fusion functions on selected property values.
Knowledge Enrichment: Our approach
• Enrichment Framework: Identifies duplicates in Knowledge Graphs and resolves
conflicting property values.
• Workflow:
• Input: a Knowledge Graph.
• Duplicate Detection Process: semi-automatic feature selection, data
normalization, setup (e.g. similarity metrics), run, and duplicate entities
viewer.
• Resolving Conflicting Property Values: define fusion strategies (e.g. deciding
what to do based on similarity values), run, and monitor the fusion process.
• Output: Report of duplicate entities found and fused.
• Work-in-progress.
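A minimal sketch of the two steps under simplified assumptions: plain string similarity as the metric and a "prefer the longer value" fusion strategy; data, threshold, and strategy are illustrative, while the framework itself is configurable and semi-automatic.

```python
# Sketch: detect duplicate entities by label similarity, then fuse
# conflicting property values with a simple strategy (keep the longer,
# presumably more informative value). Data and threshold are illustrative.
from difflib import SequenceMatcher

entities = [
    {"id": "kg:hotel1", "name": "Hotel Example Innsbruck", "telephone": "+43 512 0000"},
    {"id": "src:h-17",  "name": "Hotel Example, Innsbruck", "telephone": "+43 512 0000-0"},
]

def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def fuse(a, b):
    """Resolve conflicting property values: prefer the longer value."""
    merged = {"id": a["id"], "sameAs": b["id"]}
    for key in (set(a) | set(b)) - {"id"}:
        v1, v2 = a.get(key, ""), b.get(key, "")
        merged[key] = v1 if len(v1) >= len(v2) else v2
    return merged

duplicates = [(a, b) for i, a in enumerate(entities) for b in entities[i + 1:]
              if similar(a["name"], b["name"])]
for a, b in duplicates:
    print(fuse(a, b))
```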
6 Knowledge Deployment
• Building, implementing, and curating Knowledge Graphs is a time-
consuming and costly activity.
• Integrating large amounts of facts from heterogeneous information
sources does not come for free.
• [Paulheim, 2018b] estimates the average cost for one fact in a
Knowledge Graph at between $0.1 and $6, depending on the amount
of mechanization (for a graph with one billion facts, that is roughly $100 million to $6 billion).
[Paulheim, 2018b] H. Paulheim: How much is a Triple? Estimating the Cost of Knowledge Graph Creation. In Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, co-located with the 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8-12, 2018. http://www.heikopaulheim.com/docs/iswc_bluesky_cost2018.pdf
6 Knowledge Deployment
• We build a knowledge access layer on top of the Knowledge Graph that helps to connect this resource to
applications.
• Knowledge management technology:
• graph-based repositories host the Knowledge Graph (as a semantic data lake).
• The knowledge management layer is responsible for storing, managing, and providing semantic
descriptions of resources.
• Inference engines based on deductive reasoning:
• implement agents that define views on this graph together with context data on user requests.
• They access the graph to gain data for their reasoning, which provides input to the dialogue engine
interacting with the human user.
6 Knowledge Deployment
What are the reasons for this layer?
• Scalability issues (trillions of triples)
• Context refinement (to support different points of view)
• introduce rich constraints (Knowledge Cleaning)
• additional knowledge derivation (Knowledge Enrichment)
• Provide a reusable application layer / middleware on top of a Knowledge Graph (see the sketch below)
• access rights
• integration of additional information sources from the application:
• context,
• personalization,
• task, etc.
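A rough sketch of what the view extractor in the layer below could produce: a task-specific slice of the graph, filtered by request context and access rights. The endpoint, context parameters, and access rule are hypothetical.

```python
# Rough sketch: a "view" as a context-dependent slice of the Knowledge Graph.
# Endpoint, context parameters, and the access rule are hypothetical.
from SPARQLWrapper import SPARQLWrapper, XML

ENDPOINT = "http://localhost:7200/repositories/knowledge-graph"

def extract_view(city, include_actions):
    """Build and run the view-extraction query for one request context."""
    action = "?h schema:potentialAction ?a ." if include_actions else ""
    optional = "OPTIONAL { " + action + " }" if include_actions else ""
    query = """
        PREFIX schema: <http://schema.org/>
        CONSTRUCT { ?h a schema:Hotel ; schema:name ?n . %s }
        WHERE {
            ?h a schema:Hotel ;
               schema:name ?n ;
               schema:address/schema:addressLocality "%s" .
            %s
        }
    """ % (action, city, optional)
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(XML)
    sparql.setQuery(query)
    return sparql.query().convert()        # an rdflib Graph: the application's view

view = extract_view("Innsbruck", include_actions=True)   # e.g. a user who may book
print(len(view), "triples in the extracted view")
```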
6 Knowledge Deployment
(Architecture figure: an API exposes views produced by a view extractor, with local Knowledge Enrichment and local Knowledge Cleaning, on top of the Knowledge Graph; the views are defined by micro-TBox specifications (terminology, constraints, rules) and corresponding engines for view definition, view extraction, and cleaning/enrichment.)
7. The Proof Of The Pudding Is In The Eating
Knowledge Graphs are an enabling technology for:
• Virtual Agents (information search, eMarketing, and eCommerce)
• Cyber-physical Systems (Internet of Things, Smart Meters, etc.)
• Physical Agents (drones, cars, satellites, androids, etc.)
7. Virtual Agents
Onlim
• The pioneer in automating customer communication via AI chatbots and
conversational interfaces
• Enterprise solutions for making data and knowledge available for conversational
interfaces
• Team of 25+ highly experienced AI experts, specialists in semantics and data science
• Spin-off of University of Innsbruck
• HQ in Europe (Vienna, Telfs)
Current Focus Verticals: Utilities, Tourism, Retail, Education, Financial Services
7. Physical Agents: Failures of AI technology
• In May 2016 Joshua Brown was killed by his car because its autopilot
mixed up a very long car (large wheelbase) with a traffic sign.
• This is what the autopilot „saw“.
• Why did none of the 10,000++ engineers involved have the simple idea
of connecting the car to a Knowledge Graph containing traffic data
that simply knows that there is no traffic sign there?
7. Physical Agents: Failures of AI technology
• In March 2018 Elaine Herzberg was the first victim of a fully autonomously driving
car.
• Besides many software bugs by Uber (à la Boeing), a core issue was that the car
assumed that pedestrians cross streets only at crosswalks.
• Make assumptions explicit and confirm them with a Knowledge Graph.
• In this case she would still be alive!