1. Talking Knowledge Graphs
Dieter Fensel with the help of the entire MindLab team
STI Innsbruck, University of Innsbruck, Austria
May 17, 2019
2. Prerequisite
MindLab:
• MindLab is a self-funded cooperative research project with the objective of developing methods and software tools for modeling and implementing scalable knowledge graphs.
• Partners
3. Talking Knowledge Graphs
1. Motivation
2. The Grand Challenges
3. The Crux Of The Matter
4. The Proof Of The Pudding Is In The Eating
5. Key Takeaway
6. 2. The Grand Challenges
[Pipeline diagram: User → (1) understand → Intent + Parameters → (2) map to query → (3) query the Knowledge Graph → (4) Natural Language Generation]
7. 2. The Grand Challenges: Understand
NLU
• Voice/text recognition is already quite good
• However, it still requires significant manual labor
Manual work
• Design intents based on schema of Knowledge Graph
• Define utterances (example questions) per intent
• Mark parameters that should be extracted from utterances
Automation
• Entity detection: Push entities from Knowledge Graph
• Detect unanswered questions
• Use the Knowledge Graph to update/extend the NLU (see the sketch below):
• create utterances
• supervised learning: extend utterances with unanswered questions
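The automation bullets above can be made concrete with a small sketch. This is not MindLab's actual pipeline; the intent patterns and entity names are invented for illustration, and in practice the entity list would come from a SPARQL query over the Knowledge Graph.

```python
# Hypothetical sketch: generate NLU training utterances for an intent by
# filling hand-written example-question patterns with entity names pushed
# from the Knowledge Graph.

UTTERANCE_PATTERNS = [                      # per-intent question templates
    "When does {place} open?",
    "What are the opening hours of {place}?",
    "Is {place} open today?",
]

def generate_utterances(entities, patterns=UTTERANCE_PATTERNS):
    """Expand each pattern with every entity name from the Knowledge Graph."""
    return [p.format(place=e) for e in entities for p in patterns]

if __name__ == "__main__":
    # In practice these names would come from a SPARQL query over the KG.
    ski_resorts = ["Serfaus-Fiss-Ladis", "Mayrhofen", "Seefeld"]
    for utterance in generate_utterances(ski_resorts):
        print(utterance)
```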
8. 2. The Grand Challenges: Query Generation
• Basis: the intent detected and the parameters extracted during NLU
• Map the extracted information (intent & parameters) onto predefined rules
• Query: a combination of rules mapped onto SPARQL queries (see the sketch below)
• Additional restriction rules
• Define a view on a relevant subgraph of the Knowledge Graph:
a chatbot may not have access to the whole Knowledge Graph
(this prevents frillions of statements and inconsistencies, and implements access-right restrictions)
[Diagram: intent (with parameters) + predefined rules → query generation → generated query]
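A minimal sketch of this mapping step, under assumed names: the intent selects a predefined rule, i.e. a parameterized SPARQL template, and the extracted parameters fill its slots. The intent name and template are invented for illustration.

```python
# Hypothetical sketch: map a detected intent plus extracted parameters onto
# a predefined rule (a parameterized SPARQL template).

SPARQL_RULES = {
    "opening_hours": """
        PREFIX schema: <http://schema.org/>
        SELECT ?hours WHERE {{
            ?business a schema:LocalBusiness ;
                      schema:name "{name}" ;
                      schema:openingHours ?hours .
        }}
    """,
}

def build_query(intent: str, parameters: dict) -> str:
    """Combine the rule selected by the intent with the extracted parameters."""
    template = SPARQL_RULES[intent]        # additional restriction rules could
    return template.format(**parameters)   # be appended here

print(build_query("opening_hours", {"name": "Kölner Haus"}))
```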
9. 2. The Grand Challenges
Querying the Knowledge Graph
• Query is a combination of predefined rules accessing the knowledge through
SPARQL
• The Knowledge Graph must provide:
• large volumes of data
• integration of heterogeneous resources
• access to distributed sources
• dynamic updates (temperature, etc.)
• definition of subgraphs (see the view sketch below)
• curation with regard to inconsistencies and incompleteness
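One way to realize the subgraph requirement is a CONSTRUCT query that copies only the statements a chatbot is allowed to see. The sketch below runs over a toy in-memory graph with rdflib (an assumption; the real KG sits behind a SPARQL endpoint), and the instance data are invented.

```python
# Hypothetical sketch: materialize a view on a relevant subgraph so the
# chatbot never touches the full Knowledge Graph. Requires rdflib.
from rdflib import Graph, Literal, Namespace, RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/")          # invented instance namespace

kg = Graph()
kg.add((EX.gasthof, RDF.type, SCHEMA.Restaurant))
kg.add((EX.gasthof, SCHEMA.name, Literal("Gasthof Tirol")))
kg.add((EX.gasthof, SCHEMA.telephone, Literal("+43 512 000000")))  # kept out of the view

VIEW_QUERY = """
PREFIX schema: <http://schema.org/>
CONSTRUCT { ?r a schema:Restaurant ; schema:name ?name . }
WHERE     { ?r a schema:Restaurant ; schema:name ?name . }
"""

view = kg.query(VIEW_QUERY).graph              # the restricted subgraph
print(view.serialize(format="turtle"))
```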
10. 2. The Grand Challenges
Natural Language Generation
Manual work
• Define templates based on
• structure of data
• information that should be given to the user
Automatic
• Generate
• templates out of the Knowledge Graph
• textual answers from the Knowledge Graph
• follow-up questions to run dialogs (see the sketch below)
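A template-based NLG step can be sketched in a few lines; the answer templates and follow-up questions below are invented placeholders, whereas in the approach above they would be generated out of the Knowledge Graph.

```python
# Hypothetical sketch: verbalize a KG query result through per-intent answer
# templates and keep the dialog going with a follow-up question.

ANSWER_TEMPLATES = {"opening_hours": "{name} is open {hours}."}
FOLLOW_UPS = {"opening_hours": "Would you also like directions to {name}?"}

def verbalize(intent: str, result: dict) -> str:
    answer = ANSWER_TEMPLATES[intent].format(**result)
    follow_up = FOLLOW_UPS[intent].format(**result)
    return f"{answer} {follow_up}"

print(verbalize("opening_hours",
                {"name": "Kölner Haus", "hours": "Mo-Su 09:00-17:00"}))
```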
11. 3. The Crux Of The Matter
• The quality of an intelligent assistant depends directly on the quality of the Knowledge Graph
• Problem: “garbage in, garbage out”
• Requirements for the Knowledge Graph:
• well structured (using an ontology - schema.org)
• accurate information (correctness)
• large and detailed coverage (completeness)
• Timeliness of knowledge
==> Knowledge Graph Lifecycle
13. 3. The Crux Of The Matter: KG Task Model
[Task-model diagram:]
Knowledge Creation: Edit | Semi-automatic | Mapping | Automatic
Knowledge Graph Maintenance:
• Knowledge Hosting
• Knowledge Curation:
  • Knowledge Assessment: Evaluation | Correctness | Completeness
  • Knowledge Cleaning: Error Detection | Error Correction
  • Knowledge Enrichment: Knowledge Source detection | Knowledge Source integration | Duplicate detection | Property-Value-Statement correction
• Knowledge Deployment
14. 3. The Crux Of The Matter: KG Task Model
[Task-model diagram as above, with the tasks addressed in Year 1 marked]
MindLab Status Year 1
24. 3. The Crux Of The Matter: KG Task Model
[Task-model diagram as above, with the tasks targeted for Year 2 marked]
MindLab Status Year 2 (our dreams)
25. 3. The Crux Of The Matter
Knowledge Generation
[Process-overview diagram: Knowledge Creation (Edit | Semi-automatic | Mapping | Automatic) → Knowledge Hosting → Knowledge Curation → Knowledge Deployment, all under the Knowledge Graph]
26. 3. The Crux Of The Matter
Knowledge Generation
• https://www.schema.org/
• Started in 2011 by Bing, Google, Yahoo!, and Yandex to annotate websites.
• It has become a de facto standard.
• We use it for the website channel as well as for all other channels as a reference model for our semantic annotations (see the example below).
• However, we use value restrictions not as an inference mechanism but as integrity constraints.
• We define domain-specific extensions (that also restrict the genericity of schema.org).
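For illustration, a minimal schema.org annotation in JSON-LD of the kind used for the website channel; the values are invented, while the vocabulary terms (SkiResort, name, address) are plain schema.org.

```python
# Hypothetical sketch: a schema.org annotation serialized as JSON-LD.
import json

annotation = {
    "@context": "http://schema.org/",
    "@type": "SkiResort",
    "name": "Example Ski Resort",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Serfaus",
        "addressCountry": "AT",
    },
    "openingHours": "Mo-Su 08:30-16:30",
}
print(json.dumps(annotation, indent=2))
```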
27. 3. The Crux Of The Matter
Knowledge Generation
• The use of semantic annotations has experienced a tremendous surge in activity since the introduction of schema.org.
• Schema.org was introduced with 297 classes and 187 relations,
• which have since grown to 598 types, 862 properties, and 114 enumeration values.
• The provided corpus of
• types (e.g. LocalBusiness, SkiResort, Restaurant),
• properties (e.g. name, description, address),
• range restrictions (e.g. Text, URL, PostalAddress),
• and enumeration values (e.g. DayOfWeek, EventStatusType, ItemAvailability)
covers a large number of different domains, including the tourism domain.
28. 3. The Crux Of The Matter
Knowledge Generation
29. 3. The Crux Of The Matter
Knowledge Generation
• Domain Specifications:
• restrict the generality and
• extend the domain-specificity
of schema.org
• Based on SHACL (see the sketch below)
• https://schema-tourism.sti2.org/
[Diagram: schema.org + domain knowledge → Domain Specification]
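A minimal sketch of what a SHACL-based domain specification looks like, assuming the pyshacl and rdflib packages; the shape below (every SkiResort must have exactly one name) is our own toy example, not one of the actual ds.sti2.org specifications.

```python
# Hypothetical sketch: a domain specification as a SHACL node shape,
# validated with pyshacl against toy data.
from pyshacl import validate
from rdflib import Graph

SHAPE = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

ex:SkiResortSpec a sh:NodeShape ;
    sh:targetClass schema:SkiResort ;
    sh:property [ sh:path schema:name ; sh:minCount 1 ; sh:maxCount 1 ] .
"""

DATA = """
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

ex:resort1 a schema:SkiResort .   # violates the spec: no schema:name
"""

conforms, _, report = validate(
    Graph().parse(data=DATA, format="turtle"),
    shacl_graph=Graph().parse(data=SHAPE, format="turtle"))
print(conforms)   # False
print(report)
```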
30. 3. The Crux Of The Matter
Knowledge Generation
Our methodology consists of:
• the bottom-up part, which describes the steps of the initial annotation process;
• the domain-specification modeling; and
• the top-down part, which applies the constructed models.
31. 3. The Crux Of The Matter
Knowledge Generation
Manual Annotation Editor
32. 3. The Crux Of The Matter
Knowledge Generation
• Semi-automatic
• The Annotation Editor suggests mappings/extracted information,
• e.g. it extracts information from web pages (by HTML tags).
• Use partial NLU to find similarities between the content and the schema.org vocabulary.
• Manual adaptations are needed to define and to evaluate the mappings.
• An instance of the general problem of wrapper generation.
33. 3. The Crux Of The Matter
Knowledge Generation
• Mapping (more than 95% of the story)
• integrate large and fast-changing data sets
• map different formats to the ontology used in our Knowledge Graph
• Various frameworks: XLWrap, Mapping Master (M2), a generic XMLtoRDF tool providing a mapping document (an XML document) that links an XML Schema and an OWL ontology, Tripliser, GRDDL, R2RML, RML, ...
• We developed a customization of RML, called RocketRML (see the mapping sketch below).
• The semantify.it platform features a wrapper API where these
mappings can be stored and applied to corresponding data
sources.
• The wrapper translates the data according to the mappings and
stores it as JSON-LD in a MongoDB.
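To make the mapping idea concrete, here is a small RML mapping of the kind a processor such as RocketRML consumes; the JSON source, iterator, and fields are invented, and only the RML/R2RML vocabulary is real. Parsing it with rdflib (an assumption) at least checks its syntax.

```python
# Hypothetical sketch: an RML mapping lifting a JSON event feed into
# schema.org triples. Requires rdflib for the syntax check.
from rdflib import Graph

RML_MAPPING = """
@prefix rr:     <http://www.w3.org/ns/r2rml#> .
@prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:     <http://semweb.mmlab.be/ns/ql#> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

ex:EventMapping a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "events.json" ;                # hypothetical feed
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.events[*]"
    ] ;
    rr:subjectMap [
        rr:template "http://example.org/event/{id}" ;
        rr:class schema:Event
    ] ;
    rr:predicateObjectMap [
        rr:predicate schema:name ;
        rr:objectMap [ rml:reference "title" ]
    ] .
"""

print(len(Graph().parse(data=RML_MAPPING, format="turtle")), "mapping triples parsed")
```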
34. 3. The Crux Of The Matter
Knowledge Generation
Automatic extraction of knowledge from text representations and web
pages
• Tasks
• named entity recognition,
• concept mining, text mining,
• relation detection, …
• Methods
• Information Extraction
• Natural Language Processing (NLP)
• Machine Learning (ML)
• Systems:
• GATE (text analysis & language processing)
• OpenNLP (supports most common NLP tasks)
• RapidMiner (data preparation, machine learning, deep learning, text mining, predictive analytics)
35. 3. The Crux Of The Matter
Knowledge Generation
Evaluation of semantic annotations:
• The semantify.it validator is a web tool that offers the possibility to validate schema.org annotations that are scraped from websites.
• Verification: the annotations are checked against plain schema.org and against domain specifications.
• Validation: the annotations are checked as to whether they accurately describe the content of the website.
36. 3. The Crux Of The Matter
Knowledge Generation
Evaluation of semantic annotations:
• Notice that we take the content of the website as the gold standard.
• We do NOT evaluate the accuracy of that content with regard to the “real” world.
• We check whether a phone number conforms to the formal constraints.
• We do not make robocalls to hotels to check whether the “right” hotel picks up the phone.
37. 3. The Crux Of The Matter
Knowledge Generation
Evaluation
38. 3. The Crux Of The Matter
Knowledge Hosting
Semantify.it 1):
A platform for creating, hosting, validating, verifying, and publishing schema.org-annotated data
• annotation of static data based on schema.org templates (Domain Specifications 2))
• annotation of dynamic data based on RML mappings (RocketRML 3))
1) https://semantify.it
2) http://ds.sti2.org
3) https://github.com/semantifyit/RocketRML
39. 3. The Crux Of The Matter
Knowledge Hosting
[Diagram: Semantic Web annotations flow from an annotation tool (e.g. semantify.it) into a document store (e.g. MongoDB) and a graph database (e.g. GraphDB) that host the Knowledge Graphs]
40. 3. The Crux Of The Matter
Knowledge Hosting
• Semantically annotated data can be serialized to JSON-LD
• storage in the document store MongoDB (see the sketch below)
• native JSON storage
• well integrated with current state-of-the-art software via NodeJS
• performant search through indexing
• not hardware intensive
– but: no native RDF querying with SPARQL
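A sketch of the document-store side, assuming a locally running MongoDB and the pymongo package; the database, collection, and document values are placeholders.

```python
# Hypothetical sketch: store a JSON-LD annotation in MongoDB and query it
# back through an index. There is no native SPARQL at this layer.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # placeholder instance
annotations = client["semantify"]["annotations"]

doc = {"@context": "http://schema.org/", "@type": "Hotel", "name": "Example Hotel"}
inserted = annotations.insert_one(doc)

annotations.create_index("name")                     # performant search via indexing
found = annotations.find_one({"name": "Example Hotel"})
print(found["_id"] == inserted.inserted_id)          # True
```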
41. 3. The Crux Of The Matter
Knowledge Hosting
• Native storage of semantically annotated data
• RDF store: GraphDB (see the query sketch below)
• very powerful CRUD operations
• named graphs for versioning
• full implementation of SPARQL
• powerful reasoning over big data sets
– but: no web frameworks available
– very hardware intensive
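The RDF-store side can be sketched with SPARQLWrapper against a GraphDB repository; the endpoint URL and repository name are placeholders.

```python
# Hypothetical sketch: query a GraphDB repository over SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/tourism")  # placeholder
sparql.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT (COUNT(?h) AS ?n) WHERE { ?h a schema:Hotel . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["n"]["value"])
```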
42. 3. The Crux Of The Matter
Knowledge Curation
• We defined a simple KR formalism formalizing the essentials of schema.org
• Tbox: isA statements on types, domain and range definitions for properties (used globally or locally)
• Abox: isElementOf(i,t) statements, property-value statements p(i1,i2), and sameAs(i1,i2) statements
• This enables a formal definition of the knowledge curation tasks (assessment, cleaning, and enrichment); see the sketch below.
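To make the Abox statement types concrete, a toy rendering as Python dataclasses; this is our own illustration, not MindLab code.

```python
# Hypothetical sketch: the three Abox statement kinds of the formalism.
from dataclasses import dataclass

@dataclass(frozen=True)
class IsElementOf:            # isElementOf(i, t)
    instance: str
    type_name: str

@dataclass(frozen=True)
class PropertyValue:          # p(i1, i2)
    prop: str
    subject: str
    value: str

@dataclass(frozen=True)
class SameAs:                 # sameAs(i1, i2)
    left: str
    right: str

abox = [
    IsElementOf("ex:koelnerHaus", "schema:Restaurant"),
    PropertyValue("schema:name", "ex:koelnerHaus", "Kölner Haus"),
    SameAs("ex:koelnerHaus", "ex:koelner_haus_duplicate"),
]
print(abox)
```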
43. 3. The Crux Of The Matter
Knowledge Assessment
• Knowledge Assessment describes and defines the process
of assessing the quality of a Knowledge Graph.
• The goal is to measure the usefulness of a Knowledge Graph.
• Evaluation
• Overall process to determine the quality of a
Knowledge Graph.
• Select quality dimensions and metrics (see literature on data quality).
• Evaluate representative subsets accordingly.
44. 3. The Crux Of The Matter
Knowledge Assessment
• Correctness
• Identify the amount of wrong assertions
• Completeness
• Identify missing assertion sets
• Further dimensions:
accessibility, accuracy, appropriate amount, believability, completeness, concise representation, consistent representation, cost-effectiveness, ease of manipulation, ease of operation, ease of understanding, flexibility, freedom from error, interpretability, objectivity, relevancy, reputation, security, timeliness, traceability, understandability, value-added, and variety
45. 3. The Crux Of The Matter
Knowledge Assessment
[Paulheim et al., 2019] identify the following subtasks:
• specifying datasets and Knowledge Graphs,
• specifying the evaluation protocol,
• specifying the evaluation metrics,
• specifying the task for task-specific evaluation,
• and defining the setting in terms of intrinsic vs. task-based, and automatic vs. human-centric evaluation,
• as well as the need to keep the results reproducible.
H. Paulheim, M. Sabon, M. Choches, and W. Beck: Evaluation of Knowledge Graphs. In P. A. Bonatti, S. Decker, A. Polleres, and V. Presutti:
Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web, Dagstuhl Reports, 8(9):29-111, 2019.
46. 3. The Crux Of The Matter
Knowledge Assessment
Methodologies
• Total Data Quality Management (TDQM) [Wang, 1998] and Data Quality Assessment [Pipino et al., 2002] allow identifying important quality dimensions and their requirements from various perspectives.
• Other methodologies have already defined quality metrics that allow a semi-automatic assessment based on data integrity constraints, for example User-driven assessment [Zaveri et al., 2013], Test-driven assessment [Kontokostas et al., 2014], and a manual assessment based on crowdsourced experts (Crowdsourcing-driven assessment [Acosta et al., 2013]).
• Besides that, there are quality assessment approaches that use statistical distributions for measuring the correctness of statements [Paulheim & Bizer, 2014], or SPARQL queries for the identification of functional-dependency violations and missing values [Fürber & Hepp, 2010a] [Fürber & Hepp, 2010b].
47. 3. The Crux Of The Matter
Knowledge Assessment
Tools and Methods:
• LINK-QA
• using network metrics
• Luzzu (Linked Open Datasets)
• thirty data quality metrics based on the Dataset Quality Ontology
• Sieve
• flexibly expressing quality assessment methods
• fusion methods
• SWIQA (Semantic Web Information Quality Assessment Framework)
• data quality rules & quality scores for identifying wrong data
• Validata
• an online tool for testing/validating RDF data against ShEx schemas
48. 3. The Crux Of The Matter
Knowledge Assessment
Sieve:
• Sieve for Data Quality Assessment [Mendes et al., 2012] is a framework which consists of two modules:
• a Quality Assessment module and
• a Data Fusion module
• The Quality Assessment module involves four steps:
1. Data Quality Indicators define an aspect of a data set that may demonstrate its suitability for the intended use, for example meta-information about the creation of a data set, information about the provider, or ratings provided by the consumers.
2. Scoring Functions define the assessment of a quality indicator based on its quality dimension. Scoring functions range from simple comparisons, over set and aggregation functions, to more complex statistical functions, text analysis, or network-analysis methods (see the sketch below).
3. The Assessment Metric calculates the assessment score based on indicators and scoring functions.
4. The Aggregate Metric allows users to combine metrics into new assessment values.
• http://sieve.wbsg.de/
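As an example of a scoring function, here is a sketch of TimeCloseness following the description in the notes at the end of this deck: dates farther in the past than the given range score 0, and more recent dates score linearly closer to 1. The name matches Sieve's indicator, but the implementation is our own.

```python
# Hypothetical sketch of a Sieve-style TimeCloseness scoring function.
from datetime import date, timedelta
from typing import Optional

def time_closeness(last_updated: date, range_days: int,
                   today: Optional[date] = None) -> float:
    today = today or date.today()
    age_days = (today - last_updated).days
    if age_days < 0 or age_days > range_days:   # outside the range: score 0
        return 0.0
    return 1.0 - age_days / range_days          # more recent: closer to 1

print(time_closeness(date.today() - timedelta(days=30), range_days=365))  # ~0.92
```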
49. 3. The Crux Of The Matter
Knowledge Cleaning
• The goal of knowledge cleaning is to improve the correctness of a knowledge
graph
• Major objectives
• error detection and
• error correction of
● wrong instance assertions
● wrong property value assertions
● wrong equality assertions
51. 3. The Crux Of The Matter
Knowledge Cleaning
What | Verification | Validation
Semantic Annotations | check schema conformance and integrity constraints | compare with the web resource
Knowledge Graphs | check schema conformance and integrity constraints | compare with the “real” world
52. 3. The Crux Of The Matter
Knowledge Cleaning
Error correction of wrong instance assertions isElementOf(i,t):
• i is not a proper instance identifier:
Delete assertion or correct i
• t is not an existing type name:
Delete assertion or correct t
• The instance assertion is (semantically) wrong:
• Delete the assertion or find the proper t
• and do NOT try to find a proper i (this would neither scale nor make sense)
53. 3. The Crux Of The Matter
Knowledge Cleaning
Error correction of wrong property value assertions: p(i1,i2):
• p is not a proper property name: Delete assertion or correct p
• i1 is not a proper instance identifier: Delete assertion or correct i1
• i1 is not in any domain of p: Delete the assertion or add an assertion isElementOf(i1,t) where t is a domain of p.
• i2 is not a proper instance identifier: Delete assertion or correct i2
• i2 is not in the range of p for any domain of i1:
• Delete assertion or
• add a proper isElementOf assertion for i1 that adds a domain for which i2 is an instance of the range of the property
or
• add a proper isElementOf assertion for i2 that turns it into an instance of a range of the property applied to a domain
of p where i1 is an element.
• The property assertion is (semantically) wrong: delete the assertion or correct it. In this case, you should most likely define a proper i2, or search for a better p, or search for a better i1. (A detection sketch follows below.)
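The error cases above can be detected mechanically; here is a toy sketch with invented Tbox tables (a real system would read domains, ranges, and instance types from the Knowledge Graph).

```python
# Hypothetical sketch: detect the error cases for a property-value
# assertion p(i1, i2) against toy Tbox/Abox tables.

PROPERTIES = {"schema:containedInPlace": {"domain": "schema:Hotel",
                                          "range": "schema:Place"}}
INSTANCE_TYPES = {"ex:hotel1": "schema:Hotel", "ex:tirol": "schema:Place"}

def check_property_assertion(p, i1, i2):
    errors = []
    if p not in PROPERTIES:
        errors.append(f"{p} is not a proper property name")
    if i1 not in INSTANCE_TYPES:
        errors.append(f"{i1} is not a proper instance identifier")
    elif p in PROPERTIES and INSTANCE_TYPES[i1] != PROPERTIES[p]["domain"]:
        errors.append(f"{i1} is not in a domain of {p}")
    if i2 not in INSTANCE_TYPES:
        errors.append(f"{i2} is not a proper instance identifier")
    elif p in PROPERTIES and INSTANCE_TYPES[i2] != PROPERTIES[p]["range"]:
        errors.append(f"{i2} is not in the range of {p}")
    return errors

print(check_property_assertion("schema:containedInPlace", "ex:hotel1", "ex:tirol"))  # []
print(check_property_assertion("schema:containedInPlace", "ex:tirol", "ex:hotel1"))  # 2 errors
```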
54. 3. The Crux Of The Matter
Knowledge Cleaning
Error correction of wrong equality assertions: isSameAs(i1,i2):
• i1 is not a proper instance identifier: Delete assertion or correct i1
• i2 is not a proper instance identifier: Delete assertion or correct i2
• The identity assertion is (semantically) wrong: Delete the assertion or replace it by a SKOS operator 1.
1 which, however, does not come with an operational semantics.
55. 3. The Crux Of The Matter
Knowledge Cleaning
Methods & Tools:
• HoloClean
● uses integrity constraints,
● external data, and
● quantitative statistics.
● Steps:
• separate the input dataset into a noisy and a clean part
• assign an uncertainty score to the values of the noisy part
• compute a marginal probability for each value to be repaired
56. 3. The Crux Of The Matter
Knowledge Cleaning
Methods & Tools:
• SDValidate
● uses statistical distribution functions
● three steps:
• compute the relative predicate frequency for each statement
• assign a confidence score to each statement selected in the first step
• apply a confidence threshold
• Similar steps exist for instance assertions (SDType).
57. 3. The Crux Of The Matter
Knowledge Cleaning
Methods & Tools:
• The LOD Laundromat [Beek et al., 2014]
● cleans Linked Open Data
● takes a SPARQL endpoint or archived dataset as input
● guesses the serialisation format
● identifies syntax errors using a library while parsing the RDF
● saves the RDF data in a canonical format
[Beek et al., 2014] W. Beek, L. Rietveld, H. R. Bazoobandi, J. Wielemaker, and S. Schlobach: LOD Laundromat: A Uniform Way of Publishing
Other People’s Dirty Data. In Proceedings of the 13th International Semantic Web Conference (ISWC2014), Springer, LNCS 8796, Riva del
Garda, Italy, October 19-23, 2014.
58. 3. The Crux Of The Matter
Knowledge Cleaning
Methods & Tools:
• KATARA [Chu et al., 2015]
● identifies correct & incorrect data
● generates possible corrections for wrong data
[Chu et al., 2015] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye: KATARA: reliable data cleaning with knowledge bases
and crowdsourcing. In Proceedings of the 41st International Conference on Very Large Data Bases (PVLDB2015), VLDB Endowment, 8(12),
2015.
59. 3. The Crux Of The Matter
Knowledge Cleaning
Methods & Tools:
• SPIN [Fürber & Hepp, 2010b]
● a SPARQL-based constraint language
● generates SPARQL query templates based on data-quality problems:
• inconsistency
• lack of comprehensibility
• heterogeneity
• redundancy
• SPIN has since evolved into SHACL, a language for validating RDF graphs.
[Fürber & Hepp, 2010b] C. Fürber and M. Hepp: Using semantic web resources for data quality management. In Proceedings of the 17th
International Conference on Knowledge Engineering and Management by the Masses (EKAW2010), Springer, LNCS 6317, Lisbon, Portugal,
October 11-15, 2010.
60. 3. The Crux Of The Matter
Knowledge Enrichment
• The goal of knowledge enrichment is to improve the completeness of a
knowledge graph by adding new statements
• The process of Knowledge Enrichment has four phases:
• New Knowledge Source detection
• New Knowledge Source integration
• Duplicate detection and alignment
• Property-Value-Statements correction
61. 3. The Crux Of The Matter
Knowledge Enrichment
• Knowledge Source detection
• search for additional sources of assertions for the KG
• Open sources
• Closed sources
• Knowledge Source integration
• Tbox: define mappings
• Abox: integrate new assertions into the KG
• Identifying and resolving duplicates
• Correcting invalid property statements, such as domain/range violations and multiple values for a unique property
• also known in the data-quality literature as contradicting or uncertain attribute-value resolution
63. 3. The Crux Of The Matter
Knowledge Enrichment
Methods and tools for duplicate detection and resolution:
• Silk is a framework for entity linking [Volz et al., 2009].
• It tackles three tasks:
1. link discovery, which defines similarity metrics to calculate a total similarity value for a pair of entities (see the sketch below),
2. evaluation of the correctness and completeness of the generated links, and
3. a protocol for maintaining data that allows the source and target datasets to exchange generated link sets.
http://silkframework.org/
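The core of link discovery is a similarity metric over entity pairs plus an acceptance threshold. The sketch below is in the spirit of Silk's linkage rules but is not the Silk DSL; difflib's ratio merely stands in for a proper metric, and the datasets are invented.

```python
# Hypothetical sketch: threshold-based duplicate detection between a source
# and a target dataset using a string-similarity metric.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source = {"s1": "Hotel Post Innsbruck", "s2": "Kölner Haus"}
target = {"t1": "Hotel Post, Innsbruck", "t2": "Serfaus Bergbahn"}

links = [(s, t, round(similarity(sn, tn), 3))
         for s, sn in source.items()
         for t, tn in target.items()
         if similarity(sn, tn) >= 0.85]          # acceptance threshold
print(links)   # one candidate sameAs link: ('s1', 't1', ...)
```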
64. 3. The Crux Of The Matter
Knowledge Enrichment
Methods and tools for duplicate detection and resolution:
• Legato [Achichi et al., 2017] is a linking tool based on indexing techniques.
• It implements the following steps:
1. data cleaning, which filters out properties from the two input datasets that do not help the comparison;
2. instance profiling, which creates instance profiles based on the Concise Bounded Description of each source resource;
3. pre-matching, which applies indexing techniques (TF-IDF values), filters such as tokenization and stop-word removal, and cosine similarity to preselect entity links;
4. link repairing, which validates each produced link by searching for a link to the target source.
[Achichi et al., 2017] M. Achichi, Z. Bellahsene, and K. Todorov: Legato results for OAEI 2017. In Proceedings of the 12th International Workshop on Ontology Matching (OM2017) co-located with the 16th International Semantic Web Conference (ISWC2017), CEUR Workshop Proceedings, vol. 2032, Vienna, Austria, October 21, 2017.
65. 3. The Crux Of The Matter
Knowledge Enrichment
Methods and tools for duplicate detection and resolution:
• SERIMI [Araujo et al., 2011] tries to match instances between two datasets.
• It has three steps:
• property selection, which allows users to select relevant properties from the source dataset,
• candidate selection from the target dataset, which uses string matching on properties to select a set of candidates, and
• candidate disambiguation, which measures the similarity for each candidate using a contrast model that returns a degree of confidence.
• ADEL, Duke, Dedupe, LIMES, ...
[Araujo et al., 2011] S. Araujo, J. Hidders, D. Schwabe, and A. P. de Vries: SERIMI - Resource Description Similarity, RDF Instance Matching and Interlinking.
In Proceedings of the 6th International Workshop on Ontology Matching (OM2011), CEUR Workshop, vol. 814, Bonn, Germany, October 24, 2011.
66. 3. The Crux Of The Matter
Knowledge Enrichment
Property-Value-Statements correction:
• KnoFuss allows data fusion using different methods [Nikolov et al., 2008].
• The workflow of KnoFuss is as follows:
1. it receives a dataset to be integrated into the target dataset,
2. it performs co-referencing using a similarity method, detects conflicts using ontological constraints, and resolves inconsistencies, and
3. it produces a dataset ready to be integrated into the target dataset.
• http://technologies.kmi.open.ac.uk/knofuss/
67. 3. The Crux Of The Matter
Knowledge Enrichment
Property-Value-Statements correction:
• ODCleanStore [Michelfeit & Necaský, 2012] is a framework for cleaning, linking, quality
assessment, and fusing RDF data.
• The fusion module allows users to configure conflict-resolution strategies based on provenance and quality metadata, e.g.:
1. an arbitrary value (ANY, MIN, MAX, SHORTEST, or LONGEST) is selected from the conflicting values,
2. the AVG, MEDIAN, or CONCAT of the conflicting values is computed,
3. the value with the highest (BEST) aggregate quality is selected,
4. the value with the newest (LATEST) timestamp is selected, or
5. ALL input values are preserved (see the sketch below).
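The listed strategies boil down to simple resolver functions over the set of conflicting values; here is a toy sketch in which each value carries the timestamp of its source statement (the input shape is invented, not ODCleanStore's actual API).

```python
# Hypothetical sketch: ODCleanStore-style conflict-resolution policies.
from datetime import date

conflicting = [                 # three sources disagree on a room count
    {"value": 42, "updated": date(2019, 1, 10)},
    {"value": 40, "updated": date(2018, 6, 1)},
    {"value": 42, "updated": date(2017, 3, 5)},
]

RESOLVERS = {
    "MAX":    lambda vs: max(v["value"] for v in vs),
    "MIN":    lambda vs: min(v["value"] for v in vs),
    "LATEST": lambda vs: max(vs, key=lambda v: v["updated"])["value"],
    "ALL":    lambda vs: sorted({v["value"] for v in vs}),
}

for policy, resolve in RESOLVERS.items():
    print(policy, "->", resolve(conflicting))
```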
68. 3. The Crux Of The Matter
Knowledge Enrichment
Property-Value-Statements correction:
• Sieve [Mendes et al., 2012] is a framework that consists of two modules: a Quality Assessment module and a Data Fusion module.
• The Data Fusion module describes various fusion policies that are applied for fusing conflicting values.
• FAG, FuSem, MumMer, …
69. 3. The Crux Of The Matter
Knowledge Deployment
• Building, implementing, and curating Knowledge Graphs is a time-consuming and costly activity.
• Integrating large amounts of facts from heterogeneous information sources does not come for free.
• [Paulheim, 2018b] estimates the average cost of one fact in a Knowledge Graph at between $0.1 and $6, depending on the degree of mechanization.
[Paulheim, 2018b] H. Paulheim: How much is a Triple? Estimating the Cost of Knowledge Graph Creation. In Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, co-located with the 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8-12, 2018. http://www.heikopaulheim.com/docs/iswc_bluesky_cost2018.pdf
71. 3. The Crux Of The Matter
Knowledge Deployment
• We build a knowledge-access layer on top of the Knowledge Graph that helps connect this resource to applications.
• Knowledge management technology:
• based on graph-based repositories hosting the Knowledge Graph (as a semantic data lake).
• The knowledge management layer is responsible for storing, managing, and providing semantic descriptions of resources.
• Inference engines (SemBase) based on deductive reasoning:
• implement agents that define views on this graph together with context data on user requests.
• An agent accesses the graph to gather data for its reasoning, which provides input to the dialogue engine interacting with the human user.
• Reasons:
• helps to implement access rights and to bypass inconsistencies and frillions
• integrates additional information sources from the application (context, personalization, task, etc.)
72. 3. The Crux Of The Matter
Knowledge Deployment
[Architecture diagram: Input (editing, crawling, mapping) → MongoDB / semantify.it → Storage (GraphDB hosting the Knowledge Graph) → Output (views provided by reasoning agents)]
73. 3. The Crux Of The Matter
Knowledge Deployment
[Stack diagram: Knowledge Infrastructure → Generic Application Layer → Conversational Interfaces]
74. 4. The Proof Of The Pudding Is In The Eating
Onlim
• The pioneer in automating customer communication via AI chatbots and
conversational interfaces
• Enterprise solutions for making data and knowledge available for conversational
interfaces
• Team of 25+ highly experienced AI experts, specialists in semantics and data science
• Spin-off of the University of Innsbruck
• HQ in Europe (Vienna, Telfs)
Current Focus Verticals: Utilities | Tourism | Retail | Education | Financial Services
75. 4. The Proof Of The Pudding Is In The Eating
Onlim
76. 4. The Proof Of The Pudding Is In The Eating
• The chatbot market is expected to grow from its current (2018) value of more than $250 million to over $1.34 billion by 2024.
• The growth is due to the evolving usage of chatbots for content-marketing activities such as digital marketing and advertising.
• With the rise of Artificial Intelligence (AI) and conversational user interfaces, we are more likely than ever before to interact with a bot.
• Businesses are following customers onto messaging platforms: 90% of businesses use Facebook to respond to service requests.
• The shift from social media toward conversational interfaces is also impressive. Bots on Facebook Messenger can tremendously help businesses in dealing with these requests.
• https://www.sdcexec.com/software-technology/news/21011880/chatbot-market-to-grow-at-31-percent-cagr-from-2018-to-2024
• https://www.gartner.com/smarterwithgartner/gartner-predicts-a-virtual-world-of-exponential-change/
• https://www.businessinsider.in/tech/data-a-massive-hidden-shift-is-driving-companies-to-use-a-i-bots-inside-facebook-messenger/slidelist/52240155.cms
77. 4. The Proof Of The Pudding Is In The Eating
• In 2017, 20% of web searches were conducted via voice assistants.
• Artificial intelligence-based voice assistance (AI-voice) will soon be a primary user interface for all digital
devices – including smartphones, smart speakers, personal computers, automobiles, and home appliances.
• As of mid-January 2019, more than 1 billion devices worldwide were equipped with Google’s AI-voice
Assistant, and another hundred million devices spoke with Amazon’s Alexa – and neither number accounts
for devices equipped with voice assistants from Apple, Microsoft, Samsung, or across the digital worlds of
China and Asia.
• Juniper Research forecasts the global market for voice assistants to grow at a 25.4 percent CAGR over the
next five years, with 8 billion active voice assistants (across all platforms and devices) by 2023.
https://voicebot.ai/2019/01/07/google-assistant-to-be-available-on-1-billion-devices-this-month-10x-more-than-alexa/
https://www.juniperresearch.com/press/press-releases/digital-voice-assistants-in-use-to-triple
78. 4. The Proof Of The Pudding Is In The Eating
• Chatbots and voice assistants have started to play an increasing role in customer communication for many businesses in various verticals.
• Especially in tourism, they provide more and more benefits in terms of convenience, availability, and fast access to information delivery and customer support throughout the entire customer journey.
• In the dreaming and planning phase, hotels and Destination Management Organizations (DMOs) can provide potential guests with information about the hotel and/or the region, the surroundings, and weather conditions through chatbots and voice assistants.
• In the booking phase, everything from booking the hotel and transport to buying connected services (e.g. ski tickets) becomes much simpler and more efficient using natural language.
• Finally, in the experience phase, chatbots and voice assistants can also announce special offers or events. All requested information and processes are available instantly, 24/7/365. For hotel guests in particular, the stay experience can be enriched by providing access to hotel services and beyond.
79. 4. The Proof Of The Pudding Is In The Eating
• A Touristic Knowledge Graph integrates and connects data from several sources, including:
• touristic data sources:
• open data sources:
• It includes entities of the following types:
• LocalBusiness
• POIs, Infrastructure
• SportsActivityLocations (e.g. Trails, SkiResorts)
• Events
• Offers
• WebCams
• Mobility and Transport
80. 4. The Proof Of The Pudding Is In The Eating
Touristic Knowledge Graph excerpt: SkiResort, Lifts, Slopes, WebCams
[Data visualisation (based on GraphDB): a SkiResort with containedInPlace links to Slope, SkiRoute, WebCam, and SnowReport; SkiLift with subClassOf relations to CableCar, ChairLift, and TBar]
81. 4. The Proof Of The Pudding Is In The Eating
The Touristic KG is used to answer questions such as:
• “Where can I have traditional Tyrolean food when going cross-country skiing?”
• “Show me WebCams near Kölner Haus”
• “How many people are living in Serfaus?”
82. 4. The Proof Of The Pudding Is In The Eating
The Dach-KG working group
• develops a de facto standard for the semantic annotation of touristic content, data, and services in the DACH area
• based on schema.org and its adaptation through domain specifications
• it should become the backbone of an open 5-star* Knowledge Graph for touristic data in DACH
*) One star: the data are provided under an open license.
**) Two stars: the data are available as structured data.
***) Three stars: the data are also available in a non-proprietary format.
****) Four stars: URIs are used, so the data can be referenced.
*****) Five stars: the data set is linked to other data sets that can provide context.
https://www.tourismuszukunft.de/2019/05/dach-kg-neue-ergebnisse-naechste-schritte-beim-thema-open-data/
83. 4. The Proof Of The Pudding Is In The Eating
Members of the Dach-KG working group
• Touristic experts from the DACH region (Germany (D), Austria (A), Switzerland (CH)) and Italy (South Tyrol)
• the Austrian and German touristic associations,
• LTOs (Tirol, Vorarlberg, Wien, Brandenburg, Thüringen, …)
• Associated: DMOs (Mayrhofen, Seefeld, …)
• STI Innsbruck and STI International
• An extension with technology providers is planned (Datacycle, Feratel, Hubermedia, infomax, LandinSicht, Onlim, Outdooractive, TSO, ...)
84. 4. The Proof Of The Pudding Is In The Eating
We build the Tyrol Knowledge Graph (TKG) as a nucleus for this initiative.
• It is a five-star linked open data set published in GraphDB, providing a SPARQL endpoint for the provisioning of touristic data of Tyrol, Austria.
• The TKG currently contains data about touristic infrastructure such as accommodation businesses, restaurants, points of interest, events, recipes, etc. The data of the TKG fall into three categories:
• Static data is information that rarely changes, like the address of a hotel.
• Dynamic data is fast-changing information, like availabilities and prices.
• Active data describes actions that can be executed, for example the description of a purchase or a reservation.
• As of November 25, 2018, the TKG contained around 5 billion statements, of which 55% are explicit and 45% are inferred. The Knowledge Graph grows by around 8 million statements every day.
• http://graphdb.sti2.at:8080/
85. 4. The Proof Of The Pudding Is In The Eating
There is a world beyond leisure:
Utilities | Tourism | Retail | Education | Financial Services
Key Takeaway:
• Understand the information needs and goals of the users (Natural Language Understanding): design intents; train the NLU (scaling), especially entity detection.
• Map intent & parameters to create a query for accessing the KG.
• Query the Knowledge Graph: define views; the KG integrates large volumes of heterogeneous, distributed, dynamic, and potentially inconsistent statements.
• Natural Language Generation to present the result to the user.
• If it worked, we would not need it.
• Subgraph consistency
• Data lake
[Acosta et al., 2013] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann: Crowdsourcing linked data quality assessment. In Proceedings of the 12th International Semantic Web Conference (ISWC2013), Springer, LNCS 8219, Sydney, Australia, October 21-25, 2013.
[Fürber & Hepp, 2010a] C. Fürber and M. Hepp: Using SPARQL and SPIN for data quality management on the semantic web. In Proceedings of the 13th International Conference on Business Information Systems (BIS2010), Springer, LNBIP 47, Berlin, Germany, May 3-5, 2010.
[Fürber & Hepp, 2010b] C. Fürber and M. Hepp: Using semantic web resources for data quality management. In Proceedings of the 17th International Conference on Knowledge Engineering and Management by the Masses (EKAW2010), Springer, LNCS 6317, Lisbon, Portugal, October 11-15, 2010.
[Kontokostas et al., 2014] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri: Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14), ACM, Seoul, Korea, April 07-11, 2014.
[Paulheim & Bizer, 2014] H. Paulheim and C. Bizer: Improving the Quality of Linked Data Using Statistical Distributions, International Journal on Semantic Web and Information Systems (IJSWIS), 10(2):63-86, 2014.
[Pipino et al., 2002] L. L. Pipino, Y. W. Lee, and R. Y. Wang: Data Quality Assessment, Communications of the ACM, 45(4), 2002.
[Wang, 1998] R. Y. Wang: A Product Perspective on Total Data Quality Management, Communication of the ACM, 4(2), 1998.
[Zaveri et al., 2013] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann: User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13), ACM, Graz, Austria, September 04 - 06, 2013.
LINK-QA
C. Guéret, P.T. Groth, C. Stadler, and J. Lehmann: Assessing linked data mappings using network measures. In Proceedings of the 9th Extended Semantic Web Conference: Research and Applications (ESWC2012), Springer, LNCS 7295, Heraklion, Greece, May 27-31, 2012.
Luzzu (A Quality Assessment Framework for Linked Open Datasets)
J. Debattista, S. Auer and C. Lange: Luzzu: A Methodology and Framework for Linked Data Quality Assessment, Journal of Data and Information Quality (JDIQ), 8(1), 2016.
Sieve
P. N. Mendes, H. Mühleisen, and C. Bizer: Sieve: Linked Data Quality Assessment and Fusion. In Proceedings of the Second International Workshop on Linked Web Data Management (LWDM 2012), in conjunction EDBT2012, Berlin, Germany, March 30, 2012.
SWIQA (Semantic Web Information Quality Assessment Framework)
C. Fürber and M. Hepp: SWIQA - a semantic web information quality assessment framework. In Proceedings of the 19th European Conference on Information Systems (ECIS2011), Association for Information Systems Electronic Library, ECIS 76, Helsinki, Finland, June 9-11, 2011. https://aisel.aisnet.org/ecis2011/76
Validata
J.B. Hansen, A. Beveridge, R. Farmer, L. Gehrmann, A.J.G. Gray, S. Khutan, T. Robertson, and J. Val: Validata: An online tool for testing RDF data conformance. In Proceedings of the 8th International Conference on Semantic Web Applications and Tools for Life Sciences (SWAT4LS2015), CEUR Workshop Proceedings, vol. 1546, Cambridge, UK, December 7-10, 2015.
Scoring functions (from Sieve):
• TimeCloseness: measures the distance from the input date (obtained from the input metadata through a path expression) to the current (system) date. Dates outside the range (given in number of days) receive the value 0, and more recent dates receive values closer to 1.
• Preference: assigns decreasing, uniformly distributed real values to each graph URI provided as a space-separated list.
• SetMembership: assigns 1 if the value of the indicator provided as input belongs to the set given as a parameter, 0 otherwise.
• Threshold: assigns 1 if the value of the indicator provided as input is higher than a threshold given as a parameter, 0 otherwise.
• IntervalMembership: assigns 1 if the value of the indicator provided as input is within the interval given as a parameter, 0 otherwise.
[Mendes et al., 2012] P. N. Mendes, H. Mühleisen, and C. Bizer: Sieve: Linked Data Quality Assessment and Fusion. In Proceedings of the Second International Workshop on Linked Web Data Management (LWDM 2012), in conjunction with EDBT2012, Berlin, Germany, March 30, 2012.
HoloClean [Rekatsinas et al., 2017] uses various approaches, such as integrity constraints, external data, and quantitative statistics, to detect errors. HoloClean's workflow follows three steps:
First, HoloClean takes a dataset, along with a set of methods (such as denial constraints, outlier detection, external dictionaries, or labeled data) for detecting erroneous data. It separates the input dataset into a noisy and a clean part.
Second, HoloClean assigns an uncertainty score to the values of the noisy part, based on a probabilistic model generated using a DDlog program.
Third, HoloClean computes a marginal probability for each value to be repaired, which expresses the confidence in this repair.
[Rekatsinas et al., 2017] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré: HoloClean: Holistic data repairs with probabilistic inference. In Proceedings of the Very Large Data Bases Endowment (PVLDB), VLDB Endowment,10(11), 2017.
SDValidate [Paulheim & Bizer, 2014] uses statistical distributions to assess the correctness of statements (assigning a confidence score). It involves three main steps:
First, it computes the relative predicate (predicate/object combination) frequency for each statement; statements with a low frequency are selected for detailed analysis.
Second, for each statement selected in the first step, SDValidate uses the statistical distributions of properties and types (the predicate's subject/object combination) to assign a confidence score to the statement.
Third, SDValidate applies a confidence threshold above which statements are considered to be true.
Similarly, there exists SDType, which applies statistical distributions to detect type-assertion errors.
[Paulheim & Bizer, 2014] H. Paulheim and C. Bizer: Improving the Quality of Linked Data Using Statistical Distributions, International Journal on Semantic Web and Information Systems (IJSWIS), 10(2):63-86, 2014.
KATARA identifies correct and incorrect data and generates possible corrections for wrong data. Basically, KATARA involves three steps:
First, KATARA allows the user to select the target data table and the trusted knowledge base.
Second, KATARA identifies the pattern (coherence of types and relationships) of the target data in the trusted knowledge base, and the user validates the pattern.
Third, KATARA annotates each value and tuple (pair of values) as correct if they have the respective type and relations in the trusted knowledge base, or otherwise as incorrect.
Examples of handled problems: missing datatype properties, functional-dependency violations, mistyping errors, and unique-value violations.
[Volz et al., 2009] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov: Discovering and Maintaining Links on the Web of Data. In Proceedings of the 8th International Semantic Web Conference (ISWC 2009), Washington, DC, Springer, LNCS 5823, October 25-29, 2009.
[Nikolov et al., 2008] A. Nikolov, V. Uren, E. Motta, and A. de Roeck: KnoFuss: Integration of Semantically Annotated Data by the KnoFuss Architecture. In Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW2008), Springer, LNCS 5268, Acitrezza, Italy, September 29 - October 2, 2008.
[Michelfeit & Necaský, 2012] J. Michelfeit and M. Necaský: Linked open data aggregation: Conflict resolution and aggregate quality. In Proceedings of the 36th Annual IEEE Computer Software and Applications Conference Workshops (COMPSAC2012), IEEE, Izmir, Turkey, July 16-20, 2012.
Fusion describes the name and description of a data fusion policy, e.g. name="Fusion strategy for DBpedia City Entities".
Class defines a subset of the input that belongs to a given class, e.g. Class name="dbpedia:City".
Property defines a property to which a FusionFunction is applied, e.g. Property name="dbpedia:areaTotal".
FusionFunction specifies the FusionFunction class used to fuse values for a given property, e.g. FusionFunction class="KeepValueWithHighestScore" metric="sieve:lastUpdated".
P. N. Mendes, H. Mühleisen, and C. Bizer: Sieve: Linked Data Quality Assessment and Fusion. In Proceedings of the Second International Workshop on Linked Web Data Management (LWDM 2012), in conjunction EDBT2012, Berlin, Germany, March 30, 2012.