The presentation of my public PhD defense on March 10, 2022. The related video is available at https://www.youtube.com/watch?v=NofQSwc3Svk
This doctoral thesis addresses how to support users in assessing, creating and using Knowledge Graph restrictions.
More concretely, this dissertation contributes the FAIR Montolo statistics, which support users in assessing existing Knowledge Graphs based on the restrictions they use.
The two visual notations ShapeUML and ShapeVOWL are presented and evaluated: they represent all constraint types of the Shapes Constraint Language (SHACL) and thus advance the state of the art.
Finally, the use of restrictions to represent formal meaning and to assess data quality is demonstrated for a social media archiving use case in the BESOCIAL project of the Royal Library of Belgium (KBR).
Assessing, Creating and Using Knowledge Graph Restrictions
1. Assessing, Creating and Using
Knowledge Graph Restrictions
Sven Lieber, supervised by Anastasia Dimou and Ruben Verborgh
10.03.2022 - public PhD defense
3. Assessing, Creating and Using
? ? ?
Sven Lieber, supervised by Anastasia Dimou and Ruben Verborgh
10.03.2022 - public PhD defense
?
5. This PhD is about information processing
Telescope science?
=> Astronomy!
Microscope science?
=> Biology!
Computer science?
=> Information!
“Computer science involves
the study of or the practice of
computation, automation,
and information” - Wikipedia
8. “24 hours in photos”, 2011 from Erik Kessels
350k printed images uploaded to Flickr in a single day
Large amount of unconnected data
What is on these two images,
and are they connected somehow?
9. What is what? We need semantics!
“a separate seat for one person,
typically with a back and four legs.”
- Oxford Languages
“the person in charge of a meeting or
of an organization (used as a neutral
alternative to chairman or
chairwoman)” - Oxford Languages
Please think of “a chair”
10. Different definitions and understanding about data
Data Silo 1, Data Silo 2, Data Silo 3
A person is alive, has a first and last name and has a residence address
A person is real or fictional
A person is the user of the app identified via an Email address
How many persons can we reach with our marketing campaign in Ghent?
11. Data and data modeling using a graph
[Figure: example graph. Classes: Person, Organization, University, Supervisor, PhD student. Entities: Sven, Anastasia, Ruben, UGent. Relations: "is subclass", "is a", "is enrolled at", "knows".]
A Knowledge Graph
(i) real world entities in a graph structure
(ii) classes and relations in a schema
(iii) linking of arbitrary entities
(iv) covers various topical domains
“Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods”, Semantic Web Journal, 2016, Heiko Paulheim
12. Link data in a flexible way
13. Express the data model in a flexible way
14. A uniform graph representation
15. Data integration because of reused definitions of things
Data Silo 1, Data Silo 2, Data Silo 3
A person is alive, has a first and last name and has a residence address
A person is real or fictional
A person is the user of the app identified via an Email address
“A vocabulary defines the concepts and relationships describing an area of concern” - World Wide Web Consortium (W3C)
16. Crash course about the context
-> Represent data in a uniform graph structure
PhD presentation
17. But how can this be used by a computer?
Data Silo 1, Data Silo 2, Data Silo 3
A person is alive, has a first and last name and has a residence address
A person is real or fictional
A person is the user of the app identified via an Email address
18. Keep the flexible graph representation
in a computer readable text format by using “triples”
Person is a Class .
PhD student is subclass Person .
Supervisor is subclass Person .
University is subclass Organization .
Anastasia is a Supervisor .
Ruben is a Supervisor .
Sven is a PhD student .
UGent is a University .
Sven is enrolled at UGent .
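The triples above can be held in a program as plain (subject, predicate, object) tuples. The following is a minimal Python sketch, not part of the original slides; the helper names `objects_of` and `subjects_of` are hypothetical and chosen only for illustration.

```python
# Each triple from the slide as a (subject, predicate, object) tuple.
triples = {
    ("Person", "is a", "Class"),
    ("PhD student", "is subclass", "Person"),
    ("Supervisor", "is subclass", "Person"),
    ("University", "is subclass", "Organization"),
    ("Anastasia", "is a", "Supervisor"),
    ("Ruben", "is a", "Supervisor"),
    ("Sven", "is a", "PhD student"),
    ("UGent", "is a", "University"),
    ("Sven", "is enrolled at", "UGent"),
}


def objects_of(subject, predicate):
    """Return all objects linked from `subject` via `predicate` (hypothetical helper)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects_of(predicate, obj):
    """Return all subjects that point to `obj` via `predicate` (hypothetical helper)."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


print(objects_of("Sven", "is enrolled at"))
print(subjects_of("is a", "Supervisor"))
```

Because every statement has the same three-part form, the same two lookups work for schema statements ("is subclass") and for data statements ("is enrolled at") alike.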
19. Reuse the web as global information system
Person is a Class .
<http://xmlns.com/foaf/0.1/Person>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/2000/01/rdf-schema#Class> .
foaf:Person rdf:type rdfs:Class .
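The short form and the long form of this triple say the same thing: each prefix stands for a namespace IRI. A minimal Python sketch of that expansion follows; the function name `expand` is hypothetical, while the three namespace IRIs are the ones shown on the slide.

```python
# Prefix -> namespace IRI, as used on the slide.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'foaf:Person' into a full IRI (hypothetical helper)."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


short_triple = ("foaf:Person", "rdf:type", "rdfs:Class")
full_triple = tuple(expand(term) for term in short_triple)
print(full_triple)
```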
20. Reuse the web as global information system
-> reuse of definitions for shared understanding
-> link to existing data
foaf:Person rdf:type rdfs:Class .
ex:PhDStudent rdfs:subClassOf foaf:Person .
ex:Supervisor rdfs:subClassOf foaf:Person .
ex:University rdfs:subClassOf foaf:Organization .
data:anastasia rdf:type ex:Supervisor .
data:ruben rdf:type ex:Supervisor .
data:sven rdf:type ex:PhDStudent .
data:ugent rdf:type ex:University .
data:sven ex:enrolledAt data:ugent .
data:sven foaf:givenName "Sven" .
data:sven foaf:familyName "Lieber" .
21. Crash course about the context
-> Data in a uniform graph structure
-> Use the web to represent the graph
PhD presentation
22. This does not seem right to us …
UGent Sven
is enrolled at
wroteBook
Train 123
… but okay for a computer
because we did not restrict possible links
23. Let’s talk about semantics … again
“a separate seat for one person,
typically with a back and four legs.”
- Oxford Languages
“the person in charge of a meeting or
of an organization (used as a neutral
alternative to chairman or
chairwoman)” - Oxford Languages
25. We can distinguish now between different things
“a separate seat for one person,
typically with a back and four legs.”
- Oxford Languages
“the person in charge of a meeting or
of an organization (used as a neutral
alternative to chairman or
chairwoman)” - Oxford Languages
26. Without restrictions a computer cannot differentiate
Sven
knows
?
Domain and Range axioms: “knows”
connects two instances of class Person
Axioms are “statements that are asserted to be true in the
domain being described” - OWL2 Structural Specification
and Functional-Style Syntax, W3C 2012
27. Provide formal meaning using axioms which supports
inferring new knowledge
Sven
knows
Domain and Range axioms: “knows”
connects two instances of class Person
Person
is a
is a
new “is a” relationships inferred!
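The inference on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reasoner: the function name `infer_types` is hypothetical, and because the second entity is unnamed on the slide, the placeholder "X" stands in for it.

```python
def infer_types(triples, domain, range_):
    """Apply domain and range axioms once and return the newly inferred 'is a' triples."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in domain:   # the subject must be an instance of the domain class
            inferred.add((subject, "is a", domain[predicate]))
        if predicate in range_:   # the object must be an instance of the range class
            inferred.add((obj, "is a", range_[predicate]))
    return inferred - set(triples)


# "X" is a placeholder for the unnamed entity on the slide.
data = {("Sven", "knows", "X")}
new_triples = infer_types(data, domain={"knows": "Person"}, range_={"knows": "Person"})
print(sorted(new_triples))
```

Note that the axioms do not reject anything here: they add knowledge. Whatever "knows" connects is concluded to be a Person.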
28. What can be inferred here?
Sven
knows
4
has legs
?
Axiom: something with 4 legs is a chair
29. Oops, we created a Person-Chair
Sven
knows
4
has legs
Person
Chair
Axiom: something with 4 legs is a chair
is a
is a
is a
new “is a” relationships inferred!
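A small Python sketch shows how the two axioms together produce the Person-Chair. It hard-codes the two rules from the slides and is only an illustration; the placeholder "X" for the unnamed entity and the function name `infer` are assumptions.

```python
# "X" is a placeholder for the unnamed entity on the slide.
data = {
    ("Sven", "knows", "X"),
    ("X", "has legs", 4),
}


def infer(triples):
    """Apply the two axioms from the slides once and return the inferred 'is a' triples."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate == "knows":  # domain and range of "knows" are Person
            inferred.add((subject, "is a", "Person"))
            inferred.add((obj, "is a", "Person"))
        if predicate == "has legs" and obj == 4:  # something with 4 legs is a chair
            inferred.add((subject, "is a", "Chair"))
    return inferred


types_of_x = sorted(o for s, p, o in infer(data) if s == "X" and p == "is a")
print(types_of_x)
```

Each axiom is reasonable on its own; the surprising result only appears when both apply to the same entity.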
30. Use constraints to define what is valid
Data shapes express “structural constraints to
validate instance data” - SHACL Use Cases and
Requirements, W3C 2017
Person
Birth date
Last name
First name
For example: persons need a birth
date, last name and first name
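In contrast to axioms, a constraint does not infer anything: it only reports whether given data is valid. A minimal Python sketch of the example constraint follows; the function name `validate_person`, the dictionary layout and the sample records are hypothetical.

```python
# Properties every person record needs according to the example on the slide.
REQUIRED_PROPERTIES = ("birth date", "last name", "first name")


def validate_person(record):
    """Return the required properties that are missing or empty in `record`."""
    return [prop for prop in REQUIRED_PROPERTIES if not record.get(prop)]


# Hypothetical records, invented for illustration only.
complete = {"first name": "Alex", "last name": "Example", "birth date": "2000-01-01"}
incomplete = {"first name": "Alex", "last name": "Example"}

print(validate_person(complete))
print(validate_person(incomplete))
```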
31. Vocabulary -> Ontology
“An Ontology is a formal, explicit
specification of a shared
conceptualization” - Thomas R. Gruber
(1993)
Person
Organization
University
Supervisor
is subclass
is subclass is subclass
PhD
student
knows
is enrolled at
“A Conceptualization is an intensional
semantic structure which encodes the
implicit rules constraining the
structure of a piece of reality” -
Guarino et al. (1995)
“The OWL 2 RDF-Based Semantics gives a formal meaning to
every RDF graph” - OWL2 RDF-based Semantics, W3C 2012
32. The use of restrictions varies in practice
Only subclasses: structured metadata in websites using schema.org
Different restrictions defining formal meaning: Neuro Behavior ontology (NBO)
A program to infer knowledge (a reasoner) needs formal meaning
33. Crash course about the context
-> Data in a uniform graph structure
-> Use the web to represent the graph
-> We can restrict meaning using axioms or restrict what is
valid using constraints
PhD presentation
34. Crash course about the context
PhD presentation
Congratulations, you passed Knowledge Graphs 101
35. Assessing, Creating and Using
Knowledge Graph Restrictions
Sven Lieber, supervised by Anastasia Dimou and Ruben Verborgh
10.03.2022 - public PhD defense
36. Users need support
Assessing restrictions using Montolo
Creating restrictions using visual notations
Using restrictions to enable data stewardship
Conclusion
38. Imagine you want to create an application (data model)
Reuse existing concepts which fit your use case
for example an event planning app
39. Reusing ontologies is usually a multi-step process
Discovery of reuse
candidates
Selection of
relevant
ontologies
Customization
and integration
of reused
ontologies
40. Imagine you want to create an application (data model)
Reuse existing concepts which fit your use case
for example an event planning app
Create your own local constraints
for example Corona measures which temporarily apply
41. Creating constraints
Person
Birth date
Last name
First name
USER
schema:DatedMoneySpecification
    rdf:type sh:NodeShape ;
    sh:closed "true"^^xsd:boolean ;
    sh:ignoredProperties (
        rdf:type
    ) ;
    sh:property [
        sh:path schema:amount ;
        sh:datatype xsd:float ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path schema:currency ;
        rdfs:comment "The currency code (here) is a mandatory property consisting of three upper-case letters" ;
        sh:datatype xsd:string ;
        sh:flags "i" ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:pattern "^[A-Z]{3}$" ;
    ] ;
What users get!
What users want
is visual support!
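To make the textual shape above more tangible, the following Python sketch re-implements its checks by hand. It is not a SHACL engine and not how SHACL validators work internally: the function name `validate`, the dictionary layout of a node and the violation messages are assumptions, Python's `float` stands in for `xsd:float`, and datatypes are only checked when exactly one value is present.

```python
import re

# sh:pattern "^[A-Z]{3}$" combined with sh:flags "i" (case-insensitive matching)
CURRENCY_PATTERN = re.compile(r"^[A-Z]{3}$", re.IGNORECASE)
# sh:closed true: only the declared paths plus the ignored rdf:type are allowed
ALLOWED_PROPERTIES = {"schema:amount", "schema:currency", "rdf:type"}


def validate(node):
    """Check a node given as {property: [values]} and return a list of violation messages."""
    violations = [
        f"{prop}: not allowed in closed shape"
        for prop in sorted(set(node) - ALLOWED_PROPERTIES)
    ]
    amounts = node.get("schema:amount", [])
    if len(amounts) != 1:  # sh:minCount 1 and sh:maxCount 1
        violations.append("schema:amount: expected exactly 1 value")
    elif not isinstance(amounts[0], float):  # simplified stand-in for sh:datatype xsd:float
        violations.append("schema:amount: expected a float")
    currencies = node.get("schema:currency", [])
    if len(currencies) != 1:  # sh:minCount 1 and sh:maxCount 1
        violations.append("schema:currency: expected exactly 1 value")
    elif not (isinstance(currencies[0], str) and CURRENCY_PATTERN.search(currencies[0])):
        violations.append("schema:currency: does not match pattern")
    return violations


ok = validate({"schema:amount": [9.99], "schema:currency": ["EUR"]})
bad = validate({"schema:amount": [9.99, 1.0], "schema:currency": ["EURO"], "schema:name": ["x"]})
print(ok)
print(bad)
```

Because the shape sets `sh:flags "i"`, this sketch also accepts lower-case codes such as "eur". Details like this are easy to miss in the textual form, which is part of the motivation for visual support.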
42. Main research question
How can we support users in the assessment and in the
creation of Knowledge Graph restrictions?
43. Users need support
Assessing restrictions using Montolo
Creating restrictions using visual notations
Using restrictions to enable data stewardship
Conclusion
44. The use of restrictions varies in practice
Only subclasses: structured metadata in websites using schema.org
Different restrictions defining formal meaning: Neuro Behavior ontology (NBO)
A program to infer knowledge (a reasoner) needs formal meaning
45. Different types of restrictions are available in RDFS/OWL
only subclasses Different restrictions
defining formal
meaning
Domain
Disjoint
Properties
Literal ranges
Reasoner
46. only subclasses
Different restrictions
defining formal
meaning
But some restriction types come with a high
(computational) complexity … not always needed
Domain
Disjoint
Properties
Literal ranges
Reasoner
47. Reusing ontologies is usually a multi-step process
Discovery of reuse
candidates
Selection of
relevant
ontologies
Customization
and integration
of reused
ontologies
49. Does this vocabulary fit our use case?
Existing statistics do not provide any
information about which restrictions exist
in the vocabulary
50. Currently only a manual assessment
of ontologies, one by one
Ontology documentation pages
created by Widoco
Ontology loaded into the editor tool
Protégé
51. Discover and assess ontologies based on restriction use
Possible ontology reuse candidates
(colors = different restriction
types)
Use case
52. Discover and assess ontologies based on restriction use
Possible ontology reuse candidates
(colors = different restriction
types)
Restriction type
use statistics Use case
54. Created statistics are FAIR
The statistics are described using
Knowledge Graphs
Dataset available via a repository
or consultable via a website
55. How many ontologies use each restriction type?
A few often used
restriction types and a
long tail both
in LOV and BioPortal
Restriction types
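The question on this slide boils down to counting, per restriction type, how many ontologies use it at least once. The following Python sketch illustrates that idea only; it is not the Montolo implementation, and the ontology names and their restriction types are invented.

```python
from collections import Counter

# Hypothetical input: ontology name -> restriction types found in it.
ontologies = {
    "ontology_a": {"subclass", "domain", "range"},
    "ontology_b": {"subclass"},
    "ontology_c": {"subclass", "domain", "qualifiedCardinality"},
}

# Count each restriction type once per ontology that uses it.
usage = Counter(rt for types in ontologies.values() for rt in types)

# Sort by descending count, then by name, so the ranking is deterministic.
ranking = sorted(usage.items(), key=lambda item: (-item[1], item[0]))
for restriction_type, count in ranking:
    print(f"{restriction_type}: used by {count} ontologies")
```

With real data, such a ranking would show the pattern described on the slide: a few frequently used restriction types followed by a long tail.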
56. Negligible number of literal value restrictions
Almost no literalRanges
restrictions
literalPattern not used
at all
57. Property and cardinality restrictions in the tail
Tail mostly consists of
property-based and
cardinality-based
restrictions expressed
using OWL terms
58. LOV vs BioPortal: qualified cardinalities
Qualified cardinalities
preferred in BioPortal
ontologies
59. LOV vs BioPortal: unqualified cardinalities
Unqualified cardinalities
preferred in LOV
ontologies
61. Domain and range used less in BioPortal
More domain/range
Restrictions in LOV
62. Commonly used constraint types and unused potential
Data shapes are relatively
new, here we could only
investigate 19 data sources
63. Besides assessment support we learned from the statistics
and we can ask more questions
Only half of the ontologies use OWL-based axioms
Little attention for literal values
Caution with editing tools: risk of a self-fulfilling prophecy
64. Users need support
Assessing restrictions using Montolo
Creating restrictions using visual notations
Using restrictions to enable data stewardship
Conclusion
65. Creating constraints
USER
schema:DatedMoneySpecification
    rdf:type sh:NodeShape ;
    sh:closed "true"^^xsd:boolean ;
    sh:ignoredProperties (
        rdf:type
    ) ;
    sh:property [
        sh:path schema:amount ;
        sh:datatype xsd:float ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path schema:currency ;
        rdfs:comment "The currency code (here) is a mandatory property consisting of three upper-case letters" ;
        sh:datatype xsd:string ;
        sh:flags "i" ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:pattern "^[A-Z]{3}$" ;
    ] ;
What users want
is visual support!
What users get!
66. Different constraint types need to be visualized
USER
Or
Disjoint
Not
What users want
is visual support!
67. Existing tools do not specify how to visualize
all SHACL core constraints
USER
Or
Disjoint
Existing visual tools
68. Based on existing cognitive theories and experiments
we can define how to systematically visualize constraint types
USER
Or
Disjoint
Moody, Daniel. "The ‘physics’ of
notations: toward a scientific basis for
constructing visual notations in software
engineering." IEEE Transactions on
software engineering 35.6 (2009): 756-
779.
70. Chapter: Constraint creation
How can we support users familiar with Linked Data
in viewing RDF constraints?
Users familiar with Linked Data
can answer questions about
visually represented RDF constraints
more accurately with a VOWL-based visual notation
than with a UML-based visual notation
72. Compare visual notations in a user study with 12 participants
Two visual notations
to visualize the same
semantic constructs
Test case    Group 1    Group 2
Test case 1  ShapeUML   ShapeVOWL
Test case 2  ShapeVOWL  ShapeUML
Test case 3  ShapeUML   ShapeVOWL
Test case 4  ShapeVOWL  ShapeUML
Pre assessment (social demographics + skills)
Main questionnaire to assess
accuracy of answers to provided questions
Post assessment (opinion)
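The assignment of notations to groups alternates per test case, so that each group sees both notations equally often. A minimal Python sketch that reproduces the table on this slide follows; the function name `assignment` is hypothetical.

```python
NOTATIONS = ("ShapeUML", "ShapeVOWL")


def assignment(number_of_test_cases=4):
    """Return rows of (test case, notation for group 1, notation for group 2)."""
    rows = []
    for index in range(number_of_test_cases):
        group_1 = NOTATIONS[index % 2]        # group 1 starts with ShapeUML
        group_2 = NOTATIONS[(index + 1) % 2]  # group 2 always gets the other notation
        rows.append((f"Test case {index + 1}", group_1, group_2))
    return rows


for row in assignment():
    print(*row)
```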
75. Besides having 2 new visual notations,
we gained new qualitative insights!
Space efficient representation using ShapeUML
Good to have several notations because of familiarity bias
Visual features are important and can also improve ShapeUML
76. Users need support
Assessing restrictions using Montolo
Creating restrictions using visual notations
Using restrictions to enable data stewardship
Conclusion
78. Valuable information in archived records
Historic government records or early climate data,
e.g. demographics or taxes on crop yields
Invaluable data loss
NASA is unable to locate the original high quality
moon landing video.
How about 21st century data?
Social media content influences the real world,
what if Twitter and Co are gone?
Historical records
Moon landing in the 1960s
The web and social media
79. BESOCIAL: a cross-institutional research project to
develop a social media archiving strategy for Belgium
Follow up of a project for
general web archiving
Led by the Royal
Library of Belgium
Research partners
with different
expertise
Funded by the Belgian
Science Policy Office
84. Knowledge Graph-based workflow for data stewardship
Society
#meToo
#IchBinHanna
Data
format
A
Data
format
B
Heterogeneous data sources Knowledge Graph Views on the data in different formats
85. Quality is use-case specific and can be
systematically defined and measured
For example quality dimension
“Rich collection description”
86. Quality Assessment using
Knowledge Graphs and restrictions
40 user stories such as “As an archive-user, I want to see
descriptive information about the collection from the archivist,
so I can assess if the content is relevant to me.”
Derive quality requirements such as “The description of
each collection should at least have 200 characters”
Metric: Missing collection description
Metric: Number of missing descriptions
Metric: Insufficient collection description
Metric: Number of insufficient descriptions
Report
Quality
Assessment
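The four metrics on this slide can be sketched in Python as follows. This is an illustration under assumptions, not the BESOCIAL implementation: the function name `assess_descriptions`, the dictionary layout and the sample collections are invented, and only the threshold of 200 characters comes from the slide.

```python
MIN_LENGTH = 200  # threshold taken from the quality requirement on the slide


def assess_descriptions(collections):
    """Compute the four metrics for a mapping of collection name -> description (or None)."""
    missing = sorted(name for name, text in collections.items() if not text)
    insufficient = sorted(
        name for name, text in collections.items() if text and len(text) < MIN_LENGTH
    )
    return {
        "missing": missing,
        "number_missing": len(missing),
        "insufficient": insufficient,
        "number_insufficient": len(insufficient),
    }


# Hypothetical collections, invented for illustration only.
example = {
    "collection_a": None,
    "collection_b": "Too short.",
    "collection_c": "x" * 250,
}
report = assess_descriptions(example)
print(report)
```

A report of this kind lists which collections violate the requirement and how many do, which corresponds to the pairs of metrics named above.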
88. A Knowledge Graph and restrictions supported data
stewardship for social media archiving
Providing an integrated view on the data (with formal
meaning)
Assisted in an automated quality assessment by using constraints
The workflow is generalizable and thus helpful in other use cases
89. Users need support
Assessing restrictions using Montolo
Creating restrictions using visual notations
Using restrictions to enable data stewardship
Conclusion
90. How can we support users in the assessment and in the
creation of Knowledge Graph restrictions?
Montolo statistics support restriction assessments
with FAIR data which was not possible before
We can rethink the value we give to restrictions: why and
how do we use restrictions systematically?
91. How can we support users in the assessment and in the
creation of Knowledge Graph restrictions?
There are now 2 visual notations covering all
SHACL core constraints
First steps to make Knowledge Graph constraints more
accessible to domain experts
92. How can we support users in the assessment and in the
creation of Knowledge Graph restrictions?
The BESOCIAL use case demonstrated the use of restrictions
to tackle data stewardship challenges
The future is less about tools and more about workflows
and data!
93. A circle representing the human knowledge
“The illustrated guide to a Ph.D.” - Matt Might
94. Little knowledge after elementary school
“The illustrated guide to a Ph.D.” - Matt Might
95. More knowledge after high school
“The illustrated guide to a Ph.D.” - Matt Might
96. Gaining speciality with the Bachelor’s degree
“The illustrated guide to a Ph.D.” - Matt Might
97. Deepen speciality with the Master’s degree
“The illustrated guide to a Ph.D.” - Matt Might
98. Reading research papers takes you to the edge of human knowledge
“The illustrated guide to a Ph.D.” - Matt Might
99. You focus at the boundary
“The illustrated guide to a Ph.D.” - Matt Might
100. You focus at the boundary for a few years
“The illustrated guide to a Ph.D.” - Matt Might
101. One day the boundary gives way
“The illustrated guide to a Ph.D.” - Matt Might
102. The dent you have made is called a PhD
“The illustrated guide to a Ph.D.” - Matt Might
103. The world looks different to you now
“The illustrated guide to a Ph.D.” - Matt Might
104. Don’t forget the bigger picture
“The illustrated guide to a Ph.D.” - Matt Might
105. Newly raised questions: future work
Montolo provides metrics, but what about higher-level
dimensions, tools and their usability?
Why and how are restrictions used in the first place?
How do we build our future Knowledge Graphs from a
methodological point of view?
106. Questions & Answers
Dissertation available as PDF at https://sven-lieber.org/phd
SvenLieber sven-lieber.org
knows.idlab.ugent.be