Panos Alexopoulos
Data and Knowledge Technologies
Professional
http://www.panosalexopoulos.com
p.alexopoulos@gmail.com
@PAlexop
How many truths can you handle?
Strategies and techniques for handling vagueness in
conceptual data models
Talk Identity
● Conceptual Data Models
● Semantics
● Semantic gap
● Vagueness
Topics covered
● UNDERSTANDING VAGUENESS: What it is and how it differs from other phenomena
● DETECTING VAGUENESS: Guidelines and (automatic) techniques
● TACKLING VAGUENESS: Approaches and trade-offs
● VAGUENESS RAMIFICATIONS: Why you should care
● MEASURING VAGUENESS: Metrics and methods
Understanding
Vagueness
What it is and what it is not
The Sorites Paradox
● 1 grain of wheat does not make a heap.
● If 1 grain doesn’t make a heap, then 2 grains
don’t.
● If 2 grains don’t make a heap, then 3 grains
don’t.
● …
● If 999,999 grains don’t make a heap, then 1
million grains don’t.
● Therefore, 1 million grains don’t make a
heap!
What is vagueness
“Vagueness is a semantic
phenomenon where predicates admit
borderline cases, namely cases where
it is not determinately true that the
predicate applies or not”
—Shapiro 2006
What is not vagueness
● AMBIGUITY: E.g., “Last week I visited Tripoli”
● INEXACTNESS: E.g., “My height is between 165 and 175 cm”
● UNCERTAINTY: E.g., “The temperature in Amsterdam right now might be 15 degrees”
Vagueness Types
● QUANTITATIVE: Borderline cases stem from the lack of precise boundaries along some measurable dimension (e.g., “Bald”, “Tall”, “Near”)
● QUALITATIVE: Borderline cases stem from not being able to decide which dimensions and conditions are sufficient and/or necessary for the predicate to apply (e.g., “Religion”, “Expert”)
Vagueness
Ramifications
Why should we care
Miscommunication
Disagreements
How would you model this?
Problematic Scenarios
USING
VAGUE DATA
REUSING
VAGUE DATA
INTEGRATING
VAGUE DATA
Detecting
Vagueness
Where and what to look
How to detect vagueness
● Identify which of your data model’s elements are potentially vague
● Investigate whether these elements are
indeed vague.
● Investigate and determine potential
dimensions and applicability contexts.
Where to look
● Classes: E.g. “Tall Person”, “Strategic
Customer”, “Experienced Researcher”
● Relations and attributes: E.g., “hasGenre”,
“hasIdeology”
● Attribute values: E.g., the “price” of a
restaurant could take as values the vague
terms “cheap”, “moderate” and “expensive”
What to look for
● Vague terms in names and definitions
● Disagreements and inconsistencies among
data modelers, domain experts, and data
stewards during model development and
maintenance
● Disagreements and inconsistencies in user
feedback during model application.
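A hint at what an automatic first pass could look like: a minimal sketch that flags candidate vague elements by scanning names and definitions for known vague cue terms. The cue list and the dict-based element format are illustrative assumptions, not part of any standard tooling.

```python
# Minimal heuristic detector for candidate vague elements (sketch only).
# The cue list and the dict-based element format are illustrative assumptions.
VAGUE_CUES = {"tall", "strategic", "experienced", "expert", "cheap",
              "moderate", "expensive", "near", "low", "high"}

def candidate_vague_elements(elements):
    """Flag elements whose name or definition mentions a vague cue term."""
    flagged = []
    for element in elements:
        words = set(f"{element['name']} {element.get('definition', '')}".lower().split())
        if words & VAGUE_CUES:
            flagged.append(element["name"])
    return flagged

model = [
    {"name": "Strategic Customer", "definition": "A customer of high value"},
    {"name": "Invoice", "definition": "A document requesting payment"},
]
print(candidate_vague_elements(model))  # ['Strategic Customer']
```

Anything flagged this way is only a candidate; the step of investigating whether the element is indeed vague still applies.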
Examples from WordNet

Vague senses:
● Yellowish: of the color intermediate between green and orange in the color spectrum; of something resembling the color of an egg yolk
● Impenitent: impervious to moral persuasion
● Notorious: known widely and usually unfavorably

Non-vague senses:
● Compound: composed of more than one part
● Biweekly: occurring every two weeks
● Outermost: situated at the farthest possible point from a center
Examples from the Citation Ontology

Vague relations:
● plagiarizes: A property indicating that the author of the citing entity plagiarizes the cited entity, by including textual or other elements from the cited entity without formal acknowledgement of their source.
● citesAsAuthority: The citing entity cites the cited entity as one that provides an authoritative description or definition of the subject under discussion.
● supports: The citing entity provides intellectual or factual support for statements, ideas or conclusions presented in the cited entity.

Non-vague relations:
● sharesAuthorInstitutionWith: Each entity has at least one author that shares a common institutional affiliation with an author of the other entity.
● retracts: The citing entity constitutes a formal retraction of the cited entity.
● includesExcerptFrom: The citing entity includes one or more excerpts from the cited entity.
Measuring
Vagueness
Key metrics
Vagueness spread
● The ratio of model elements (classes,
relations, datatypes, etc) that are vague
● A data model with a high vagueness spread is less explicit and shareable than one with a low spread.
Vagueness intensity
● The degree to which the model’s users disagree
on the validity of the (potential) instances of the
elements.
● The higher this disagreement is for an element,
the more problems the element is likely to cause.
● Calculation:
○ Consider a sample set of vague element
instances
○ Have human judges denote whether and to
what extent they believe these instances are
valid
○ Measure the inter-rater agreement between the judges (e.g., using Cohen’s kappa)
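A minimal sketch of that calculation for two judges, using scikit-learn’s cohen_kappa_score; the verdicts are made up, and treating 1 - kappa as an intensity score is just one simple, illustrative choice, not a prescribed formula.

```python
# Sketch: estimating vagueness intensity for one vague element from two judges'
# verdicts on a sample of candidate instances (data is illustrative).
from sklearn.metrics import cohen_kappa_score

# 1 = "this is a valid instance of the element", 0 = "it is not"
judge_a = [1, 1, 0, 1, 0, 0, 1, 1]
judge_b = [1, 0, 0, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(judge_a, judge_b)  # agreement corrected for chance
intensity = 1 - kappa                        # low agreement -> high intensity (illustrative)
print(f"kappa={kappa:.2f}, intensity={intensity:.2f}")  # kappa=0.25, intensity=0.75
```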
Tackling
Vagueness
Approaches and trade-offs
Three (complementary)
techniques
VAGUENESS
AWARENESS
TRUTH
CONTEXTUALIZATION
TRUTH
FUZZIFICATION
Vagueness-aware data models
Data models whose vague elements
are accompanied by meta-information
that describes the nature and
characteristics of their vagueness in
an explicit way.
What to make explicit
● VAGUENESS EXISTENCE: E.g., “Tall Person” is vague and “Adult” is non-vague
● VAGUENESS TYPE: E.g., “Low Budget” has quantitative vagueness and “Expert Consultant” qualitative
● VAGUENESS DIMENSIONS: E.g., “Strategic Client” is vague in the dimension of the generated revenue
● VAGUENESS PROVENANCE: E.g., “Strategic Client” is vague in the dimension of the generated revenue according to the Financial Manager
● APPLICABILITY CONTEXTS: E.g., “Strategic Client” is vague in the dimension of the generated revenue in the context of Financial Reporting
A Vagueness Metamodel
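The metamodel figure itself is not reproduced here, but the idea can be sketched in rdflib with hypothetical annotation properties; the vagueness namespace and property names below are illustrative stand-ins, not the actual metamodel vocabulary.

```python
# Sketch: making vagueness meta-information explicit for a class.
# The VAG namespace and its properties are hypothetical, for illustration only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/model/")
VAG = Namespace("http://example.org/vagueness/")

g = Graph()
g.add((EX.StrategicClient, RDF.type, RDFS.Class))
g.add((EX.StrategicClient, VAG.hasVaguenessType, VAG.QuantitativeVagueness))
g.add((EX.StrategicClient, VAG.hasVaguenessDimension, Literal("generated revenue")))
g.add((EX.StrategicClient, VAG.hasVaguenessProvenance, Literal("Financial Manager")))
g.add((EX.StrategicClient, VAG.hasApplicabilityContext, Literal("Financial Reporting")))

print(g.serialize(format="turtle"))
```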
Truth contextualization
● The same statement in the data model can be true
in some contexts and false in other contexts.
● E.g., “Stephen Curry is short” is true in the context
of “Basketball Playing” but false in all others.
● Potential contexts:
○ Cultures
○ Locations
○ Industries
○ Processes
○ Demographics
○ ...
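A toy sketch of what contextualized truth means operationally: the truth value is keyed by statement and context together. The “General Population” context below is just an illustrative stand-in for “any other context”.

```python
# Sketch: truth values indexed by (statement, context) rather than by statement alone.
contextual_truth = {
    ("Stephen Curry is short", "Basketball Playing"): True,
    ("Stephen Curry is short", "General Population"): False,
}

def is_true(statement, context):
    # Returns None when the statement has no recorded value in that context.
    return contextual_truth.get((statement, context))

print(is_true("Stephen Curry is short", "Basketball Playing"))  # True
print(is_true("Stephen Curry is short", "General Population"))  # False
```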
Contextualized poverty
When to contextualize?
● When vagueness intensity is high and consensus
is impossible
● When you are able to identify truth contexts
● When the applications that use the model can actually handle the contexts.
● When contextualization actually manages to reduce disagreements and has a positive effect on the model’s applications.
● When the contextualization benefits outweigh the
context management overhead.
Truth fuzzification
● The basic idea is that we can assign a real number
to a vague statement, within a range from 0 to 1.
○ A value of 1 would mean that the statement
is completely true
○ A value of 0 that it is completely false
○ Any value in between that it is “partly true” to
a given, quantifiable extent.
● For example:
○ “John is an instance of YoungPerson to a
degree of 0.8”
○ “Google hasCompetitor Microsoft to a degree of 0.4”.
● The premise is that fuzzy degrees can reduce the
disagreements around the truth of a vague
statement.
Truth degrees are not probabilities
● A probability statement quantifies the likelihood that events or facts whose truth conditions are well defined will come true
○ e.g., “it will rain tomorrow with a probability of 0.8”
● A fuzzy statement quantifies the extent to which events or facts whose truth conditions are not well defined are perceived as true
○ e.g., “It’s now raining to a degree of 0.6”
● That’s the reason why they are supported by different
mathematical frameworks, namely probability theory
and fuzzy logic
What fuzzification involves
1. Detect and analyze all vague elements in your model
2. Decide how to fuzzify each element
3. Harvest truth degrees
4. Assess fuzzy model quality
5. Represent fuzzy degrees
6. Apply the fuzzy model
Fuzzification options
● The number and kind of fuzzy degrees you
need to acquire for your model’s vague
elements depend on the latter’s vagueness
type and dimensions.
● If your element has quantitative vagueness
in one dimension, then all you need is a
fuzzy membership function that maps
numerical values of the dimension to fuzzy
degrees in the range [0,1]
Fuzzy membership functions
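The membership-function figures are not reproduced here; as a sketch, two common shapes for quantitatively vague elements, with made-up calibrations (170/190 cm for “Tall Person”, a price band for “Moderately priced”).

```python
# Sketch: two typical fuzzy membership function shapes (calibrations are made up).

def ramp_up(x, a, b):
    """0 below a, 1 above b, linear in between (right-shoulder shape, e.g. 'Tall Person')."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def trapezoid(x, a, b, c, d):
    """0 outside [a, d], 1 on [b, c], linear on the slopes (e.g. 'Moderately priced')."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

print(ramp_up(180, 170, 190))         # 0.5 -> 180 cm is "Tall" to degree 0.5
print(trapezoid(28, 10, 20, 35, 50))  # 1.0 -> a 28-euro meal is fully "Moderately priced"
```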
Fuzzification options
● If an element has quantitative vagueness in more than one dimension, then you can either:
○ Define a multivariate fuzzy
membership function
○ Define one membership function per
dimension and then combine these via
some fuzzy logic operation, like fuzzy
conjunction or fuzzy disjunction
Multivariate fuzzy membership function
Fuzzy conjunction and disjunction
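The multivariate and conjunction/disjunction figures are not reproduced here; a sketch of the second option, combining per-dimension degrees with the standard min/max (Gödel) operators, using made-up degrees for a hypothetical two-dimensional “Strategic Client”.

```python
# Sketch: combining per-dimension membership degrees with min/max fuzzy operators.

def fuzzy_and(*degrees):
    """Minimum t-norm: a conjunction is only as true as its weakest conjunct."""
    return min(degrees)

def fuzzy_or(*degrees):
    """Maximum t-conorm: a disjunction is as true as its strongest disjunct."""
    return max(degrees)

revenue_degree = 0.9  # membership along the "generated revenue" dimension
tenure_degree = 0.4   # membership along a hypothetical "relationship length" dimension

print(fuzzy_and(revenue_degree, tenure_degree))  # 0.4
print(fuzzy_or(revenue_degree, tenure_degree))   # 0.9
```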
Fuzzification options
● A third option is to just define one direct degree
per statement.
○ “John is tall to a degree of 0.8”
○ “Maria is expert in data modeling to a degree
of 0.6”
● This approach makes sense when:
○ Your element is vague in too many dimensions and you cannot find a proper membership function, or
○ The element’s vagueness is qualitative and, thus, you have no dimensions to use.
● The drawback is that you will have to harvest a lot
of degrees!
Harvesting truth degrees
● Remember that vague statements provoke disagreements and debates among people, or even between people and systems.
● To generate fuzzy degrees for these statements you essentially need to capture and quantify these disagreements.
● How to capture:
○ Ask people directly
○ Ask people indirectly
○ Mine from data
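A minimal sketch of the “ask people directly” route, with a deliberately simple aggregation (the share of positive verdicts becomes the degree); the votes are illustrative.

```python
# Sketch: turning direct human judgments on a vague statement into a truth degree.

def harvest_degree(judgments):
    """Fraction of judges who accept the statement, used as its fuzzy degree."""
    return sum(judgments) / len(judgments)

# 1 = judge accepts "John is an instance of YoungPerson", 0 = judge rejects it.
votes = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
print(harvest_degree(votes))  # 0.8
```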
Explanation and feedback based harvesting
Multiple fuzzy truths
● Even with fuzzification you may still get disagreements
● This can be an indication of context-dependence
● Different contexts may require different fuzzy
degrees or membership functions
● In other words, contextualization and fuzzification
are orthogonal approaches.
Fuzzy model quality
● Main questions you need to consider:
○ Have I fuzzified the correct elements?
○ Are the truth degrees consistent?
○ Are the truth degrees accurate?
○ Is the provenance of the truth degrees well
documented?
● Both accuracy and consistency are best treated not as binary metrics but rather as distances
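For the accuracy question, a sketch of the “distance rather than binary” idea: compare harvested degrees against a trusted reference set and score by mean absolute distance (both lists are made up).

```python
# Sketch: accuracy of harvested truth degrees as a distance from reference degrees.

def degree_accuracy(harvested, reference):
    """1.0 means a perfect match; lower values mean larger average deviation."""
    distances = [abs(h - r) for h, r in zip(harvested, reference)]
    return 1 - sum(distances) / len(distances)

harvested = [0.8, 0.4, 0.9, 0.1]   # degrees obtained from judges or mining
reference = [0.7, 0.5, 0.9, 0.3]   # e.g. degrees provided by trusted experts
print(degree_accuracy(harvested, reference))  # ~0.9
```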
Fuzzy model representation
● To represent a truth degree for a relation you
simply need to define a relation attribute named
“truth degree” or similar.
● This is straightforward if you work with E-R
models or property graphs, but also possible in
RDF or OWL, even if these languages do not
directly support relation attributes.
● Things can become more difficult when you need
to represent fuzzy membership functions or more
complex fuzzy rules and axioms, along with their
necessary reasoning support.
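A sketch of the RDF case, using standard RDF reification to hang a truth degree on a relation instance; the truthDegree property is hypothetical, and RDF-star or a property graph would make this more direct.

```python
# Sketch: attaching a truth degree to an RDF statement via standard reification.
from rdflib import BNode, Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")

g = Graph()
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Google))
g.add((stmt, RDF.predicate, EX.hasCompetitor))
g.add((stmt, RDF.object, EX.Microsoft))
g.add((stmt, EX.truthDegree, Literal(0.4, datatype=XSD.decimal)))  # hypothetical property

print(g.serialize(format="turtle"))
```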
Fuzzy model application
● This last step might not look like a semantic
modeling task, yet it is a crucial one if you want
your fuzzification effort to pay off
● A fuzzy data model can be helpful in:
○ Semantic tagging and disambiguation
○ Semantic search and match
○ Decision support systems
○ Conversational agents (aka chatbots)
● In all cases, proper design and adaptation of the underlying algorithms is needed
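As a sketch of the “semantic search and match” case: with truth degrees available, a query for instances of a vague class can rank and threshold by degree instead of returning a crisp yes/no set (entities and degrees are illustrative).

```python
# Sketch: degree-aware retrieval over instances of a vague class.
strategic_client_degrees = {
    "Acme Corp": 0.95,
    "Globex": 0.60,
    "Initech": 0.20,
}

def top_matches(degrees, threshold=0.5):
    """Instances whose degree meets the threshold, best matches first."""
    hits = [(name, d) for name, d in degrees.items() if d >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(top_matches(strategic_client_degrees))  # [('Acme Corp', 0.95), ('Globex', 0.6)]
```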
When to fuzzify?
● Questions you need to consider:
○ Which elements in your model are
unavoidably vague?
○ How severe and impactful are the
disagreements you (expect to) get on the
veracity of these vague elements?
○ Are these disagreements caused by
vagueness or other factors?
When to fuzzify?
● Questions you need to consider:
○ If your model’s elements had fuzzy degrees,
would you get less disagreement?
○ Are the applications that use the model able
to exploit and benefit from truth degrees?
○ Can you develop a scalable way to get and
maintain fuzzy degrees that costs less than
the benefits they bring you?
How would you tackle this?
How would you tackle this?
Take Aways

Data and information quality can be negatively affected by vagueness:
● (Perceived) inaccuracy
● Disagreements and misinterpretations
● Reduced semantic interoperability

Treating vagueness as noise doesn’t help:
● It’s how we think and communicate
● Insisting on crispness is unproductive
● But leaving things as-is is also bad

Three complementary weapons to tackle vagueness:
● Make your data models Vagueness-Aware
● Contextualize truth
● Fuzzify truth
Currently writing a book on
semantic data modeling
To be published by O’Reilly in
September 2020
Early release expected on the O’Reilly Learning Platform in December 2019
To get news about the book’s progress and a free preview chapter, send me an email at p.alexopoulos@gmail.com