How hard is this Query? Measuring the Semantic Complexity of Schema-agnostic Queries

André Freitas, Juliano Efson Sales,
Siegfried Handschuh, Edward Curry
IWCS, London 2015

Outline
• Motivation
• Query Semantic Complexity & Entropy
• Entropy Measures
• Validation & Analysis
• Conclusions

Shift in the Database Landscape
• Very large and dynamic “schemas”.
• Before 2000: 10s-100s attributes
• Circa 2015: 1,000s-1,000,000s attributes
Brodie & Liu, 2010

Databases for a Complex World
How do you query data in this scenario?

Schema-agnosticism
Abstraction Layer
Query: Who is the daughter of Bill Clinton?
Data: Bill Clinton -> child -> Chelsea Clinton

Schema-agnostic queries
Query approaches over structured databases which
allow users to satisfy complex information needs
without understanding the representation
(schema) of the database.
Semantic Parsing

Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Quantify the Semantic Gap
Possible representations

Core Questions
• Can we measure the semantic complexity of a
query-DB mapping?
• What defines an “easy” or a “hard” query?
• Which are the best estimators?

Configuration space of semantic matchings
Quantify the Query-DB semantic gap
Not all queries are born equal!
Semantic Complexity & Entropy

Semantic Complexity & Entropy
• Structural/conceptual complexity
• Level of ambiguity/indeterminacy/vagueness
• Terminological gap
• Novelty

Semantic Configuration Space
mΣ(Q,DB)

Semantic Entropy Measures
• Hsyntax
• Hstruct
• Hterm
• Hmatching

In the scope of this work
• "Entropy" here denotes an entropy estimator, i.e. an approximation.

Syntactic Entropy (Hsyntax)
• The syntactic entropy of a query is defined by the
possible syntactic configurations in which a query
can be interpreted under the database syntax.
• Estimate the uncertainty of the translation of the
query into the DB categories (IDB(Q)).
• Is a function of the probability of the syntactic
interpretation of a query.
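
A minimal sketch of what such a function could look like, assuming Hsyntax is taken as the Shannon entropy over the probabilities of the candidate syntactic interpretations. The slides do not give the exact formula; the function name and the example distributions below are hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution.

    `probabilities` holds one value per candidate syntactic
    interpretation of the query (an assumption of this sketch).
    """
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probabilities:
        q = p / total  # renormalize in case the inputs do not sum to 1
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical queries: one with a single interpretation,
# one with four equally likely interpretations.
print(shannon_entropy([1.0]))                     # 0.0 bits
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
```

Under this reading, a query with one dominant interpretation scores near zero, and the score grows as the probability mass spreads over more interpretations.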

Structural Entropy (Hstruct)
• The structural entropy defines the complexity of a
database based on the possible facts that can be
encoded under its schema.
• Pollard & Biermann, A measure of semantic
complexity for natural language systems (2000).

Terminological Entropy (Hterm)
• The terminological entropy focuses on quantifying an
estimate on the amount of ambiguity, synonymy and
vagueness for the query or database terms.
• Translational Entropy (Htrans) as an estimator.
• Melamed, Measuring semantic entropy (1997).
• Translation probability based on parallel corpora.

Matching Entropy (Hmatching)
• Consists of measures which describe the
uncertainty involved in the query-data
matching/alignment between query terms and
dataset entities.
• Provides an estimate based on the set of
potential alignments.
• Distributional entropy (Hdist): Estimator based on
distributional semantic models.
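
The slides do not define Hdist precisely (the Limitations slide notes that it needs a more principled definition). As one possible illustration only, the sketch below scores each candidate alignment by cosine similarity between a query-term vector and a database-entity vector, normalizes the scores into a distribution, and takes its Shannon entropy. The vectors, entity names, and the normalization step are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_entropy(query_vec, candidates):
    """Entropy (bits) over candidate alignments for one query term.

    `candidates` maps a database entity name to its distributional
    vector. Negative similarities are clipped to zero (a choice of
    this sketch, not something stated in the slides).
    """
    scores = [max(cosine(query_vec, vec), 0.0) for vec in candidates.values()]
    if not scores:
        return 0.0
    total = sum(scores)
    if total == 0:
        # No candidate is related at all: treat as uniform uncertainty.
        return math.log2(len(scores))
    entropy = 0.0
    for s in scores:
        p = s / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


# Toy 2-d vectors, purely illustrative.
query = [1.0, 0.0]
clear_match = {"dbpedia:spouse": [1.0, 0.0], "dbpedia:children": [0.0, 1.0]}
ambiguous = {"dbpedia:spouse": [1.0, 0.0], "dbpedia:children": [1.0, 0.0]}
print(matching_entropy(query, clear_match))
print(matching_entropy(query, ambiguous))
```

A single clearly best candidate gives low entropy; several equally related candidates give high entropy, which is the situation where alignment is most uncertain.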

Query Features as Complexity Estimators
• Query features (reference to data model/query
operator categories).
– Contains instance reference (named entities)
– Contains class reference
– Contains complex class reference
– Contains property
– Contains value
– Yes/No question
– Contains operator

Experimental Set-up
• Question Answering over Linked Data Test
Collection (Unger et al. 2011).
• QALD 2011 & 2012.
• 150 natural language queries over DBpedia
(RDF).
Dataset (DBpedia + YAGO classes):
45,768 properties
288,316 classes
9,434,677 instances
128,071,259 triples

Experimental Set-up
• Linear regression between each entropy
measure and the f-measure of the
participating QA systems.
• 4 QA systems:
– QALD 2011: PowerAqua, Freya (κ = 0.501, 95% confidence
interval, ‘moderate’ agreement).
– QALD 2012: QAKis, MHE (κ= 0.236, 95% confidence
interval, ‘fair’ agreement).
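
A sketch of the kind of fit described above: ordinary least squares of f-measure against one entropy measure, plus the Pearson correlation. The data points are invented for illustration and are not results from the QALD evaluation; the function name is hypothetical.

```python
import math


def linear_regression(xs, ys):
    """Least-squares fit y = slope * x + intercept, with Pearson r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both samples must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up values: one entropy score and one f-measure per query.
entropy_per_query = [0.0, 1.0, 2.0, 3.0]
f_measure_per_query = [0.9, 0.7, 0.5, 0.3]

slope, intercept, r = linear_regression(entropy_per_query, f_measure_per_query)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```

A negative slope and negative r would correspond to the "(-)" direction: higher entropy going with lower f-measure.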

1st Analysis
• Linear regression model.
• Hsyntax, Hterm (Htrans), Hmatching (Hdist) and Hstruct

1st Analysis
• Higher correlation:
– Hsyntax (-)
– Hterm (Htrans) (-)
– Hmatching (Hdist) (-)
• Lower correlation:
– Hstruct

2nd Analysis
• Query features (reference to data model/query
operator categories).
– Contains instance reference (named entities)
– Contains class reference
– Contains complex class reference
– Contains property
– Contains value
– Yes/No question
– Contains operator

2nd Analysis
• Linear regression model.

2nd Analysis
• Higher correlation:
– References to instances (+)
– Presence of operators (-)
– Presence of complex classes (complex nominals) (-)

3rd Analysis
• Classification of the query-DB
terminological gap for each data
model category.

3rd Analysis
Lower terminological gap
Higher terminological gap

Query Classification
• % of unanswered questions:
– Syntactic complexity (Hsyntax): 51.7%
– Vocabulary gap (Hmatching, Hterm): 68.9%
– No reference to instance (named entity) (Hstruct, Hterm): 20.6%

Limitations
• Validation of the regression model in a
different test collection.
• Distributional entropy needs a more
principled definition.

Minimizing Semantic Entropy
Reflections on the Design of Schema-agnostic Query Mechanisms
Or ...

Minimizing the Semantic Entropy for
the Semantic Matching
Definition of a semantic pivot: first query term to
be resolved in the database.
• Maximizes the reduction of the semantic
configuration space (Hstruct, Hmatching).

Semantic Pivots (Hstruct, Hmatching)
• Who is the daughter of Bill Clinton married to?
[Figure: counts shown for the query: 437; 100,184; 62,781; > 4,580,000. Database entities shown: dbpedia:spouse, dbpedia:children, :Bill_Clinton]

Definition of a semantic pivot: first query term
to be resolved in the database.
• Less prone to more complex synonymic
expressions and abstraction-level differences
(Hterm, Hmatching).

Semantic Pivots
• Proper nouns tend to have a high percentage of string
overlap across synonymic expressions.
– William Jefferson Clinton, Bill Clinton, William J. Clinton
– T. E. Lawrence, Thomas Edward Lawrence, Lawrence of Arabia
Who is the daughter of Bill Clinton married to?

Definition of a semantic pivot: first query term to be
resolved in the database.
• Less prone to more complex synonymic expressions
and abstraction-level differences (Hterm, Hmatching).
• Proper nouns >> nouns >> complex nominals >>
adjectives, verbs.
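
The preference order above can be sketched as a simple ranking over query terms. This is an illustration under stated assumptions, not the authors' implementation: the category labels, the `choose_pivot` name, and the hand-tagged query terms are hypothetical, and a real system would obtain the categories from a parser or tagger.

```python
# Lower rank = preferred as pivot, following
# proper nouns >> nouns >> complex nominals >> adjectives, verbs.
PIVOT_RANK = {
    "proper_noun": 0,
    "noun": 1,
    "complex_nominal": 2,
    "adjective": 3,
    "verb": 3,
}


def choose_pivot(tagged_terms):
    """Return the query term to resolve first in the database.

    `tagged_terms` is a list of (term, category) pairs. On ties the
    earliest term in the query wins, because min() keeps the first
    minimal element.
    """
    if not tagged_terms:
        raise ValueError("query has no candidate terms")
    term, _category = min(tagged_terms, key=lambda pair: PIVOT_RANK[pair[1]])
    return term


# "Who is the daughter of Bill Clinton married to?" (tagged by hand)
query_terms = [
    ("daughter", "noun"),
    ("Bill Clinton", "proper_noun"),
    ("married", "verb"),
]
print(choose_pivot(query_terms))  # Bill Clinton
```

Resolving "Bill Clinton" first restricts the remaining matching to entities connected to that instance, which is the reduction of the configuration space the pivot is meant to achieve.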

Semantic Matching
• Hsyntax is a strong estimator of query
complexity.
• Hmatching can be used as an estimator for the
quality of the predicate alignment.
• Hterm can be used as a heuristic for matching
complexity.

Conclusions
• Both entropy (Hsyntax, Hterm, Hmatching) and query features
(instances, complex classes, operators) can be used as
estimators for query semantic complexity.
• This can be incorporated as heuristics into
schema-agnostic query planning approaches (or approximate
semantic parsing) to maximize semantic matching
probabilities.
• Need for the construction of better semantic entropy
estimators.
