1) The document discusses relational databases and the query language Datalog. It provides examples of flight data stored in relations and queried using relational algebra operations.
2) Complex conjunctive queries over the flight data relations are built up step-by-step using relational joins and selections to find which airlines fly directly from London to Glasgow.
3) The document introduces Datalog and its extensions for representing ontological knowledge and reasoning over semantic web databases.
Datalog and its Extensions for Semantic Web Databases
1. Datalog and its Extensions for Semantic Web Databases
Georg Gottlob¹, Giorgio Orsi¹, Andreas Pieris¹, Mantas Šimkus²
¹ Department of Computer Science, University of Oxford, UK
² Institute of Information Systems, Vienna University of Technology, Austria
Reasoning Web Summer School, Vienna, Austria, September 3 - 8, 2012
2. Structure of the Lecture
• Relational Databases and Datalog
• Complexity of Datalog
• Datalog and Ontological Reasoning
• Datalog±
4. Relational Databases
Predominant technology for data storage and processing
"On the fly" example:
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
[Map: Edinburgh, Glasgow, London, Larnaca, Vienna]
5. Relational Databases (Terminology)
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
Constants (from a domain): VIE, LHR, …; BA, U2, OS; Vienna, London, …
6. Relational Databases (Terminology)
Relations:
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
Constants (from a domain): VIE, LHR, …; BA, U2, OS; Vienna, London, …
7. Relational Databases (Terminology)
Relations:
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
Tuples: the rows of a relation
Constants (from a domain): VIE, LHR, …; BA, U2, OS; Vienna, London, …
Relational atoms: Flight(LHR,EDI,BA), Airport(LGW,London)
8. Querying: Relational Algebra
List all the airlines
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
Π_airline(Flight) = {BA, U2, OS}
9. Querying: Relational Algebra
List the codes of the airports in London
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
Π_code(σ_{city = "London"}(Airport)) = {LHR, LGW}
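The two queries above can be sketched in Python. This is only an illustration: relations are modelled as sets of tuples, and the helper names `select` and `project` are assumptions made for this sketch, not part of the lecture.

```python
# Relations as sets of tuples, with attribute names kept alongside.
FLIGHT_ATTRS = ("origin", "destination", "airline")
FLIGHT = {
    ("VIE", "LHR", "BA"),
    ("LHR", "EDI", "BA"),
    ("LGW", "GLA", "U2"),
    ("LCA", "VIE", "OS"),
}
AIRPORT_ATTRS = ("code", "city")
AIRPORT = {
    ("VIE", "Vienna"), ("LHR", "London"), ("LGW", "London"),
    ("LCA", "Larnaca"), ("GLA", "Glasgow"), ("EDI", "Edinburgh"),
}


def select(rows, attrs, attr, value):
    """Selection: keep the tuples whose attribute `attr` equals `value`."""
    i = attrs.index(attr)
    return {row for row in rows if row[i] == value}


def project(rows, attrs, keep):
    """Projection: keep only the attributes in `keep` (the set drops duplicates)."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}


airlines = project(FLIGHT, FLIGHT_ATTRS, ["airline"])
london_codes = project(
    select(AIRPORT, AIRPORT_ATTRS, "city", "London"), AIRPORT_ATTRS, ["code"]
)
print(sorted(airlines))
print(sorted(london_codes))
```

Because the result of `project` is a set, BA appears once even though it occurs in two Flight tuples, which matches the set-based answer on the slide.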
10. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
T1 ← σ_{city = "London"}(Airport)
T2 ← σ_{city = "Glasgow"}(Airport)
11. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
T1
code  city
LHR   London
LGW   London
T2
code  city
GLA   Glasgow
T1 ← σ_{city = "London"}(Airport)
T2 ← σ_{city = "Glasgow"}(Airport)
12. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
T1
code  city
LHR   London
LGW   London
T2
code  city
GLA   Glasgow
T3 ← Flight ⋈_{origin = code} T1
13. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
T3
origin  destination  airline  code  city
LHR     EDI          BA       LHR   London
LGW     GLA          U2       LGW   London
T2
code  city
GLA   Glasgow
T3 ← Flight ⋈_{origin = code} T1
14. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
T3
origin  destination  airline  code  city
LHR     EDI          BA       LHR   London
LGW     GLA          U2       LGW   London
T2
code  city
GLA   Glasgow
T4 ← T3 ⋈_{destination = code} T2
15. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
T4
origin  destination  airline  code  city    code  city
LGW     GLA          U2       LGW   London  GLA   Glasgow
T4 ← T3 ⋈_{destination = code} T2
16. Querying: Relational Algebra
Which airlines fly directly from London to Glasgow?
T4
origin  destination  airline  code  city    code  city
LGW     GLA          U2       LGW   London  GLA   Glasgow
Π_airline(T4)
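The step-by-step construction of T1 to T4 can be replayed in Python. This is a minimal sketch under stated assumptions: rows are dictionaries, the helper names (`select`, `theta_join`, `project`) are invented for illustration, and the joined relation's attributes get a prefix (`from_`, `to_`) so that the two `code` and `city` columns do not clash.

```python
FLIGHT = [
    {"origin": "VIE", "destination": "LHR", "airline": "BA"},
    {"origin": "LHR", "destination": "EDI", "airline": "BA"},
    {"origin": "LGW", "destination": "GLA", "airline": "U2"},
    {"origin": "LCA", "destination": "VIE", "airline": "OS"},
]
AIRPORT = [
    {"code": "VIE", "city": "Vienna"}, {"code": "LHR", "city": "London"},
    {"code": "LGW", "city": "London"}, {"code": "LCA", "city": "Larnaca"},
    {"code": "GLA", "city": "Glasgow"}, {"code": "EDI", "city": "Edinburgh"},
]


def select(rows, attr, value):
    """Selection: rows whose attribute `attr` equals `value`."""
    return [r for r in rows if r[attr] == value]


def theta_join(left, right, left_attr, right_attr, prefix):
    """Join on left[left_attr] == right[right_attr]; right attributes are prefixed."""
    return [
        {**l, **{prefix + k: v for k, v in r.items()}}
        for l in left
        for r in right
        if l[left_attr] == r[right_attr]
    ]


def project(rows, attr):
    """Projection onto a single attribute, as a set of values."""
    return {r[attr] for r in rows}


T1 = select(AIRPORT, "city", "London")
T2 = select(AIRPORT, "city", "Glasgow")
T3 = theta_join(FLIGHT, T1, "origin", "code", "from_")
T4 = theta_join(T3, T2, "destination", "code", "to_")
answer = project(T4, "airline")
print(answer)
```

Each intermediate list corresponds to one of the tables shown on the slides: T3 has two rows (the flights leaving LHR and LGW) and T4 has one.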
17. Querying: FOL and SQL Representation
List all the airlines
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
FOL Representation:
∃X∃Y Flight(X,Y,Z)   (Z is free)
SQL Representation:
SELECT airline
FROM Flight
18. Querying: FOL and SQL Representation
List the codes of the airports in London
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
FOL Representation:
Airport(X,London)   (X is free)
SQL Representation:
SELECT code
FROM Airport
WHERE city = 'London'
19. Querying: FOL and SQL Representation
Which airlines fly directly from London to Glasgow?
Flight
origin  destination  airline
VIE     LHR          BA
LHR     EDI          BA
LGW     GLA          U2
LCA     VIE          OS
Airport
code  city
VIE   Vienna
LHR   London
LGW   London
LCA   Larnaca
GLA   Glasgow
EDI   Edinburgh
FOL Representation:
∃X∃Y Airport(X,London) ∧ Airport(Y,Glasgow) ∧ Flight(X,Y,Z)   (Z is free)
SQL Representation:
SELECT F.airline
FROM Flight AS F, Airport AS A1, Airport AS A2
WHERE A1.code = F.origin AND
      A2.code = F.destination AND
      A1.city = 'London' AND
      A2.city = 'Glasgow'
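The SQL query above can be tried against the example tables with Python's standard `sqlite3` module. This is a sketch, not part of the lecture: the in-memory database, the `TEXT` column types and the variable names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Flight (origin TEXT, destination TEXT, airline TEXT);
    CREATE TABLE Airport (code TEXT, city TEXT);
    INSERT INTO Flight VALUES
        ('VIE', 'LHR', 'BA'), ('LHR', 'EDI', 'BA'),
        ('LGW', 'GLA', 'U2'), ('LCA', 'VIE', 'OS');
    INSERT INTO Airport VALUES
        ('VIE', 'Vienna'), ('LHR', 'London'), ('LGW', 'London'),
        ('LCA', 'Larnaca'), ('GLA', 'Glasgow'), ('EDI', 'Edinburgh');
    """
)

# Which airlines fly directly from London to Glasgow?
query = """
SELECT F.airline
FROM Flight AS F, Airport AS A1, Airport AS A2
WHERE A1.code = F.origin
  AND A2.code = F.destination
  AND A1.city = 'London'
  AND A2.city = 'Glasgow'
"""
airlines = [row[0] for row in conn.execute(query)]
print(airlines)

# "List all the airlines" without DISTINCT returns one row per flight.
all_rows = [row[0] for row in conn.execute("SELECT airline FROM Flight")]
print(sorted(all_rows))
```

One difference from the relational algebra and FOL versions is worth noting: SQL works on bags, so `SELECT airline FROM Flight` returns BA twice, and `SELECT DISTINCT airline` would be needed to get the set {BA, U2, OS}.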
20. What we Cannot Ask?
Is Glasgow reachable from Vienna?
[Map: Edinburgh, Glasgow, London, Larnaca, Vienna]
Looks like this FOL query can do the job…
∃X∃Y∃Z∃W∃V Airport(X,Vienna) ∧ Airport(Y,Glasgow) ∧ Flight(X,Z,W) ∧ Flight(Z,Y,V)
YES
21. What we Cannot Ask?
Is Glasgow reachable from Vienna?
[Map: Edinburgh, Glasgow, London, Intermediate, Larnaca, Vienna]
Looks like this FOL query can do the job…
∃X∃Y∃Z∃W∃V Airport(X,Vienna) ∧ Airport(Y,Glasgow) ∧ Flight(X,Z,W) ∧ Flight(Z,Y,V)
NO
22. What we Cannot Ask?
Is Glasgow reachable from Vienna?
[Map: Edinburgh, Glasgow, London, Intermediate, Larnaca, Vienna]
• List all the pairs of airports (A1,A2) such that A2 is reachable from A1
• Is there a pair (A1,A2) such that A1 is in Vienna and A2 is in Glasgow?
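The limitation can be illustrated with a small Python sketch. Everything concrete here is hypothetical: the two flight sets, the airport code XXX standing for the extra intermediate stop, and the simplification of the query to airport codes instead of cities. The point is only that a query with a fixed number of Flight atoms checks paths of one fixed length, while reachability needs paths of any length.

```python
def two_hop(flights, src, dst):
    """Fixed-length check: Flight(src, Z) and Flight(Z, dst) for some Z."""
    return any(
        a == src and d == dst and b == c
        for (a, b) in flights
        for (c, d) in flights
    )


def reachable(flights, src, dst):
    """Reachability over paths of any length (at least one flight)."""
    seen, frontier = set(), {src}
    while frontier:
        # Airports one flight away from the current frontier, not yet visited.
        nxt = {b for (a, b) in flights if a in frontier} - seen
        if dst in nxt:
            return True
        seen |= nxt
        frontier = nxt
    return False


# Hypothetical route maps: one stop, then the same trip with an extra stop.
one_stop = {("VIE", "LHR"), ("LHR", "GLA")}
extra_stop = {("VIE", "XXX"), ("XXX", "LHR"), ("LHR", "GLA")}

print(two_hop(one_stop, "VIE", "GLA"))      # the fixed query succeeds
print(two_hop(extra_stop, "VIE", "GLA"))    # the same query now fails
print(reachable(extra_stop, "VIE", "GLA"))  # the iterative check still succeeds
```

Adding a third Flight atom to the query would fix this particular map but fail on the next longer route, which is why the slides turn to recursion.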
23. What we Cannot Ask?
Flight(origin, destination, airline)   Airport(code, city)
• List all the pairs of airports (A1,A2) such that A2 is reachable from A1
Reachable(X,Y) ← Flight(X,Y,Z)
Reachable(X,W) ← Flight(X,Y,Z), Reachable(Y,W)
• Is there a pair (A1,A2) such that A1 is in Vienna and A2 is in Glasgow?
Ans() ← Airport(X,Vienna), Airport(Y,Glasgow), Reachable(X,Y)
24. What we Cannot Ask?
Flight(origin, destination, airline)   Airport(code, city)
• List all the pairs of airports (A1,A2) such that A2 is reachable from A1
Reachable(X,Y) ← Flight(X,Y,Z)
Reachable(X,W) ← Flight(X,Y,Z), Reachable(Y,W)
RECURSION
RA, FOL, SQL not enough
25. What we Cannot Ask?
Flight(origin, destination, airline)   Airport(code, city)
• List all the pairs of airports (A1,A2) such that A2 is reachable from A1
Reachable(X,Y) ← Flight(X,Y,Z)
Reachable(X,W) ← Flight(X,Y,Z), Reachable(Y,W)
DATALOG
Select-Project-Join + Recursion
26. Datalog at First Glance
Transitive closure of a graph:
TrClosure(X,Y) ← Graph(X,Y)
TrClosure(X,Y) ← Graph(X,Z), TrClosure(Z,Y)
[Graph: A → B → C → D]
27. Datalog at First Glance
Transitive closure of a graph:
TrClosure(X,Y) ← Graph(X,Y)
TrClosure(X,Y) ← Graph(X,Z), TrClosure(Z,Y)
Graph
A  B
B  C
C  D
TrClosure
A  B
A  C
A  D
B  C
B  D
C  D
[Graph: A → B → C → D]
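The TrClosure table can be reproduced by applying the two rules repeatedly until nothing new is derived. The following Python sketch assumes a naive evaluation loop; the function name `transitive_closure` is chosen for illustration only.

```python
GRAPH = {("A", "B"), ("B", "C"), ("C", "D")}


def transitive_closure(graph):
    # Rule 1: TrClosure(X,Y) <- Graph(X,Y)
    closure = set(graph)
    while True:
        # Rule 2: TrClosure(X,Y) <- Graph(X,Z), TrClosure(Z,Y)
        derived = {
            (x, y)
            for (x, z) in graph
            for (z2, y) in closure
            if z == z2
        }
        if derived <= closure:  # nothing new: a fixpoint has been reached
            return closure
        closure |= derived


tr_closure = transitive_closure(GRAPH)
print(sorted(tr_closure))
```

On the chain A → B → C → D the loop stops after three rounds and yields the six pairs listed in the table.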
28. Datalog at First Glance
• Semantics - a mapping from instances over body-relations to instances over head-relations
Graph
A  B
B  C
C  D
TrClosure
A  B
A  C
A  D
B  C
B  D
C  D
• Equivalent approaches for defining the semantics:
  • Model-Theoretic - logical sentences asserting a property of the result
  • Fixpoint - solution of a fixpoint equation
  • Proof-Theoretic - obtaining proofs of facts
29. Datalog
• Formal Syntax
• Fixpoint Semantics
• Complexity Results
Note: For details on the model-theoretic and proof-theoretic semantics see:
- Foundations of Databases by Abiteboul, Hull and Vianu
- Logic Programming and Databases by Ceri, Gottlob and Tanca
30. Syntax of Datalog
A Datalog rule is an expression
R0(X0) ← R1(X1),…,Rn(Xn)
(R0(X0) is the head; R1(X1),…,Rn(Xn) is the body)
• n ≥ 0 (the body may be empty)
• R0,…,Rn are relation symbols (or predicates)
• X0,…,Xn are tuples of terms (constants or variables)
• Safe - each variable occurring in the head must occur in the body
31. Syntax of Datalog
• Datalog program P - a (finite) set of Datalog rules
• Extensional predicate - does not occur in the head of a rule of P
• Intensional predicate - occurs in the head of some rule of P
• edb(P) - extensional predicates of P
• idb(P) - intensional predicates of P
• sch(P) - the set of predicates edb(P) ∪ idb(P)
33. Example: Recursive Program P
Is Glasgow reachable from Vienna?
Schema: Flight(origin, destination, airline), Airport(code, city)
Reachable(X,Y) ← Flight(X,Y,Z)
Reachable(X,W) ← Flight(X,Y,Z), Reachable(Y,W)
Ans() ← Airport(X,Vienna), Airport(Y,Glasgow), Reachable(X,Y)
dom(P) = {Vienna, Glasgow}
sch(P) = {Flight, Airport, Reachable, Ans}
edb(P) = {Flight, Airport}
idb(P) = {Reachable, Ans}
34. Fixpoint Semantics
• Relies on the immediate consequence operator TP
• Given a database D and a Datalog program P, an atom R(c1,…,cn) is an immediate consequence for D and P if:
- R(c1,…,cn) ∈ D, or
- there exists a rule R0(X0) ← R1(X1),…,Rn(Xn) in P and a homomorphism h such that
  {h(R1(X1)),…,h(Rn(Xn))} ⊆ D and h(R0(X0)) = R(c1,…,cn)
• TP - mapping from databases for sch(P) to databases for sch(P)
• TP(D) = { immediate consequences for D and P }
35. Fixpoint Semantics
• A crucial fact - for each P and D for edb(P): TP has a minimum fixpoint containing D
  (I is a fixpoint of TP if TP(I) = I)
Note: For the proof see, e.g., Theorem 12.3.2
in Foundations of Databases
• The semantics of P on D, denoted P(D), is this minimum fixpoint
• How do we compute it?
36. Fixpoint Semantics
• TP,0(I) = I and TP,i+1(I) = TP(TP,i(I))
• Compute TP,ω(I) = ∪i≥0 TP,i(I)
37. Fixpoint Semantics: Example
• Let P be the program which computes the transitive closure of a graph:
TrClosure(X,Y) ← Graph(X,Y)
TrClosure(X,Y) ← Graph(X,Z), TrClosure(Z,Y)
• Consider the input database D = {Graph(A,B), Graph(B,C), Graph(C,D)}
[Figure: chain graph A → B → C → D]
• TP,1(D) = D ∪ {TrClosure(A,B), TrClosure(B,C), TrClosure(C,D)}
• TP,2(D) = TP,1(D) ∪ {TrClosure(A,C), TrClosure(B,D)}
• TP,3(D) = TP,2(D) ∪ {TrClosure(A,D)}
• TP,4(D) = TP,3(D)
• Thus, TP,ω(D) = TP,3(D)
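The iteration above can be replayed mechanically. The following is a minimal Python sketch of naive fixpoint evaluation, hard-wired to the two transitive-closure rules of this slide; the function names and the encoding of atoms as (predicate, arguments) pairs are assumptions made for illustration, not part of the lecture.

```python
def immediate_consequences(db):
    """One application of T_P for the transitive-closure program (keeps db itself)."""
    graph = {args for (pred, args) in db if pred == "Graph"}
    closure = {args for (pred, args) in db if pred == "TrClosure"}
    derived = set(db)
    # TrClosure(X,Y) <- Graph(X,Y)
    derived |= {("TrClosure", edge) for edge in graph}
    # TrClosure(X,Y) <- Graph(X,Z), TrClosure(Z,Y)
    derived |= {("TrClosure", (x, y))
                for (x, z) in graph
                for (z2, y) in closure
                if z == z2}
    return derived


def fixpoint_stages(db):
    """Return the list of stages T_{P,0}(db), T_{P,1}(db), ... up to the fixpoint."""
    stages = [set(db)]
    while True:
        nxt = immediate_consequences(stages[-1])
        if nxt == stages[-1]:  # nothing new: the fixpoint is reached
            return stages
        stages.append(nxt)


D = {("Graph", ("A", "B")), ("Graph", ("B", "C")), ("Graph", ("C", "D"))}
stages = fixpoint_stages(D)
# Facts that are new at each stage i >= 1
new_facts = [sorted(args for (_, args) in stages[i] - stages[i - 1])
             for i in range(1, len(stages))]
for i, facts in enumerate(new_facts, start=1):
    print(f"stage {i}: new TrClosure facts {facts}")
```

The sketch stops as soon as a stage adds nothing, which corresponds to the slide's observation that the fourth stage equals the third.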
38. Fixpoint Semantics
• TP,0(I) = I and TP,i+1(I) = TP(TP,i(I))
• Compute TP,ω(I) = ∪i≥0 TP,i(I)
• TP,ω(D) is the minimum fixpoint of TP containing D
• TP,ω(D) is the ⊆-minimal model of P containing D
Note: For the proof see, e.g., Theorem 12.3.4
in Foundations of Databases
39. Model-Theoretic Approach
Transitive closure of a graph:
TrClosure(X,Y) ← Graph(X,Y)
TrClosure(X,Y) ← Graph(X,Z), TrClosure(Z,Y)
Graph      TrClosure      M
A B        A B            A A
B C        A C            A B
C D        A D            A C
           B C            A D
           B D
           C D            D D
TrClosure is the ⊆-minimal model; M is not a ⊆-minimal model
40. Complexity of Datalog
• Fact inference problem:
- Input: program P, database D for edb(P), atom α
- Question: P ∪ D ⊨ α, or, equivalently, α ∈ P(D)?
• Data complexity - P fixed, D part of the input [Vardi, STOC 1982]
• Combined complexity - both P and D part of the input [Vardi, STOC 1982]
41. Data Complexity of Datalog
Theorem: Datalog is PTIME-complete in data complexity
Proof (in PTIME):
• consider a program P and a database D
• |P(D)| ≤ |sch(P)| · (|dom(P) ∪ dom(D)|)^maxarity
  where dom(P) is the set of constants in P, dom(D) is the set of constants in D, and the second factor is the maximum number of tuples using constants of dom(P) ∪ dom(D)
• maxarity is a constant ⇒ P(D) can be constructed in PTIME
42. Data Complexity of Datalog
Theorem: Datalog is PTIME-complete in data complexity
Proof (PTIME-hard): reduction from fact inference for propositional LP(2)
43. Data Complexity of Datalog
• Propositional LP(2) - set of rules of the form R0 ← R1,R2
• Fact inference for propositional LP(2):
- Input: propositional LP(2) program P, propositional atom Q
- Question: P ⊨ Q?
44. PTIME-hardness of LP(2)
Theorem: LP(2) is PTIME-hard
Proof: Logspace reduction from Monotone Circuit Value Problem
[Figure: monotone circuit with inputs g1 = 1, g2 = 0, g3 = 1 and gates g4 = g1 ∧ g2, g5 = g2 ∨ g3, g6 = g4 ∨ g5]
Does the circuit evaluate to true?
45. PTIME-hardness of LP(2)
Theorem: LP(2) is PTIME-hard
Proof: Logspace reduction from Monotone Circuit Value Problem
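[Figure: monotone circuit with inputs g1 = 1, g2 = 0, g3 = 1 and gates g4 = g1 ∧ g2, g5 = g2 ∨ g3, g6 = g4 ∨ g5]
Encoding of the circuit as an LP(2) program P:
g6 ← g4
g6 ← g5
g4 ← g1, g2
g5 ← g2
g5 ← g3
g1 ←
g3 ←
Does the circuit evaluate to true? The circuit evaluates to true iff P ⊨ g6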
46. Data Complexity of Datalog
• PTIME-hard - we can also simulate a PTIME Turing machine;
see, e.g., [Dantsin, Eiter, Gottlob & Voronkov, ACM Computing Surveys 2001]
47. Data Complexity of Datalog
Theorem: Datalog is PTIME-complete in data complexity
Proof (PTIME-hard): reduction from fact inference for propositional LP(2)
Encode the program P in the database D:
• For each fact R0 of P, add to D the atom True(R0)
• For each rule R0 ← R1,R2 of P, add to D the atom S(R0,R1,R2)
• Construct the fixed Datalog program PDAT (a meta-interpreter for LP(2)):
T(X) ← True(X)
T(Z) ← T(X), T(Y), S(Z,X,Y)
• P ⊨ Q iff T(Q) ∈ PDAT(D)
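As a rough illustration of this reduction, the Python sketch below encodes the circuit program from the LP(2) hardness slides as a database and runs the two meta-interpreter rules over it. The helper names are hypothetical, and padding a one-atom body by repeating its atom (so that every rule fits the ternary relation S) is an assumption of this sketch rather than something stated on the slides.

```python
def encode(facts, rules):
    """Build the database D: True(R0) for facts, S(R0,R1,R2) for rules."""
    db = {("True", (f,)) for f in facts}
    for head, body in rules:
        # Assumption of this sketch: a body with one atom is padded by repeating it.
        first, second = body[0], body[-1]
        db.add(("S", (head, first, second)))
    return db


def run_meta_interpreter(db):
    """Evaluate T(X) <- True(X) and T(Z) <- T(X), T(Y), S(Z,X,Y) to a fixpoint."""
    s_atoms = {args for (pred, args) in db if pred == "S"}
    t = {args[0] for (pred, args) in db if pred == "True"}  # T(X) <- True(X)
    changed = True
    while changed:
        changed = False
        for (z, x, y) in s_atoms:
            if x in t and y in t and z not in t:  # T(Z) <- T(X), T(Y), S(Z,X,Y)
                t.add(z)
                changed = True
    return t


# The circuit program: inputs g1 and g3 are true, g2 is false.
facts = ["g1", "g3"]
rules = [("g6", ["g4"]), ("g6", ["g5"]),
         ("g4", ["g1", "g2"]),
         ("g5", ["g2"]), ("g5", ["g3"])]

derived = run_meta_interpreter(encode(facts, rules))
print("g6" in derived)  # does the circuit evaluate to true?
```

Note that only the database changes from one LP(2) program to another; the two meta-interpreter rules stay fixed, which is what makes this a data-complexity argument.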
48. Combined Complexity of Datalog
Theorem: Datalog is EXPTIME-complete in combined complexity
Proof (in EXPTIME):
• consider a program P and a database D
• |P(D)| ≤ |sch(P)| · (|dom(P) ∪ dom(D)|)^maxarity, where the second factor is the maximum number of tuples using constants of dom(P) ∪ dom(D)
• P(D) can be constructed in EXPTIME
49. Combined Complexity of Datalog
Theorem: Datalog is EXPTIME-complete in combined complexity
Proof (EXPTIME-hard):
by simulating an EXPTIME Turing machine
51. Deterministic Turing Machine (DTM)
M = (S, Σ, ⊔, δ, s0, sacc)
where S is the set of states, Σ the set of tape symbols, ⊔ the blank symbol, s0 the initial state, sacc the accepting state, and
δ : (S \ {sacc}) × Σ → S × Σ × {-1,0,1}
Example: δ(s1,a) = (s2,b,1)
IF at some time instant τ the machine is in state s1, the cursor points to cell κ, and this cell contains a
THEN at instant τ+1 the machine is in state s2, cell κ contains b, and the cursor points to cell κ+1
52. EXPTIME-hardness of Datalog
The goal: encode the EXPTIME computation of a DTM M on input string I with a Datalog program P, a database D, and an atom α such that α ∈ P(D) iff M accepts I in at most N = 2^m steps, where m = |I|^k
53. The Relational Schema
Time points and tape positions from 0 to N-1 are encoded using m-ary tuples over {0,1} (recall that N = 2^m), such that 0 = (0,…,0), 1 = (0,…,1), …, N-1 = (1,…,1)
• Symbol[a](T,C) - at time instant T, cell C contains a
• Cursor(T,C) - at time instant T, cursor points to cell C
• State[s](T) - at time instant T, the machine is in state s
• Accept(T) - at time instant T, the machine accepts
where T = T1,…,Tm and C = C1,…,Cm
54. Initialization Rules
• Assume that I = a1…an
• Assume that we have the relations Firstm, Succm and ≺m (will be defined later)
Symbol[ai](T,ti) ← Firstm(T)     (ti is the tuple encoding the number i)
Cursor(T,T) ← Firstm(T)
State[s0](T) ← Firstm(T)
Symbol[⊔](T,C) ← Firstm(T), ≺m(t,C)     (t is the tuple encoding the number n)
56. Inertia Rules
Cells which are not changed during the transition keep their old values
Symbol[a](T',C) ← Symbol[a](T,C), Cursor(T,C'), ≺m(C,C'), Succm(T,T')
Symbol[a](T',C) ← Symbol[a](T,C), Cursor(T,C'), ≺m(C',C), Succm(T,T')
57. Accepting Rule
Once we reach the accepting state we accept
Accept ← State[sacc](T)
58. Defining Firstm, Succm and Ám
We assume that D = {First1(0), Last1(1), Succ1(0,1)}
Inductive definition of Firsti+1, Lasti+1 and Succi+1:
Succi+1(Z,X,Z,Y) ← Succi(X,Y)     (one rule for each Z ∈ {0,1})
Succi+1(Z,X,W,Y) ← Succ1(Z,W), Lasti(X), Firsti(Y)
Firsti+1(Z,X) ← First1(Z), Firsti(X)
Lasti+1(Z,X) ← Last1(Z), Lasti(X)
Definition of ≺m:
≺m(X,Y) ← Succm(X,Y)
≺m(X,Y) ← Succm(X,Z), ≺m(Z,Y)
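To see that these rules really define the successor relation on m-bit numbers, the inductive definition can be replayed directly. The Python sketch below is one possible reading, assuming the base facts First1(0), Last1(1) and Succ1(0,1) and representing each 2i-ary Succi atom as a pair of i-bit tuples; all function names are hypothetical.

```python
def build_relations(m):
    """Build First_m, Last_m and Succ_m over m-bit tuples by the inductive rules."""
    first = {1: {(0,)}}
    last = {1: {(1,)}}
    succ = {1: {((0,), (1,))}}
    for i in range(1, m):
        # Succ_{i+1}(Z,X,Z,Y) <- Succ_i(X,Y), for each Z in {0,1}
        nxt = {((z,) + x, (z,) + y) for z in (0, 1) for (x, y) in succ[i]}
        # Succ_{i+1}(Z,X,W,Y) <- Succ_1(Z,W), Last_i(X), First_i(Y)
        nxt |= {(z + x, w + y)
                for (z, w) in succ[1] for x in last[i] for y in first[i]}
        succ[i + 1] = nxt
        # First_{i+1}(Z,X) <- First_1(Z), First_i(X), and likewise for Last
        first[i + 1] = {z + x for z in first[1] for x in first[i]}
        last[i + 1] = {z + x for z in last[1] for x in last[i]}
    return first[m], last[m], succ[m]


def less_than(succ):
    """Transitive closure of Succ_m, i.e. the strict order on m-bit tuples."""
    lt = set(succ)
    while True:
        new = {(x, y) for (x, z) in succ for (z2, y) in lt if z == z2} - lt
        if not new:
            return lt
        lt |= new


def to_int(bits):
    """Read a bit tuple as a binary number, most significant bit first."""
    return int("".join(str(b) for b in bits), 2)


first3, last3, succ3 = build_relations(3)
for x, y in sorted(succ3):
    print(to_int(x), "->", to_int(y))
```

Each level only needs a constant number of rules, so the whole family up to level m stays polynomial in m even though Succm itself has 2^m - 1 tuples.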
59. Concluding EXPTIME-hardness of Datalog
• Several rules, but only polynomially many ⇒ the construction is feasible in PTIME
• Accept ∈ P(D) iff M accepts I in at most N steps
• Can be formally shown by induction on the time steps
60. Datalog as an Ontology Language
• Ontology languages are usually based on description logics (prev. lecture)
• Much is possible with Datalog
DL Axiom                         Datalog Rule
Parent ⊓ Male ⊑ Father           Father(X) ← Parent(X), Male(X)
MetalDevice ⊑ ∀hasPart.Metal     Metal(Y) ← MetalDevice(X), hasPart(X,Y)
brotherOf ⊑ relativeOf           relativeOf(X,Y) ← brotherOf(X,Y)
parentOf inv childOf             childOf(Y,X) ← parentOf(X,Y)
trans(ancestorOf)                ancOf(X,Z) ← ancOf(X,Y), ancOf(Y,Z)
SeniorEmp × Emp ⊑ moreThan       moreThan(X,Y) ← SeniorEmp(X), Emp(Y)
61. Datalog as an Ontology Language
• Ontology languages are usually based on description logics (prev. lecture)
• Much is not possible with Datalog
[Patel-Schneider & Horrocks, Journal of Web Semantics 2007]
DL Axiom                   ?
Employee ⊑ ∃reportsTo      ∃Y reportsTo(X,Y) ← Employee(X)
funct(reportsTo)           Y = Z ← reportsTo(X,Y), reportsTo(X,Z)
Employee disj Customer     ⊥ ← Employee(X), Customer(X)
62. Datalog±
• Extend Datalog by allowing in the head:
- Existential quantification (∃)
- Equality atoms (=)
- Constant false (⊥)
  The resulting language is Datalog[∃,=,⊥]
• As we shall see, Datalog[∃] is undecidable
• Datalog[∃,=,⊥] is syntactically restricted → Datalog±
63. Datalog Extensions
• Formal Syntax of Datalog[∃]
• Fixpoint Semantics of Datalog[∃]
• Undecidability of Datalog[∃]
• Guardedness, Linearity and Stickiness
• Additional Features (=,⊥)
64. Syntax of Datalog[∃]
A Datalog[∃] rule is an expression
∃Y R0(X0,Y) ← R1(X1),…,Rn(Xn)
(∃Y R0(X0,Y) is the head; R1(X1),…,Rn(Xn) is the body)
• n ≥ 0 (the body may be empty)
• R0,…,Rn are relation symbols (or predicates)
• X0,…,Xn are tuples of terms (constants or variables)
• Y is a tuple of variables (disjoint from X0 ∪ … ∪ Xn)
66. Syntax of Datalog[∃]
• Datalog[∃] program P - a (finite) set of Datalog[∃] rules
• sch(P) - the set of predicates occurring in P
• Note: sch(P) is no longer partitioned into idb(P) and edb(P) - why?
DL Ontology               Datalog[∃] Program P
Person ⊑ ∃hasFather       ∃Y hasFather(X,Y) ← Person(X)
∃hasFather⁻ ⊑ Person      Person(Y) ← hasFather(X,Y)
All predicates of sch(P) appear in the body and in the head
67. Fixpoint Semantics
Analogous to the fixpoint semantics of Datalog - chase procedure
Input: Database D, Datalog[∃] program P
Output: Instance for sch(P) that satisfies P
D = {Person(John)}
P = { ∃Y hasFather(X,Y) ← Person(X),  Person(Y) ← hasFather(X,Y) }
chase(D,P) = D ∪ ?
71. Fixpoint Semantics
Analogous to the fixpoint semantics of Datalog - chase procedure
Input: Database D, Datalog[∃] program P
Output: Instance for sch(P) that satisfies P
D = {Person(John)}
P = { ∃Y hasFather(X,Y) ← Person(X),  Person(Y) ← hasFather(X,Y) }
chase(D,P) = D ∪ {hasFather(John,z1), Person(z1), hasFather(z1,z2), …}
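The first few chase steps of this example can be reproduced with a small program. The Python sketch below is hard-wired to the two rules of this slide and cut off after a chosen number of steps, since the chase here never terminates; the function name, the step budget and the naming scheme z1, z2, … for fresh values are assumptions of this sketch.

```python
from itertools import count


def chase_person(db, max_steps):
    """Apply at most max_steps chase steps for the two hasFather/Person rules."""
    instance = set(db)
    fresh = (f"z{i}" for i in count(1))  # supply of fresh values z1, z2, ...
    for _ in range(max_steps):
        step = None
        for (pred, args) in sorted(instance):
            # ∃Y hasFather(X,Y) <- Person(X): applicable only if X has no father yet
            if pred == "Person" and not any(
                    p == "hasFather" and a[0] == args[0] for (p, a) in instance):
                step = ("hasFather", (args[0], next(fresh)))
                break
            # Person(Y) <- hasFather(X,Y): applicable only if Person(Y) is missing
            if pred == "hasFather" and ("Person", (args[1],)) not in instance:
                step = ("Person", (args[1],))
                break
        if step is None:  # no rule is applicable: the chase has terminated
            break
        instance.add(step)
    return instance


D = {("Person", ("John",))}
for atom in sorted(chase_person(D, 3)):
    print(atom)
```

Raising the step budget only ever adds atoms, so every finite run is a prefix of the infinite chase result.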
72. Fixpoint Semantics
• Chase rule - the building block of the chase procedure
• A rule ρ = ∃Y R0(X0,Y) ← R1(X1),…,Rn(Xn) is applicable to instance I if:
- there exists a homomorphism h such that {h(R1(X1)),…,h(Rn(Xn))} ⊆ I
- but there is no μ ⊇ h such that μ(R0(X0,Y)) ∈ I
Example, for the rule ∃Y S(X,Y) ← R(X) and h = {X → a}:
- I = {S(a,b), R(a)}: not applicable, since μ = {X → a, Y → b} extends h and μ(S(X,Y)) ∈ I
- I = {S(b,a), R(a)}: applicable
73. Fixpoint Semantics
• Let J = I ∪ {μ(R0(X0,Y))}, where μ ⊇ h and μ(Yi) is a fresh value not in I
• The result of applying ρ to I with h is J, denoted I →(ρ,h) J - chase step
74. Fixpoint Semantics
• A finite chase of D w.r.t. P is a finite sequence
D →(ρ1,h1) I1 →(ρ2,h2) I2 →(ρ3,h3) … →(ρm,hm) Im
• and chase(D,P) is defined as the instance Im
• An infinite chase of D w.r.t. P is an infinite sequence
D →(ρ1,h1) I1 →(ρ2,h2) I2 →(ρ3,h3) … →(ρm,hm) Im → …
• and chase(D,P) is defined as the instance ∪j≥0 Ij (with I0 = D)
• The semantics of P on D, denoted P(D), is defined as chase(D,P)
75. Chase: A Universal Model
[Figure: C = chase(D,P) contains D and is mapped by homomorphisms h1, h2, … into the models I1, I2, … of D and P]
∀I (I is a model of D and P ⇒ there is a homomorphism from chase(D,P) to I)
Implicit in [Fagin, Kolaitis, Miller & Popa, Theoretical Computer Science 2005]
76. Chase: Uniqueness Property
• In general the chase is not unique - it depends on the order of rule application
D = {R(a)},  ρ1 = ∃Y S(Y) ← R(X),  ρ2 = S(X) ← R(X)
Solution1 = {R(a), S(z), S(a)}     (ρ1 then ρ2)
Solution2 = {R(a), S(a)}           (ρ2 then ρ1)
• Unique up to homomorphic equivalence
[Figure: chase results C1, C2, C3 with homomorphisms h12 and h21 between C1 and C2, and h23 and h32 between C2 and C3]
77. Chase: The Challenge of Infinity
• In general the chase is infinite
D = {R(a,b)},  ∃Z R(Y,Z) ← R(X,Y)
Solution = {R(a,b), R(b,z1), R(z1,z2), R(z2,z3), …}
• For plain Datalog, the fixpoint semantics provides an algorithm
• The situation changes dramatically for Datalog[∃] - undecidable
78. Undecidability of Datalog[∃]
Theorem: Datalog[∃] is undecidable
Proof:
by simulating a deterministic Turing machine with an empty tape
79. Build an Infinite Grid
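[Figure: infinite grid starting at the node c, with horizontal edges H and vertical edges V; the i-th horizontal line represents the i-th configuration of the machine]
Start(c) fixes the starting point
Node(X) ← Start(X)
Initial(X) ← Start(X)
∃Y H(X,Y) ← Node(X)
Node(Y) ← H(X,Y)
∃Y V(X,Y) ← Node(X)
Node(Y) ← V(X,Y)
V(Y,W) ← H(X,Y), H(Z,W), V(X,Z)
[Figure: grid cell with corners X, Y, Z, W, closed by the last rule]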
80. Initialization Rules
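[Figure: initial configuration - every cell contains ⊔ and the cursor is on the first cell in state s0]
Initial(Y) ← Initial(X), H(X,Y)
Cursor[s0](X) ← Start(X)
Symbol[⊔](X) ← Initial(X)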
82. Inertia Rules
[Figure: two consecutive tape rows a b c d; cells before the cursor are marked MarkLeft, cells after it MarkRight]
MarkRight(Y) ← Mark(X), H(X,Y)
MarkRight(Y) ← MarkRight(X), H(X,Y)
Symbol[a](Y) ← MarkRight(X), Symbol[a](X), V(X,Y)
We need similar rules for the cells before the cursor
83. Accepting Rule
Once we reach the accepting state we accept
Accept ← Cursor[sacc](X)
Accept ∈ P(D) iff the DTM accepts
84. Undecidability of Datalog[∃]
Theorem: Datalog[∃] is undecidable
Proof:
by simulating a deterministic Turing machine with an empty tape
… syntactic restrictions are needed!
85. Guarded Datalog[∃]
• Inspired by the guarded fragment of first-order logic
[Andréka, van Benthem & Németi, Journal of Philosophical Logic 1998]
• There exists a body-atom that contains all the body-variables - guard
Manager(X) ← Employee(X), supervisorOf(X,Y), Manager(Y)
• Chase has finite treewidth ⇒ Guarded Datalog[∃] is decidable
[Calì, Gottlob & Kifer, KR 2008]
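Guardedness is a purely syntactic condition on the rule body, so it is easy to test. The Python sketch below looks for a guard atom; the representation of atoms as (predicate, arguments) pairs and the convention that terms starting with an uppercase letter are variables are assumptions of this sketch (the examples use variables only).

```python
def is_variable(term):
    """Assumed convention: a term starting with an uppercase letter is a variable."""
    return term[:1].isupper()


def find_guard(body):
    """Return a body atom containing all body variables, or None if there is none."""
    body_vars = {t for (_, args) in body for t in args if is_variable(t)}
    for atom in body:
        atom_vars = {t for t in atom[1] if is_variable(t)}
        if body_vars <= atom_vars:
            return atom
    return None


# Manager(X) <- Employee(X), supervisorOf(X,Y), Manager(Y)
manager_body = [("Employee", ("X",)), ("supervisorOf", ("X", "Y")), ("Manager", ("Y",))]
# V(Y,W) <- H(X,Y), H(Z,W), V(X,Z)   (the grid-closing rule of the undecidability proof)
grid_body = [("H", ("X", "Y")), ("H", ("Z", "W")), ("V", ("X", "Z"))]

print(find_guard(manager_body))  # the supervisorOf atom acts as the guard
print(find_guard(grid_body))     # no single atom covers X, Y, Z and W
```

The grid-closing rule from the undecidability proof fails the test, which is consistent with guardedness ruling out that construction.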
86. Treewidth of the Chase
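• Tree decomposition - a mapping of a graph into a tree
[Figure: a graph on the vertices A to H and a tree decomposition with the tree nodes ABC, BCE, CDE, BEG, BFG, EGH; treewidth = 2]
• Treewidth - the maximum number of graph vertices mapped to any tree node, minus 1 (minimized over all tree decompositions)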
87. Treewidth of the Chase
• An instance I can be represented as a graph - Gaifman graph
[Figure: Gaifman graph of I = {R(a,b,c), S(c,d), T(c,d,e)} over the vertices a, b, c, d, e]
• Treewidth of I is defined as the treewidth of its Gaifman graph
• Chase has finite treewidth ⇒ it is a tree-like structure
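For concreteness, the Gaifman graph of the instance shown on this slide can be computed as follows. This is a minimal Python sketch: two terms are connected whenever they occur together in some atom; the function name and the encoding of atoms are assumptions made for illustration.

```python
from itertools import combinations


def gaifman_graph(instance):
    """Vertices are the terms of the instance; edges join terms sharing an atom."""
    vertices, edges = set(), set()
    for (_, args) in instance:
        vertices.update(args)
        # every pair of distinct terms of the same atom becomes an edge
        edges.update(frozenset(pair) for pair in combinations(sorted(set(args)), 2))
    return vertices, edges


I = [("R", ("a", "b", "c")), ("S", ("c", "d")), ("T", ("c", "d", "e"))]
vertices, edges = gaifman_graph(I)
print(sorted(vertices))
print(sorted(tuple(sorted(e)) for e in edges))
```

Each atom of arity k contributes a clique on its terms, so the graph records only which terms co-occur, not the predicates or the argument order.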
88. Guarded Datalog[∃]
• What about the complexity of Guarded Datalog[∃]? - now we consider ontological query answering (previous lecture)
89. Complexity of Guarded Datalog[∃]
• Ontological query answering problem:
- Input: program P (the TBox), database D for sch(P) (the ABox), conjunctive query Q
- Question: P ∪ D ⊨ Q, or, equivalently, P(D) ⊨ Q?
• Data complexity - P and Q fixed, D part of the input
• Combined complexity - everything part of the input
92. Ontological Query Answering: Example
D = {Person(John)}
P = { ∃Y hasFather(X,Y) ← Person(X),  Person(Y) ← hasFather(X,Y) }
P(D) = { …, Father(z,John), Person(z), … }
Q1 ← Father(X,John), Person(X)
Q2 ← Father(John,X)
93. Data Complexity of Guarded Datalog[∃]
Theorem: Guarded Datalog[∃] is PTIME-complete in data complexity
Proof (in PTIME):
• Guarded Datalog[∃] enjoys the bounded guard-depth property
• Construct in PTIME the finite part C of the guarded chase forest
• Evaluate the given query over C
[Calì, Gottlob & Lukasiewicz, Journal of Web Semantics 2012]
98. Data Complexity of Datalog
Theorem: Guarded Datalog[∃] is PTIME-complete in data complexity
Proof (PTIME-hard): reduction from fact inference for propositional LP(2)
Encode the program P in the database D:
• For each fact R0 of P, add to D the atom True(R0)
• For each rule R0 ← R1,R2 of P, add to D the atom S(R0,R1,R2)
• Construct the fixed Guarded Datalog[∃] program PDAT (a meta-interpreter for LP(2)):
T(X) ← True(X)
T(Z) ← T(X), T(Y), S(Z,X,Y)     (guarded by the atom S(Z,X,Y))
• P ⊨ Q iff T(Q) ∈ PDAT(D)
99. Combined Complexity of Guarded Datalog[∃]
Theorem: Guarded Datalog[∃] is 2EXPTIME-complete in comb. complexity
Proof:
• Upper bound: alternating EXPSPACE algorithm
• Lower bound: by simulating an alternating EXPSPACE TM
[Calì, Gottlob & Kifer, KR 2008]
100. DLs vs Guarded Datalog[∃]
ELH: Popular DL (for biological applications) with PTIME data complexity
[Baader, IJCAI 2003 and Rosati, DL 2007]
ELH TBox       Datalog[∃] Representation
A ⊑ B          B(X) ← A(X)
A ⊓ B ⊑ C      C(X) ← A(X), B(X)
∃R.A ⊑ B       B(X) ← R(X,Y), A(Y)
A ⊑ ∃R.B       ∃Y Aux(X,Y) ← A(X);  R(X,Y) ← Aux(X,Y);  B(Y) ← Aux(X,Y)
R ⊑ S          S(X,Y) ← R(X,Y)
101. DLs vs Guarded Datalog[∃]
DL-Lite: Popular family of DLs with AC0 data complexity (OWL 2 QL)
[Calvanese, De Giacomo, Lembo, Lenzerini & Rosati, Journal of Automated Reasoning 2007]
DL-LiteR TBox     Datalog[∃] Representation
A ⊑ B             B(X) ← A(X)
A ⊑ ∃R            ∃Y R(X,Y) ← A(X)
∃R ⊑ A            A(X) ← R(X,Y)
R ⊑ S             S(X,Y) ← R(X,Y)
Note: Disjointness assertions are harmless - will be treated differently
102. Linear Datalog[∃]
• Goal: Lightweight Datalog[∃] fragment which is highly tractable
• There exists only one atom in the body
∃Y hasFather(X,Y) ← Person(X)
• Enjoys first-order rewritability
[Calì, Gottlob & Lukasiewicz, Journal of Web Semantics 2012]
103. First-Order Rewritability
[Figure: the query Q and the program P are compiled into a first-order query QP; QP is translated into an SQL query Q*, which is evaluated over the database D]
∀D (P(D) ⊨ Q ⇔ D ⊨ Q*)
[Calvanese, De Giacomo, Lembo, Lenzerini & Rosati, Journal of Automated Reasoning 2007]
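A tiny instance of this equivalence can be checked by hand. The Python sketch below takes the single linear rule ∃Y hasFather(X,Y) ← Person(X) and a hypothetical Boolean query Q ← hasFather(X,Y); a rewriting for this particular pair is Q* = ∃X∃Y hasFather(X,Y) ∨ ∃X Person(X). Both the query and the hand-written rewriting are assumptions of this sketch, not the output of a rewriting algorithm.

```python
def chase(db):
    """Chase for the single rule ∃Y hasFather(X,Y) <- Person(X); it terminates."""
    instance = set(db)
    fresh = 0
    for (pred, args) in sorted(db):
        has_father = any(p == "hasFather" and a[0] == args[0] for (p, a) in instance)
        if pred == "Person" and not has_father:
            fresh += 1
            instance.add(("hasFather", (args[0], f"z{fresh}")))
    return instance


def q(instance):
    """Q <- hasFather(X,Y): is there any hasFather atom?"""
    return any(pred == "hasFather" for (pred, _) in instance)


def q_star(db):
    """Hand-written rewriting: a hasFather atom or a Person atom in the database."""
    return any(pred in ("hasFather", "Person") for (pred, _) in db)


databases = [
    set(),
    {("Person", ("John",))},
    {("hasFather", ("a", "b"))},
    {("Employee", ("x",))},
]
for db in databases:
    # P(D) ⊨ Q is checked on the chase, D ⊨ Q* directly on the database
    print(q(chase(db)), q_star(db))
```

The point of the rewriting is that `q_star` never looks at the program or the chase at evaluation time; it is a plain query over D.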
104. Bounded Derivation-Depth Property (BDDP)
[Figure: the query Q is matched within a part C of the chase graph of D w.r.t. P (not the guarded chase forest); C has constant depth w.r.t. D]
P(D) ⊨ Q ⇒ C ⊨ Q
105. Bounded Derivation-Depth Property (BDDP)
[Figure: the query Q is matched within a part C of the chase graph of D w.r.t. P (not the guarded chase forest); C has constant depth w.r.t. D]
Sufficient condition for first-order rewritability
106. Linear Datalog[∃]
• What about the complexity of Linear Datalog[∃]?
107. Data Complexity of Linear Datalog[∃]
Theorem: Linear Datalog[∃] is in AC0 in data complexity
Proof:
• We exploit first-order rewritability of Linear Datalog[∃]
• Construct the first-order rewritten query QFO (in constant time)
• Evaluate QFO over the given database (in AC0)
[Vardi, PODS 1995]
[Calì, Gottlob & Lukasiewicz, Journal of Web Semantics 2012]
108. Combined Complexity of Linear Datalog[∃]
Theorem: Linear Datalog[∃] is PSPACE-complete in comb. complexity
Proof:
• Upper bound: exploit the BDDP;
implicit in [Johnson & Klug, JCSS 1984]
[Figure: the query Q matched within a bounded-depth part of the chase of D]
• Lower bound: by simulating a PSPACE Turing machine
109. PSPACE-hardness of Linear Datalog[∃]
• Assume that the tape alphabet is {0,1,⊔}
• Suppose that M halts on I = a1…am using n = m^k cells, for k > 0
• Initial configuration (the database D)
Config(sinit,a1,…,am,⊔,…,⊔,1,0,…,0)     (n-m blanks ⊔; the cursor block is 1 followed by n-1 zeros)
• Transition rule - δ(s1,a) = (s2,b,1), one rule per cell position i
Config(s2,X1,…,Xi-1,b,Xi+1,…,Xn,0,…,0,1,0,…,0)     (cursor block: i zeros, then 1, then n-i-1 zeros)
  ← Config(s1,X1,…,Xi-1,a,Xi+1,…,Xn,0,…,0,1,0,…,0)     (cursor block: i-1 zeros, then 1, then n-i zeros)
• Accepting rule
Accept ← Config(sacc,X1,…,Xn,Y1,…,Yn)
111. But…
• What about joins in rule bodies?
∃E Employee(E,D,P,A) ← Runs(D,P), Area(P,A)
• What about the DL assertion concept product?
biggerThan(E,M) ← Elephant(E), Mouse(M)
115. Data Complexity of Sticky Datalog[∃]
Theorem: Sticky Datalog[∃] is in AC0 in data complexity
Proof:
• Sticky Datalog[∃] enjoys the BDDP
• Thus, Sticky Datalog[∃] is first-order rewritable ⇒ in AC0
[Calì, Gottlob & Pieris, VLDB 2010]
116. Combined Complexity of Sticky Datalog[∃]
Theorem: Sticky Datalog[∃] is EXPTIME-complete in combined complexity
Proof:
• EXPTIME-membership: construct a proof of the query by applying an APSPACE procedure
• EXPTIME-hardness: fact inference for lossless Datalog programs over {0,1} is EXPTIME-hard
e.g., S(X,Y,Z) ← P(X,Y), P(Y,Z), R(Z)
[Calì, Gottlob & Pieris, VLDB 2010]
118. Equality Atom
Xi = Xj ← R1(X1),…,Rn(Xn)
• Linear Datalog[∃,=] is already undecidable
implicit in [Chandra & Vardi, SIAM Journal on Computing 1985]
• Separability: given P = P∃ ∪ P=,
∀D ∀Q: D ⊨ P= ⇒ (P(D) ⊨ Q iff P∃(D) ⊨ Q)
[Calì, Lembo & Rosati, PODS 2003]
• Non-conflicting Datalog[∃,=]: sufficient condition for separability
see, e.g., [Calì, Gottlob & Lukasiewicz, Journal of Web Semantics 2012]
119. Truth Constant False
⊥ ← R1(X1),…,Rn(Xn)
• Preliminary check without adding complexity - given P = P∃ ∪ P⊥:
P(D) ⊨ Q
⇕
P∃(D) ⊨ Q OR P∃(D) ⊨ Qρ, for some ρ ∈ P⊥
where Qρ ← body(ρ)
121. Further Reading (Partial List)
Guarded Datalog[∃] (and extensions)
- Andrea Calì, Georg Gottlob, and Michael Kifer. Taming the infinite chase: Query answering under expressive
relational constraints. In Proceedings of KR, pages 70-80, 2008.
- Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz. A general Datalog-based framework for tractable query
answering over ontologies. J. Web Sem., 14:57-83, 2012.
- Jean-François Baget, Marie-Laure Mugnier, Sebastian Rudolph, and Michaël Thomazo. Walking the
complexity lines for generalized guarded existential rules. In Proceedings of IJCAI, pages 712-717, 2011.
Sticky Datalog[∃] (and extensions)
- Andrea Calì, Georg Gottlob, Andreas Pieris. Advanced processing for ontological queries. PVLDB 3(1): 554-565, 2010.
- Andrea Calì, Georg Gottlob, Andreas Pieris. Query answering under non-guarded rules in Datalog+/-. In
Proceedings of RR, pages 1-17, 2010.
- Georg Gottlob, Giorgio Orsi, Andreas Pieris. Ontological queries: Rewriting and optimization. In Proceedings
of ICDE, pages 2-13, 2011.
Weakly-acyclic Datalog[∃] (and extensions)
- Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, Lucian Popa. Data exchange: Semantics and query answering.
Theor. Comput. Sci., 336(1): 89-124, 2005.
- Alin Deutsch, Alan Nash, Jeffrey B. Remmel. The chase revisited. In Proceedings of PODS, pages 149-158, 2008.
- Bruno Marnette. Generalized schema-mappings: From termination to tractability. In Proceedings of PODS, pages
13-22, 2009.
Shy Datalog[∃]
- Nicola Leone, Marco Manna, Giorgio Terracina, Pierfrancesco Veltri. Efficiently computable Datalog∃ Programs.
In Proceedings of KR, 2012.