Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
1. Going Beyond Provenance: Explaining Query Answers
with Pattern-based Counterbalances
SIGMOD 2019
Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy
Illinois Institute of Technology and Duke University
SIGMOD Research Session 5 - July 3rd - 11:30am
5. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
6. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
"Why high/low" question [Wu and Madden, 2013] [Roy and Suciu, 2014]
Intervention: a subset of provenance whose removal would cause
the result to move in the opposite direction
7. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
"Why high/low" question [Wu and Madden, 2013] [Roy and Suciu, 2014]
Intervention: a subset of provenance whose removal would cause
the result to move in the opposite direction
All based on provenance
11. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
12. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian (provenance-based explanation): Yeah, I worked from 9-11 AM.
13. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian (provenance-based explanation): Yeah, I worked from 9-11 AM.
Boris: Okay, I’m cutting your stipend.
14. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian: I was on a plane to SIGMOD for 8 hours.
Boris: Fair enough.
15. Example - Table
Pub
author pubid year venue
AX P1 2005 SIGKDD
AY P2 2004 SIGKDD
AZ P2 2004 SIGKDD
AZ P3 2004 SIGMOD
Q =
SELECT author, year, venue, count(*) AS pubcnt
FROM Pub
GROUP BY author, year, venue
16. Example - Table
Pub
author pubid year venue
AX P1 2005 SIGKDD
AY P2 2004 SIGKDD
AZ P2 2004 SIGKDD
AZ P3 2004 SIGMOD
Q =
SELECT author, year, venue, count(*) AS pubcnt
FROM Pub
GROUP BY author, year, venue
author venue year pubcnt
AX SIGKDD 2006 4
AX SIGKDD 2007 1
AX SIGKDD 2008 4
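To make the toy example concrete, the following minimal Python/sqlite3 sketch loads only the four Pub rows shown above and runs the slide's query Q. (The result table on the slide, with AX's SIGKDD counts for 2006-2008, presumably comes from a larger instance; over just these four rows every group has pubcnt 1.)

import sqlite3

# Toy Pub instance copied from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pub (author TEXT, pubid TEXT, year INT, venue TEXT)")
conn.executemany(
    "INSERT INTO Pub VALUES (?, ?, ?, ?)",
    [("AX", "P1", 2005, "SIGKDD"),
     ("AY", "P2", 2004, "SIGKDD"),
     ("AZ", "P2", 2004, "SIGKDD"),
     ("AZ", "P3", 2004, "SIGMOD")])

# The query Q from the slide.
Q = """SELECT author, year, venue, count(*) AS pubcnt
       FROM Pub
       GROUP BY author, year, venue"""
for row in conn.execute(Q):
    print(row)   # e.g. ('AX', 2005, 'SIGKDD', 1)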
17. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
18. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
19. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
20. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
21. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
Our approach: by counterbalance
AX’s high publication count in another venue or year
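To illustrate the counterbalance idea only (the scoring below is a simple stand-in, not CAPE's pattern-based ranking), one can scan the aggregated result for rows of the same author at other venues or years whose count is unusually high, i.e. outliers in the opposite direction of the "low" question:

from statistics import mean

def counterbalances(result, author, venue, year, k=3):
    # result: the aggregated rows as dicts with keys author, venue, year, pubcnt.
    mine = [r for r in result if r["author"] == author]
    avg = mean(r["pubcnt"] for r in mine)
    # Same author, but a different venue or year, with an unusually high count.
    others = [r for r in mine if (r["venue"], r["year"]) != (venue, year)]
    high = [r for r in others if r["pubcnt"] > avg]
    return sorted(high, key=lambda r: r["pubcnt"] - avg, reverse=True)[:k]

# Hypothetical usage for the question about (AX, SIGKDD, 2007):
#   counterbalances(query_result, "AX", "SIGKDD", 2007)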
22. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
23. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs
24. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance
Slide 7 of 16 Q. Zeng - CAPE: Introduction
25. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
Slide 7 of 16 Q. Zeng - CAPE: Introduction
26. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
(ARP mining is offline; the counterbalance search and ranking are interactive, driven by the user question)
Slide 7 of 16 Q. Zeng - CAPE: Introduction
27. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
(ARP mining is offline; the counterbalance search and ranking are interactive, driven by the user question)
CAPE
Slide 7 of 16 Q. Zeng - CAPE: Introduction
28. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
29. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
30. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
31. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
An aggregate function
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
32. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
An aggregate function
A regression model type
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
33. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
34. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes Say,
P holds on AX
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
35. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes
A pattern can also hold globally if it holds for sufficiently many values
of partition attributes (A good number of authors)
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
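As a rough, self-contained illustration of the definitions above, the sketch below represents an ARP by its four components and checks whether it holds locally (for one author) or globally (for sufficiently many authors). The R^2 goodness-of-fit test, its threshold, and the support fraction are assumptions made for the example; the slides do not give CAPE's exact fit and support criteria.

from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass(frozen=True)
class ARP:
    partition_attrs: Tuple[str, ...]   # e.g. ("author",)
    predictor_attrs: Tuple[str, ...]   # e.g. ("year",)
    agg: str                           # e.g. "count(*)"
    model: str                         # e.g. "linear"

def fits_linear(xs, ys, r2_min=0.75):
    # Least-squares line for one partition value, accepted if R^2 clears a
    # threshold (the threshold is an assumption, not the paper's definition).
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slope, intercept = np.polyfit(xs, ys, deg=1)
    residuals = ys - (slope * xs + intercept)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    r2 = 1.0 - np.sum(residuals ** 2) / ss_tot if ss_tot > 0 else 1.0
    return r2 >= r2_min

def holds_locally(series: List[Tuple[int, int]]) -> bool:
    # P holds locally on one partition value if its (year, count) series fits the model.
    if len(series) < 3:
        return False
    xs, ys = zip(*series)
    return fits_linear(xs, ys)

def holds_globally(groups: Dict[str, List[Tuple[int, int]]], min_frac=0.6) -> bool:
    # P holds globally if it holds locally on a large enough fraction of partition values.
    local = [holds_locally(s) for s in groups.values()]
    return sum(local) / len(local) >= min_frac

# P = "for each author, count(*) is linear over the years"
P = ARP(("author",), ("year",), "count(*)", "linear")
groups = {   # author -> [(year, pubcnt), ...]; the numbers are invented
    "AX": [(2005, 3), (2006, 5), (2007, 6), (2008, 8)],
    "AY": [(2005, 1), (2006, 2), (2007, 3)],
    "AZ": [(2005, 4), (2006, 1), (2007, 5)],
}
print(holds_locally(groups["AX"]), holds_globally(groups))   # True True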
37. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
38. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
39. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
at most 4 attributes in a pattern. This alone reduces the number of candidate patterns to a polynomial in |R|.
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
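To see the effect of the size restriction, the snippet below counts candidate role assignments for an illustrative |R| = 10, assuming each attribute is either a partition attribute, a predictor attribute, or unused, and that a restricted pattern uses at most 4 attributes split into two nonempty roles; the exact counting in the paper may differ.

from math import comb

n = 10  # |R|: number of attributes; illustrative

# Brute force: each attribute is partition, predictor, or unused -> at least 3^n candidates.
brute = 3 ** n

# Restricted: at most 4 attributes appear in a pattern, split between the two roles
# (both roles nonempty here; that detail is an assumption).
restricted = sum(comb(n, k) * (2 ** k - 2) for k in range(2, 5))

print(brute, restricted)   # 59049 vs 3750 for n = 10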
40. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
41. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Partition Attributes | Predictor Attributes
A, B, C | D
A, B | C, D
A | B, C, D
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
42. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Detecting and Applying Functional Dependency
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
43. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Detecting and Applying Functional Dependency
"For each A, agg(α) is linear over C"
A → B
⇒ "For each A and B, agg(α) is linear over C"
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
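The sketch below illustrates two of the optimizations listed above: all prefix splits of one attribute order can be evaluated from a single sort, and candidates implied by a functional dependency are pruned. The attribute "institution" and the FD author → institution are invented for the example, and the enumeration is a simplification, not CAPE's actual mining algorithm.

# Attributes of the aggregate query result; "institution" is invented so the
# FD example below has something to fire on.
ATTRS = ("author", "institution", "venue", "year")
FDS = [(("author",), ("institution",))]   # author -> institution, illustrative

def prefix_splits(order):
    # All (partition | predictor) splits served by ONE sort on `order`:
    # sorting by (A,B,C,D) serves A|BCD, A,B|CD and A,B,C|D (cf. the table above).
    return [(order[:i], order[i:]) for i in range(1, len(order))]

def fd_closure(attrs):
    # Attributes functionally determined by `attrs` under FDS.
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in FDS:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def implied(partition, validated_partitions):
    # FD pruning: if the pattern holds with partition attrs X and X -> Y,
    # it also holds with X ∪ Y, so that candidate need not be fitted again.
    return any(set(p) <= set(partition) <= fd_closure(p) for p in validated_partitions)

print(prefix_splits(ATTRS))
print(implied(("author", "institution"), [("author",)]))   # True: pruned via the FD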
45. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
46. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
47. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
48. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
49. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
AX’s number of SIGKDD publications each year:
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
50. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
51. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
E.g. P="For each author, the total publication is linear over the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
52. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
E.g. P="For each author, the total publication is linear over the years"
AX’s number of publications each year:
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
53. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
54. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
55. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
56. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
P1="For author AX and ICDE, the total publication is constant over
the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
57. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
P1="For author AX and ICDE, the total publication is constant over
the years"
In this simple example the refinement happens to end up with the same attributes as the user question, but that is not required.
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
58. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P1="For author AX and ICDE, the total publication is constant over
the years"
3 t = (AX, ICDE, 2007, 6) ∈ QP1
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
59. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P1="For author AX and ICDE, the total publication is constant over
the years"
3 t = (AX, ICDE, 2007, 6) ∈ QP1
t[pubcnt] = 6 is a high outlier
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
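Putting the three steps together for the running example, here is a small self-contained sketch of the counterbalance search under the refined pattern P1. The aggregate values, the z-score outlier test, and its threshold are illustrative assumptions, not the paper's definitions.

import statistics as st

# Aggregated result of Q as (author, venue, year, pubcnt); the values are invented.
result = [
    ("AX", "SIGKDD", 2006, 4), ("AX", "SIGKDD", 2007, 1), ("AX", "SIGKDD", 2008, 4),
    ("AX", "ICDE",   2005, 2), ("AX", "ICDE",   2006, 2), ("AX", "ICDE",   2007, 6),
    ("AX", "ICDE",   2008, 2), ("AX", "VLDB",   2006, 1), ("AX", "VLDB",   2007, 4),
]
question = ("AX", "SIGKDD", 2007, 1)   # phi: why is this count LOW?

def is_high_outlier(value, peers, z=1.5):
    # Under "for this author and venue, pubcnt is constant over the years",
    # flag value as a high outlier if it sits well above the group's mean.
    if len(peers) < 3 or st.pstdev(peers) == 0:
        return False
    return (value - st.mean(peers)) / st.pstdev(peers) >= z

author, venue, year, _ = question
counterbalances = []
for a, v, y, cnt in result:
    if a != author or (v, y) == (venue, year):
        continue                      # only tuples about the same author, not phi itself
    peers = [c for (a2, v2, _y2, c) in result if (a2, v2) == (a, v)]
    if is_high_outlier(cnt, peers):
        counterbalances.append((a, v, y, cnt))

print(counterbalances)   # [('AX', 'ICDE', 2007, 6)] for this toy data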
60. Explanation
Explanations returned by CAPE for φ
contain AX’s number of publications in another venue or another year
E.g. (AX , ICDE, 2006, 6), (AX , VLDB, 2007, 4)
don’t need to have the same schema as φ
E.g. (AX , 2010, 63)
Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
61. Explanation
Explanations returned by CAPE for φ
contain AX’s number of publications in another venue or another year
E.g. (AX , ICDE, 2006, 6), (AX , VLDB, 2007, 4)
don’t need to have the same schema as φ
E.g. (AX , 2010, 63)
Not all counterbalances are good. We need to score them and return top
ones.
Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
62. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
63. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
⇒ Tuples that are more similar are more likely to have caused the unusual result.
For φ = (AX, SIGKDD, 2007, 1), an answer from 2007 is better than one from 2006, and ICDE is better than a conference in another area such as SIGCOMM.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
64. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
65. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
⇒ Higher deviation means more unusual, which is more likely to cause
other unusual events.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
66. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
⇒ Higher deviation means more unusual, which is more likely to cause
other unusual events.
AX’s SIGKDD publications: AX’s ICDE publications:
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
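A sketch of how the two factors could be combined into a single score follows; the slides do not give CAPE's actual scoring formula, so the distance measure, the linear combination, and the weights below are assumptions for illustration only.

def distance(question, explanation):
    # Attribute-wise distance: numeric attributes (year) by absolute difference,
    # categorical ones by 0/1 mismatch; attributes missing from the explanation count as 1.
    d = 0.0
    for attr, q_val in question.items():
        e_val = explanation.get(attr)
        if e_val is None:
            d += 1.0
        elif isinstance(q_val, (int, float)):
            d += abs(q_val - e_val)
        else:
            d += 0.0 if q_val == e_val else 1.0
    return d

def score(question, explanation, deviation, w_dev=1.0, w_dist=1.0):
    # Larger deviation from the pattern's prediction is better;
    # larger distance from the question tuple is worse.
    return w_dev * deviation - w_dist * distance(question, explanation)

phi  = {"author": "AX", "venue": "SIGKDD", "year": 2007}
expl = {"author": "AX", "venue": "ICDE",   "year": 2007}
print(score(phi, expl, deviation=4.0))   # deviation value is illustrative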
67. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
68. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
69. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
70. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
3 | Battery | - | 2010 | 1095 | 49.0
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
71. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
3 | Battery | - | 2010 | 1095 | 49.0
4 | Assault | 26 | 2011 | 10 | 40.1
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
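For reference, the relational-algebra query Q above is an ordinary group-by count; a minimal sqlite3 sketch over a toy stand-in for the Crime table (values invented) could look as follows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Crime (id INT, type TEXT, community INT, year INT)")
conn.executemany("INSERT INTO Crime VALUES (?, ?, ?, ?)", [
    (1, "Battery", 26, 2011), (2, "Battery", 26, 2012),
    (3, "Assault", 26, 2011), (4, "Battery", 25, 2011),
])

# Q = gamma_{type, community, year, count(*)}(Crime)
q = """SELECT type, community, year, COUNT(*) AS cnt
       FROM Crime
       GROUP BY type, community, year"""
for row in conn.execute(q):
    print(row)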
72. Conclusion & Future Work
Conclusions
Provenance may be insufficient
Reasonable explanations can be given by counterbalance
Mine patterns offline
Look for counterbalance and rank online
Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
73. Conclusion & Future Work
Conclusions
Provenance may be insufficient
Reasonable explanations can be given by counterbalance
Mine patterns offline
Look for counterbalance and rank online
Future Work
Extend to a larger class of queries
e.g., joins
Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
75. References I
[Arab et al., 2014] Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., and Glavic, B. (2014).
A generic provenance middleware for database queries, updates, and transactions.
In Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance.
[Green et al., 2007] Green, T. J., Karvounarakis, G., and Tannen, V. (2007).
Provenance semirings.
In PODS, pages 31–40.
[Meliou et al., 2010] Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. (2010).
The complexity of causality and responsibility for query answers and non-answers.
PVLDB, 4(1):34–45.
[Roy and Suciu, 2014] Roy, S. and Suciu, D. (2014).
A formal approach to finding explanations for database queries.
In SIGMOD, pages 1579–1590.
[Wu and Madden, 2013] Wu, E. and Madden, S. (2013).
Scorpion: Explaining away outliers in aggregate queries.
PVLDB, 6(8):553–564.
Slide 16 of 16 Q. Zeng - CAPE: Bibliography