This document presents a vision for a generic provenance middleware called GProM that can compute provenance for database queries, updates, and transactions. Some key points:
- GProM uses query rewriting and annotation propagation techniques to compute provenance in a non-invasive way.
- It introduces the concept of "reenactment queries" to compute provenance for past transactions by simulating their effects, using time travel to access past database states.
- The reenactment queries are then rewritten to propagate provenance annotations, yielding the provenance of the entire transaction (both steps are sketched below).
- GProM aims to support multiple provenance types and storage policies in a database-independent way through an extensible, modular architecture.
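As a hedged illustration of those two steps, the Python sketch below builds (1) a reenactment query that simulates a past UPDATE over a time-travel snapshot, assuming Oracle-style AS OF TIMESTAMP syntax and a toy accounts(id, value) schema, and (2) a provenance rewrite that attaches to each result row the id of the input row it derives from. GProM itself performs such rewrites on relational algebra internally; all names here are illustrative.

```python
# A minimal sketch, not GProM's actual implementation. The AS OF TIMESTAMP
# clause assumes Oracle-style time travel; table and column names are toy
# assumptions.

def reenact_update(table: str, set_expr: str, where: str, ts: str) -> str:
    """Simulate a past UPDATE as a query over the pre-update database state.

    Rows matching WHERE appear with updated values; all other rows are
    returned unchanged, mimicking the update without re-executing it.
    """
    return (
        f"SELECT CASE WHEN {where} THEN {set_expr} ELSE value END AS value, id "
        f"FROM {table} AS OF TIMESTAMP {ts}"
    )

def add_provenance(query: str, table: str) -> str:
    """Rewrite a query so each result row also carries the id of the
    input row it was derived from (a simple provenance annotation)."""
    return f"SELECT q.*, q.id AS prov_{table}_id FROM ({query}) q"

reenacted = reenact_update("accounts", "value * 1.05", "value > 1000",
                           "TO_TIMESTAMP('2014-06-01 12:00:00')")
print(add_provenance(reenacted, "accounts"))
```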
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R - Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL Server
How to design a Disaster Recovery Plan for HDP (Hortonworks Data Platform) Clusters?
Mohamed Mehdi BEN AISSA, Big Data Practice Manager at FINAXYS and Big Data ITO at CACIB
For HDP clusters, we first suggest different Disaster Recovery Plan solutions depending on the SLA (Service-Level Agreement) requirements: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). In a second phase, we focus on the stretch-cluster solution: its advantages, its drawbacks, and the impact of this choice on the overall architecture. Finally, we explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer, etc.) into the architecture.
Alibaba has built its data infrastructure on Apache Hadoop YARN since 2013, and it now manages more than 10,000 nodes. At Alibaba, Hadoop YARN serves various systems such as search, advertising, and recommendation. It runs not just batch jobs but also streaming, machine learning, OLAP, and even online services that directly impact Alibaba's user experience. To extend YARN's ability to support such complex scenarios, we have contributed to and leveraged many YARN 3.x improvements. In this talk, you will learn what these improvements are and how they helped solve difficult problems in large production clusters.
These include:
1. Significantly improved performance with the Capacity Scheduler's asynchronous scheduling framework
2. Better placement decisions with node attributes and placement constraints
3. Better resource utilization with opportunistic containers
4. A load balancer that evens out resource utilization across nodes
5. Scheduling and isolation of generic resource types to manage new resources such as GPUs and FPGAs
In the presentation, we will further introduce how we build the entire ecosystem on top of YARN and how we keep evolving YARN's ability to tackle the challenges brought by Alibaba's continuously growing data and business.
Speakers
Weiwei Yang, Alibaba, Staff Software Engineer
Ren Chunde, Alibaba Group, Senior Engineer
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
This document summarizes an intern's report on an internship at Altisource working with big data on the Hadoop platform. The intern researched how data is stored in an RDBMS versus Hadoop, learned query languages such as HiveQL and MySQL, and gained knowledge of MapReduce, Spark, and Sqoop. As part of a project, the intern analyzed Altisource data using Apache Hadoop and Spark. The intern also set up a 3-node cluster on Google Cloud and used it to analyze NASA clickstream and other data, demonstrating concepts such as the linear increase in processing time with data size. The intern concluded that the internship provided exposure to querying languages and distributed computing concepts.
The document discusses the MapR Big Data platform and Apache Drill. It provides an overview of MapR's M7 which makes HBase enterprise-grade by eliminating compactions and enabling a unified namespace. It also describes Apache Drill, an interactive query engine inspired by Google's Dremel that supports ad-hoc queries across different data sources at scale through its logical and physical query planning. The document demonstrates simple queries and provides details on contributing to and using Apache Drill.
This document provides an introduction to project management. It defines a project, compares projects and operations, and outlines what makes a project successful or fail. It then defines project management and its key areas including scope, issue, cost, quality, communications, risk, and change management. The five phases of project management are also outlined. Finally, it discusses common project management tools and the role of the project manager.
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange" - Boris Glavic
The document discusses value invention in data exchange and schema mappings. It introduces the data exchange problem involving mapping source and target schemas using a specification. Value invention involves creating values to represent incomplete information when materializing the target schema. The goal is to understand when schema mappings specified by second-order tuple-generating dependencies (SO tgds) can be rewritten as nested global-as-view mappings, which have more desirable computational properties. The paper presents an algorithm called Linearize that rewrites SO tgds as nested GLAV mappings if they are linear and consistent. It also discusses exploiting source constraints like functional dependencies to find an equivalent linear mapping.
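To make value invention concrete, here is a canonical textbook-style example (not taken from the talk): the SO tgd invents a target value via the function term f(x), while the corresponding GLAV constraint merely asserts that such a value exists.

```latex
% SO tgd: a second-order function term f invents the target value.
\exists f\, \forall x\, \big(\mathrm{Emp}(x) \rightarrow \mathrm{Mgr}(x, f(x))\big)
% Equivalent GLAV constraint: the invented value is only asserted to exist.
\forall x\, \big(\mathrm{Emp}(x) \rightarrow \exists y\, \mathrm{Mgr}(x, y)\big)
```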
On the need for applications aware adaptive middleware in real-time RDF data analytics - Zia Ush Shamszaman
The document proposes an adaptive middleware approach for real-time RDF data analytics. It discusses how different RSP engines have different query languages, data models, execution strategies, and output models. It hypothesizes that an adaptive approach could improve efficiency and correctness by adapting to dynamic application requirements and data stream properties at runtime. It provides examples of events with different notification timing requirements to illustrate the need for the adaptive approach.
This document provides an overview of data science work at Zillow. It discusses Zillow's use of machine learning models like the Zestimate and Rent Zestimate to analyze housing data. It describes Zillow's technology stack, which heavily leverages Python, R, and SQL. Specific examples are provided on automated waterfront determination using GIS data and discovering home street features. The document also discusses how tools like Dato and Scikit-Learn are used for tasks like fraud detection, property matching, and data modeling. In closing, current job openings at Zillow are listed.
This document discusses benchmarking Apache Druid using the Star Schema Benchmark (SSB). It describes ingesting SSB data into Druid, optimizing queries and segments, running queries using JMeter, and the results. Key aspects covered include partitioning data, controlling segment size, explaining query plans, and configuring JMeter. The document encourages readers to try benchmarking Druid themselves to better learn how to optimize it for their own use cases and data.
This document discusses benchmarking Apache Druid using the Star Schema Benchmark (SSB). It describes ingesting the SSB dataset into Druid, optimizing the data and queries, and running performance tests on the 13 SSB queries using JMeter. The results showed Druid can answer the analytic queries in sub-second latency. Instructions are provided on how others can set up their own Druid benchmark tests to evaluate performance.
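As a hedged illustration of the measurement loop both summaries describe, the sketch below times a single SSB-style query against Druid's SQL HTTP endpoint (POST /druid/v2/sql); the endpoint URL, datasource name, and query text are assumptions, and the decks themselves drive all 13 queries through JMeter.

```python
# A minimal sketch of timing one SSB-style query against Druid's SQL API.
# The broker URL, datasource name, and query are assumptions.
import time
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # assumed router endpoint
QUERY = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM ssb_lineorder
WHERE lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
"""

def time_query(sql: str, runs: int = 5) -> float:
    """Return the average wall-clock latency of a Druid SQL query in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(DRUID_SQL, json={"query": sql})
        resp.raise_for_status()
        total += time.perf_counter() - start
    return total / runs

print(f"avg latency: {time_query(QUERY):.3f}s")
```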
Exploring Neo4j Graph Database as a Fast Data Access Layer - Sambit Banerjee
This article describes the findings of an extensive investigative work conducted to explore the feasibility of using a Neo4j Graph Database to build a Fast Data Access Layer with near-real time data ingestion from the underlying source systems.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil - Databricks
ExxonMobil leveraged machine learning at scale using Databricks to extract insights from equipment maintenance logs and improve operations. The logs contained both structured and unstructured text data across a global fleet maintained in legacy systems, limiting traditional analysis. By ingesting and enriching over 60 million records using natural language processing, the system identified outliers, enabled capacity planning, and prioritized maintenance tasks, projected to save millions annually through more effective reliability and maintenance guidance.
Auto-Pilot for Apache Spark Using Machine Learning - Databricks
At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work[1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model[1] is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them. [1] Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD
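The core loop described above (propose a configuration, run the query, feed the observed metrics back, stop at the budget) can be sketched as follows; the propose and run_query interfaces are hypothetical stand-ins, not Qubole's actual API.

```python
# A minimal sketch of the budgeted auto-tuning loop; all interfaces here
# are assumptions for illustration.
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

def auto_tune(propose: Callable[[List[Tuple[Config, float]]], Config],
              run_query: Callable[[Config], float],
              budget: int) -> Config:
    """Repeatedly ask the model for a config, run the query with it, and
    feed the observed runtime back until the trial budget is exhausted."""
    history: List[Tuple[Config, float]] = []  # (config, runtime) pairs
    for _ in range(budget):
        cfg = propose(history)       # model combines rules + ML here
        runtime = run_query(cfg)     # metrics come from the event log
        history.append((cfg, runtime))
    return min(history, key=lambda h: h[1])[0]  # best config observed
```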
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
If your business is heavily dependent on the Internet, you may be facing an unprecedented level of network traffic analytics data. How to make the most of that data is the challenge. This presentation from Kentik VP Product and former EMA analyst Jim Frey explores the evolving need, the architecture and key use cases for BGP and NetFlow analysis based on scale-out cloud computing and Big Data technologies.
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns - Allen Day, PhD
This document discusses design patterns for big data applications. It begins by defining what a design pattern is, then provides examples of patterns for different types of data volumes and query speeds. Common patterns like percolation and recommendation systems are explained. The document also discusses how to analyze big data applications to determine which patterns may apply. Specific examples like personalized search, medicine, and market segmentation are used to illustrate how patterns can be implemented. The key lessons are to take a high-level view of recurring problems and design reusable pattern-based solutions.
An empirical evaluation of cost-based federated SPARQL query Processing Engines - Umair Qudus
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected, and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics on LargeRDFBench queries. Our results provide a detailed analysis of the experimental outcomes and reveal novel insights useful for the development of future cost-based federated SPARQL query processing engines.
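The paper defines its own fine-grained metrics; purely as a point of reference, the sketch below computes the widely used q-error, a baseline measure of cardinality estimation accuracy (not the paper's metric).

```python
# The q-error is a standard baseline measure of cardinality estimation
# accuracy; the paper's own metrics are more fine-grained than this.
def q_error(estimated: float, actual: float) -> float:
    """Multiplicative error of a cardinality estimate; 1.0 is perfect.

    Symmetric in over- and under-estimation:
    q_error(10, 100) == q_error(1000, 100) == 10.0
    """
    if estimated <= 0 or actual <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimated / actual, actual / estimated)
```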
EnterpriseDB's Best Practices for Postgres DBAs - EDB
This document provides an agenda and overview for a presentation on best practices for PostgreSQL database administrators (DBAs). The presentation covers EnterpriseDB's expertise in PostgreSQL, the key responsibilities of a PostgreSQL DBA including monitoring, maintenance, capacity planning and configuration tuning. It also discusses deployment planning, professional development resources, and takes questions. Examples from architectural health checks and remote DBA services illustrate common issues found like index bloat and lack of backups. The document recommends performance monitoring and security tools and techniques for PostgreSQL.
Present & Future of Greenplum Database A massively parallel Postgres Database... - VMware Tanzu
Greenplum Database is Pivotal's massively parallel Postgres database. Version 5 has proven features for mission critical use cases. Version 6 adds improvements like row-level locking, foreign data wrappers, and online expansion to make Greenplum a superset of Postgres. It also provides up to 50x faster OLTP performance. Version 7 will focus on capabilities beyond the cluster like streaming replication and using Greenplum as a source for data integration tools.
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale - Saurabh Verma
This document summarizes a company's transition from a SQL database to a native graph database to power their identity resolution product. It describes the requirements of high read and write throughput and complex queries over billions of identities and linkages. It then outlines the evaluation of several graph databases, with JanusGraph on ScyllaDB performing the best. Key findings from prototyping include handling high query volume, managing supernodes, and tuning compaction strategies. The production implementation and architecture is also summarized.
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale - ScyllaDB
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs, and growing. In their presentation, Zeotap engineers will delve into data access patterns and processing and storage requirements to make the case for a graph-based store. They will share the results of PoCs made on technologies such as Dgraph, OrientDB, Aerospike, and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required in terms of production setup, configuration, and performance tuning to manage data at this scale.
This document discusses Wipro's experience helping a customer transition from their existing SIEM platform to Splunk for security monitoring and analytics. It describes how Wipro guided the customer through a two-phase implementation: first standing up a hybrid on-premise/cloud Splunk deployment to address immediate needs, and now expanding that deployment to 500GB/day in Splunk Cloud and 200GB/day on-premise to accommodate growing data and use cases. The transition yielded significant improvements in search performance, data ingestion and parsing flexibility, and enhanced security visualization and analytics capabilities.
The document summarizes a performance evaluation of the Geo2Tag location-based services platform. It describes modeling client-server interactions to identify the most frequent requests, measuring those requests' performance, and optimizing the platform. Specifically, it identified database interaction as the bottleneck, optimized database synchronization, and saw average request processing times decrease by 47.5% after the optimizations. The evaluation provided insights into maximizing performance and informed future work on supporting NoSQL databases and lock-free algorithms.
Using Perforce Data in Development at Tableau - Perforce
Data plays a big role at Tableau—not just for our customers, but also throughout our company. Using our own products is not only one of our fundamental company values, but the analysis and discoveries we make are important to track as they shape our development processes and influence our day-to-day decisions. In this talk, we present and analyze a variety of data visualizations based on Perforce data from our development organization and share how it has influenced our infrastructure and development practices.
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ... - Boris Glavic
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
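As a deliberately simplified toy (assumed, not the paper's formalism), the sketch below marks each tuple as certain (in the under-approximation) or merely possible (in the over-approximation only) and shows how a selection propagates these marks unchanged.

```python
# A toy illustration of the under/over-approximation idea; the UA-DB paper
# develops this over incomplete K-relations, which this sketch does not model.
def select(pred, relation):
    """Filter an annotated relation; certain inputs stay certain,
    merely-possible inputs stay marked as uncertain in the output."""
    return [(row, certain) for row, certain in relation if pred(row)]

ua_db = [(("Alice", 30), True),   # certain tuple (under-approximation)
         (("Bob", 41), False)]    # possible tuple (over-approximation only)

print(select(lambda r: r[1] > 25, ua_db))
# -> both rows qualify, but only Alice's row is a certain answer
```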
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
2016 VLDB - The iBench Integration Metadata Generator - Boris Glavic
Given the maturity of the data integration field it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work - the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide-range of integration tasks (data exchange, mapping creation, mapping composition, schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms - Boris Glavic
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers - Boris Glavic
This document introduces a unified framework for generalizing explanations for answers and non-answers to why/why-not questions over unions of conjunctive queries (UCQs). It utilizes an available ontology, expressed as inclusion dependencies, to map concepts to instances and generate generalized explanations. Generalized explanations describe subsets of an explanation using concepts from the ontology; the most general explanation is the one that is not dominated by any other explanation. The approach is implemented using Datalog rules that model subsumption checking, successful and failed rule derivations, and the computation of explanations, their generalizations, and the most general explanations.
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON - Boris Glavic
Since its inception, the PROV standard has been widely adopted as a standardized exchange format for provenance information. Surprisingly, this standard is currently not supported by provenance-aware database systems, limiting their interoperability with other provenance-aware systems. In this work we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking outputs of an SQL operation to the imported provenance for its inputs. Our implementation in the GProM system offloads generation of PROV documents to the backend database. This implementation enables provenance tracking for applications that use a relational database for managing (part of) their data, but also execute some non-database operations.
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers - Boris Glavic
Explaining why an answer is present (traditional provenance) or absent (why-not provenance) from a query result is important for many use cases. Most existing approaches for positive queries use the existence (or absence) of input data to explain a (missing) answer. However, for realistically-sized databases, these explanations can be very large and, thus, may not be very helpful to a user. In this paper, we argue that logical constraints as a concise description of large (or even infinite) sets of existing or missing inputs can provide a natural way of answering a why- or why-not provenance question. For instance, consider a query that returns the names of all cities which can be reached with at most one transfer via train from Lyon in France. The provenance of a city in the result of this query, say Dijon, will contain a large number of train connections between Lyon and Dijon which each justify the existence of Dijon in the result. If we are aware that Lyon and Dijon are cities in France (e.g., an ontology of geographical locations is available), then we can use this information to generalize the query output and its provenance to provide a more concise explanation of why Dijon is in the result. For instance, we may conclude that all cities in France can be reached from each other through Paris. We demonstrate how an ontology expressed as inclusion dependencies can provide meaningful justifications for answers and non-answers, and we outline how to find a most general such explanation for a given UCQ query result using Datalog. Furthermore, we sketch several variations of this framework derived by considering other types of constraints as well as alternative definitions of explanation and generalization.
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance - Boris Glavic
We reconsider some of the explicit and implicit properties that underlie well-established definitions of data provenance semantics. Previous work on comparing provenance semantics has mostly focused on expressive power (does the provenance generated by a certain semantics subsume the provenance generated by other semantics) and on understanding whether a semantics is insensitive to query rewriting (i.e., do equivalent queries have the same provenance). In contrast, we try to investigate why certain semantics possess specific properties (like insensitivity) and whether these properties are always desirable. We present a new property, stability with respect to query language extension, that, to the best of our knowledge, has not been isolated and studied on its own.
EDBT 2009 - Provenance for Nested Subqueries - Boris Glavic
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation, and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model through Query Rewriting - Boris Glavic
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of a relational database, the source and intermediate data items are relations, tuples, and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored, and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, inducing only a small overhead on normal operations.
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Provenance - Boris Glavic
Though partially automated, developing schema mappings remains a complex and potentially error-prone task. In this paper, we present TRAMP (TRAnsformation Mapping Provenance), an extensive suite of tools supporting the debugging and tracing of schema mappings and transformation queries. TRAMP combines and extends data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and those transformations and mappings that produced that data. In addition we provide query support for transformations, data, and all forms of provenance. We formally define transformation and mapping provenance, present an efficient implementation of both forms of provenance, and evaluate the resulting system through extensive experiments.
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking" - Boris Glavic
This document discusses big data provenance and its implications for benchmarking. It begins by outlining provenance, describing challenges of big data provenance, and providing examples of approaches taken. It then discusses how provenance could be used for benchmarking by serving as data and workloads. Provenance-based metrics and using provenance for profiling and monitoring systems are proposed. Generating large datasets and workloads from provenance data is suggested to address issues with big data benchmarking.
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams" - Boris Glavic
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
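As a hedged sketch of the operator-instrumentation idea (the tuple format and operators here are assumptions; Ariadne instruments Borealis operators and adds stream-specific compression), each tuple carries the set of input ids it depends on:

```python
# A minimal sketch: stream operators modified to propagate provenance ids.
from typing import Callable, Iterable, Tuple

Tagged = Tuple[object, frozenset]  # (value, set of contributing input ids)

def instrumented_map(fn: Callable, stream: Iterable[Tagged]) -> Iterable[Tagged]:
    """Apply fn to each tuple's value while carrying its provenance along."""
    for value, prov in stream:
        yield fn(value), prov

def instrumented_agg(window: Iterable[Tagged]) -> Tagged:
    """A windowed sum whose provenance is the union of all input ids."""
    vals = list(window)
    total = sum(v for v, _ in vals)
    prov = frozenset().union(*(p for _, p in vals)) if vals else frozenset()
    return total, prov

window = [(3, frozenset({"t1"})), (4, frozenset({"t2"}))]
print(instrumented_agg(instrumented_map(lambda x: x * 2, window)))
# -> (14, frozenset({'t1', 't2'}))
```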
TaPP 2013 - Provenance for Data Mining - Boris Glavic
Data mining aims at extracting useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output summarizing the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful, if the user is not able to understand on which input data they are based and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use-cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.
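As a toy example of the kind of provenance the paper argues for (assumed for illustration, not taken from the paper), the provenance of a frequent itemset can be taken to be the set of transactions that support it:

```python
# A toy sketch: each mined itemset is traced back to the transactions
# (its provenance) that support it. All data here is illustrative.
from itertools import combinations

transactions = {"t1": {"milk", "bread"},
                "t2": {"milk", "bread", "eggs"},
                "t3": {"bread"}}

def frequent_itemsets_with_provenance(txs, min_support=2, size=2):
    """Return {itemset: supporting transaction ids} for frequent itemsets."""
    items = sorted(set().union(*txs.values()))
    result = {}
    for combo in combinations(items, size):
        support = {tid for tid, t in txs.items() if set(combo) <= t}
        if len(support) >= min_support:
            result[frozenset(combo)] = support
    return result

print(frequent_itemsets_with_provenance(transactions))
# -> {frozenset({'bread', 'milk'}): {'t1', 't2'}}
```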
This document discusses auditing and maintaining provenance in software packages. It presents CDE-SP, an enhancement to the CDE system that captures additional details about software dependencies to enable attribution of authorship as software packages are combined and merged into pipelines. CDE-SP uses a lightweight LevelDB storage to encode process and file provenance within software packages. It provides queries to retrieve dependency information and validate authorship by matching provenance graphs. Experiments show CDE-SP introduces negligible overhead compared to the original CDE system.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for the cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
This document discusses abnormal (anomalous) secondary growth in plants. Secondary growth is defined as an increase in plant girth due to the activity of the vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems (University of Maribor)
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz), I decided not to walk through the details of the many methodologies in order of use. Instead, I chose to employ a long-standing, and ongoing, scientific development as an exemplar. And so I chose the ever-evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of more than 200 years, Thermodynamics R&D, and its application, benefited from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at both micro and macro levels.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science, engineering, and technology, spanning micro-tech to aerospace and cosmology. I can think of no better story to illustrate the breadth of scientific methodologies and applications at their best.
The Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. The test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way's (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, Updates, and Transactions
1. A Generic Provenance Middleware for Database Queries, Updates, and Transactions
Bahareh Sadat Arab (1), Dieter Gawlick (2), Venkatesh Radhakrishnan (2), Hao Guo (1), Boris Glavic (1)
(1) IIT DBGroup, (2) Oracle
2. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
3. Introduction
• Data Provenance
– Information about the origin and creation process of data
• Provenance tracking for database operations
– Considerable interest from the database community in the last decade
• The de-facto standard for database provenance [1,2,3,5,7]
– model provenance as annotations on data (e.g., tuples)
– compute the provenance by propagating annotations (query rewrite)
SELECT DISTINCT Owner FROM CanAcc;
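As a minimal sketch (our illustration, following the attribute-duplication scheme shown in the example rewrite on slide 12), the rewritten version of this fragment returns each Owner value together with the input tuple it stems from, encoded as extra attributes; DISTINCT is dropped because duplicates now differ in their provenance attributes:

SELECT Owner,
ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
FROM CanAcc;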
4. Use Cases
• Debugging data and transformations (queries) [1]
• Probabilistic databases (queries) [5]
• Auditing and compliance (transactions and update statements) [6]
• Understanding data integration transformations (queries and transactions)
• Assessing data quality and trust (queries and transactions) [7]
Computing provenance for updates and transactions is essential for many use cases.
5. Shortcomings of the State of the Art
• No practical implementation for updates
• No system or model supports transactions
• Inflexible provenance storage
– Always on [2,3]
– On-demand only [1]
• Query rewrites use atypical access patterns and operator sequences
– leads to poor execution plans
• Most systems support only one type of provenance
6. Objectives
1. Vision: a Generic Provenance Database Middleware (GProM)
– Provenance for queries, updates, and transactions
– User decides when to compute and store provenance
– Supports multiple provenance models
– Database-independent
2. Tracking provenance of concurrent transactions
– Reenactment queries
7. Contributions
1. First solution for provenance of transactions
2. Retroactive on-demand provenance computation
– Using read-only reenactment
3. Only requires an audit log + time travel
– Supported by most DBMS
– No additional storage or runtime overhead
4. Non-invasive provenance computation
– query rewrite + annotation propagation
8. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
9. System Architecture
• Database-independent middleware
– Pluggable parser and SQL code generator
• Internal query representation
– Relational Algebra Graph Model (AGM)
• Core driver: query rewrites
– Provenance computation
– Flexible storage policies for provenance
– Provenance import/export
– AGM optimizer (for rewritten queries)
– Extensibility: Rewrite Specification Language (RSL)
• Initial prototype built on top of Oracle
11. Provenance Computation
• Query rewrite
– Take the original query q and rewrite it into q+, which computes the original results + provenance
– Propagate provenance through operations
[Diagram: query Q over DB produces Result; the rewritten query Q+ produces Result + Provenance]
12. Example Rewrite
• Input:
SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID;
• Rewrite parts (original fragment → rewritten fragment):
– USacc → SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc
– CanAcc → SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc
– WHERE u.ID = c.ID → WHERE u.ID = c.ID
– SELECT DISTINCT Owner → SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8
• Output:
SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
FROM
(SELECT ID, Owner, Balance, Type,
ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
FROM USacc) u,
(SELECT ID, Owner, Balance, Type,
ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
FROM CanAcc) c
WHERE u.ID = c.ID;
13. Provenance Computation
• Operates on the relational algebra representation of queries
– Fixed set of rewrite rules per provenance type:
• One per type of algebra operator
• Recursive top-down rewrite
– For each relation access: duplicate attributes as provenance
– For each operator: replace it with an algebra graph that propagates provenance annotations
• Composable
14. Supporting Past Queries, Updates, and Transactions
• Only needs an audit log and time travel
– supported by most DBMS
• Sufficient for provenance of past queries [4]
• Our contribution
– sufficient for updates and transactions
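To make the two ingredients concrete, here is a minimal sketch in Oracle syntax; the audit-trail view and its columns are an assumption that depends on how auditing is configured, and the SCN and transaction id are illustrative:

-- Time travel: read the version of USacc as of a past system change number
SELECT * FROM USacc AS OF SCN 3652;

-- Audit log: retrieve the statements executed by one past transaction
SELECT scn, sql_text
FROM dba_fga_audit_trail
WHERE transactionid = :xid
ORDER BY scn;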
15. Provenance Generation and Storage Policies
• GProM default
– Only compute provenance if explicitly requested
• Users can register storage policies
– When to store which type of provenance
POLICY storeOnR {
  FIRE ON Query, Insert q
  WHEN Root(q) +=> Table(R)
  COMPUTE PI-CS
  STORE AS NEW TABLE
  NAMING SCHEME Hash
}
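Under the default, a user asks for provenance explicitly; the request below uses GProM's SQL language extension (PROVENANCE OF), though the exact surface syntax shown here should be read as illustrative:

PROVENANCE OF (SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID);

-- The middleware parses this, applies the rewrite from slide 12, and sends
-- plain SQL to the backend database.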
16. Optimizing Rewritten Queries
• Query rewrites use atypical access patterns and operator sequences
– leads to poor execution plans
• Optimizations for rewritten queries
– Heuristic
– Cost-based
• Example: two equivalent forms of the same projection; the CASE form computes the conditional inline, while the UNION ALL form splits it into two filtered branches that scan u1 separately

CASE form:
SELECT ID, Owner, Balance,
CASE
WHEN Balance > 1000000
THEN 'Premium'
ELSE Type
END AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
...

UNION ALL form:
SELECT ID, Owner, Balance, 'Premium' AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE
17. Rewrite Extensibility
• Extensible using the Rewrite Specification Language (RSL)
– Concise specification of rewrite rules
RULE mergeSelections {
  FOR q => c => g
  WHERE q->type = selection AND c->type = selection
  REWRITE INTO
  selection [pred = q->pred AND c->pred] => g
}
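In SQL terms, the effect of mergeSelections is to collapse two stacked selections into a single one; a minimal illustration over a hypothetical table R:

SELECT * FROM (SELECT * FROM R WHERE a > 5) t WHERE b < 10;
-- after mergeSelections, the plan corresponds to:
SELECT * FROM R WHERE a > 5 AND b < 10;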
[Diagram: RSL workflow connecting the User, the RSL Manager, the RSL Interpreter, and the Provenance Rewriter via registered policies]
18. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
20. Provenance of Transactions
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');

UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;

COMMIT;
21. Provenance of Transactions
• The same transaction again, with its statements labeled: the INSERT is u1 and the UPDATE is u2.
22. Provenance of Transactions
• Our Approach: Reenactment + Provenance Propagation
• Currently supports
– Snapshot Isolation
– Statement-level Snapshot Isolation
Pipeline:
1. Gather transaction information
2. Construct update reenactment queries
3. Construct transaction reenactment query
4. Rewrite for provenance computation
5. Execute query
23. 1. Gather Transaction Information
• Retrieve the SQL statements of the transaction from the audit log
• Update u1:
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
• Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
24. 2. Translate Updates: Reenactment
• An update reads a table version and outputs an updated table version
• Multiple versions of the database
– Each modification of a tuple t causes a new version to be created
– Old tuple versions are kept (SI)
– Add a version annotation τ to the provenance of each updated row
• Uses the semiring model
Example update:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
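As a worked sketch of this bookkeeping (the notation here is illustrative, not necessarily the talk's): if a tuple t carries semiring annotation x in the version that u2 reads, then in the output version

\[
\mathrm{ann}(t) =
\begin{cases}
\tau_{u_2} \cdot x & \text{if } t \text{ satisfies } \mathit{Balance} > 1000000,\\
x & \text{otherwise,}
\end{cases}
\]

so modified rows are tagged with the version annotation while unmodified rows keep their annotation unchanged.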
25. 2. Translate Updates
• Construct an update reenactment query
– Simulates the effect of the update
– Reads the DB version seen by the update using time travel
– Query result = updated table (annotation-equivalent)
• Insert u1:
INSERT INTO USacc
(SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
is reenacted as:
SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652;
• Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
is reenacted as:
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM USacc AS OF SCN 3652
WHERE Balance > 1000000
UNION ALL
SELECT *
FROM USacc AS OF SCN 3652
WHERE (Balance > 1000000) IS NOT TRUE;
26. 3. Construct Reenactment Query
• Simulates the whole transaction
– Annotation-equivalent to the original transaction
• Merge the reenactment queries based on the concurrency control protocol
– Each concurrency control protocol requires a different merge process
– SERIALIZABLE (snapshot isolation): each statement sees modifications committed before the transaction started + previous updates of the same transaction
– READ COMMITTED (statement-level snapshot isolation): each statement also sees committed changes of concurrent transactions
WITH u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652)
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE;
27. 4. Rewrite for Provenance Computation
• Rewrite the reenactment query to compute provenance using annotation propagation
WITH
u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type,
ID AS prov_CanAcc_ID,
. . .
NULL AS prov_USacc_ID,
. . .
1 AS updated
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT ID, Owner, Balance, Type,
NULL AS prov_CanAcc_ID,
. . .
ID AS prov_USacc_ID,
. . .
0 AS updated
FROM USacc AS OF SCN 3652),
. . .
u2 AS
(SELECT . . .
28. 5. Execute Query
• Execute the query to retrieve the provenance
Updated USacc tuples (ID, Owner, Balance, Type) | Provenance from CanAcc (P1, P2, P3) | Provenance from USacc (P4, P5, P6)
(3, Alice Bright, 1,500,000, Premium) | (3, Alice Bright, 1,500,000) | (NULL, NULL, NULL)
(5, Mark Smith, 50, Standard) | (5, Mark Smith, 50) | (NULL, NULL, NULL)
29. Conclusions
• We present our vision for GProM
– a database-independent middleware for computing the provenance of queries, updates, and transactions
• First solution for provenance of transactions
• Query rewrite techniques on steroids:
– Provenance computation
– Transaction reenactment
– Provenance translation
– Provenance storage
– Optimization
• Extensible through the RSL language
30. Future Work
• Implementing additional provenance types
• Comprehensive study of heuristic and cost-based optimizations
• Design and implementation of RSL
• Implementing additional provenance formats
• Studying reenactment for other concurrency control mechanisms
– Locking protocols (2PL)
• Investigating additional use cases for reenactment
– Transaction backout
– Retroactive what-if analysis
32. References
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3):19, 2013.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–322, 2010.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.
33. Q-Bomb
• One pattern that arises from reenactment is long chains of SELECT clauses using CASE
– Each level references attributes from the next level multiple times
– Subquery pull-up creates expressions of size exponential in the number of SELECT clauses
– In practice: optimization never finishes
• Minimal example using a one-row table R:
SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
…
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM R) … );
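To see why pull-up explodes, substitute a single inner level into the outer CASE; this sketch shows what the optimizer materializes:

SELECT CASE WHEN b < 100
            THEN CASE WHEN b < 100 THEN a ELSE a + 2 END
            ELSE (CASE WHEN b < 100 THEN a ELSE a + 2 END) + 2
       END AS a, b
FROM R;
-- Each pulled-up level doubles the references to a, so k levels yield an
-- expression with 2^k occurrences of a.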
39. Types of Update Operations - Insert
• Insert executed at time t
• The updated version of R contains:
1. All tuples from the previous version
2. All newly inserted tuples
– a fixed tuple defined in a VALUES clause, or
– the results of a query over the database version at t
• Union these two sets (see the templates below)
INSERT INTO R VALUES (v1, ... , vn);
becomes:
(SELECT * FROM R AS OF t)
UNION ALL
(SELECT v1 AS a1, ... , vn AS an);

INSERT INTO R (q);
becomes:
(SELECT * FROM R AS OF t)
UNION ALL
(q(t));
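A concrete instance of the VALUES form, using hypothetical values over the running USacc schema:

-- INSERT INTO USacc VALUES (7, 'Jane Doe', 100, 'Standard');
-- reenacts as:
(SELECT * FROM USacc AS OF t)
UNION ALL
(SELECT 7 AS ID, 'Jane Doe' AS Owner, 100 AS Balance, 'Standard' AS Type);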
40. Types of Update Operations - Delete
• Delete executed at time t
• Tuples in the updated version of R:
– All tuples from the previous version for which the condition is not fulfilled
DELETE FROM R WHERE C;
becomes:
SELECT * FROM R AS OF t
WHERE (C) IS NOT TRUE;
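The template uses (C) IS NOT TRUE rather than NOT C on purpose: under SQL's three-valued logic, rows for which C evaluates to NULL are not deleted, so the reenactment must keep them as well. A hypothetical instance:

-- DELETE FROM USacc WHERE Balance > 1000000;
-- reenacts as:
SELECT * FROM USacc AS OF t
WHERE (Balance > 1000000) IS NOT TRUE;
-- keeps rows with Balance <= 1000000 and rows where Balance IS NULL.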
41. Types of Update Operations - Update
• Update executed at time t
• Find tuples where the condition holds and update their attribute values
• Find tuples where the condition does not hold
• Union these two sets (A' projects the SET expressions)
UPDATE R SET A WHERE C;
becomes:
(SELECT A' FROM R AS OF t WHERE C)
UNION ALL
(SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE);
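Instantiated with the running example from slide 25, where A' becomes the projection that applies SET Type = 'Premium':

-- UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;
-- becomes:
(SELECT ID, Owner, Balance, 'Premium' AS Type
 FROM USacc AS OF t WHERE Balance > 1000000)
UNION ALL
(SELECT * FROM USacc AS OF t
 WHERE (Balance > 1000000) IS NOT TRUE);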
42. READ COMMITTED
• A statement of transaction T sees committed changes of concurrent transactions
• For a given update we need to combine
– tuples produced by previous statements of the same transaction
– tuples produced by transactions that committed before the update
• Observations
– Once a transaction T modifies a tuple t, no other transaction can access t until T commits
– Let ui be the update of T, executed at time x, that first modifies t
– ui reads the latest version of t committed before x
– If we know ui, then updates of T before x do not have to look at t
• Consider the database version one time unit before the commit of T (version C-1)
– It contains all the tuple versions seen by the first update of T that modified each individual tuple
– Let t be a tuple version in this database version with start time y
– Updates of T executed before y cannot have updated t
– We can use version C-1 as input for reenactment as long as we hide tuple version t (with start time y) from the reenactment of an update executed at time x with x < y
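A hypothetical timeline makes the argument concrete (all times are illustrative):

-- t=5   transaction T'' commits version t1@5 of tuple t1
-- t=10  T executes u1 (u1 does not modify t1)
-- t=12  concurrent T' commits version t1@12 of tuple t1
-- t=15  T executes u2, the first statement of T to modify t1 -> u2 reads t1@12
-- t=20  T commits (C = 20)
-- The externally visible version of t1 at C-1 = 19 is t1@12 (T's own write is
-- still uncommitted), exactly the version u2 read. Reenacting u1 against
-- version C-1 must hide t1@12, since its start time (12) lies after u1's
-- execution time (10).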
43. READ COMMITTED
WITH u1 AS
(SELECT
  CASE WHEN Balance <= 1000000 AND version <= 0 THEN 'Standard' ELSE Type END AS Type,
  ID, Owner, Balance,
  CASE WHEN Balance <= 1000000 AND version <= 0 THEN -1 ELSE version END AS version
FROM USacc AS OF SCN 3652),
u2 AS
(SELECT
  CASE WHEN Balance > 1000000 AND version <= 1 THEN 'Premium' ELSE Type END AS Type,
  ID, Owner, Balance,
  CASE WHEN Balance > 1000000 AND version <= 1 THEN -1 ELSE version END AS version
FROM u1)
SELECT ID, Owner, Balance, Type FROM u2 WHERE version = -1;
44. Database Independence
• Encapsulate database-specific functionality in pluggable modules
• What needs to be adapted:
1) Parser
2) SQL code generator
3) Metadata access
4) Audit log access
5) Time travel activation
45. Accessing Several Tables
• Transactions accessing several tables
– We require the user to specify which table they are interested in
– Replace the access to a table with the query for the last update that modified that table
[Diagram: a chain of updates U1–U4 reading and writing tables R1, R2, and R3]