A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
Efficient spatial queries on vanilla databasesJulian Hyde
A talk given by Julian Hyde at the Apache Calcite online meetup, 2021/01/20.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>.
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
Efficient spatial queries on vanilla databasesJulian Hyde
A talk given by Julian Hyde at the Apache Calcite online meetup, 2021/01/20.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>.
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Streaming is necessary to handle IoT data rates and latency but SQL is unquestionably the lingua franca of data. Apache Samza and Apache Storm have new high-level query interfaces based on standard SQL with streaming extensions, both powered by Apache Calcite. Calcite's relational algebra allows query optimization and federation with data-at-rest in databases, memory, or HDFS.
A talk given by Julian Hyde at Hadoop Summit, San Jose, on 2016/06/29.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
Streaming is a paradigm for data processing that is rapidly growing in popularity, because it allows high throughput, low latency responses, and efficiently manages multitudes of IoT devices. Is it an alternative to database processing, or is it complementary? Julian Hyde argues for applying the database paradigm to streaming systems, using SQL as a high-level language for streaming. He presents streaming SQL, a super-set of standard SQL developed in collaboration with several Apache projects, and the use cases it can solve, such as combining data in flight with historic data at rest. He also shows how query optimization techniques can make streaming applications more efficient.
A talk given by Julian Hyde at 9th XLDB conference at SLAC, Menlo Park, on 2016/05/25.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk from given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
Why you care about relational algebra (even though you didn’t know it)Julian Hyde
A talk given by Julian Hyde at Enterprise Data World on in Washington, DC on April 2nd, 2015.
With data in different systems, in different formats, and accessed via different tools, we need a lingua franca for data. Not all tools speak SQL, and data cannot be moved into a single convenient location.
Relational algebra underpins SQL and many other DB languages. It is also perfect for optimizing, caching and mediating.
Apache Calcite (formerly Optiq) is a framework for building and optimizing expressions in relational algebra. We show how to write queries, optimize queries using rewrite rules, and write adapters for back-end systems. We also show to configure Calcite to materialize queries, so your interactive analytics are effectively running against a fast in-memory database.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Streaming is necessary to handle IoT data rates and latency but SQL is unquestionably the lingua franca of data. Apache Samza and Apache Storm have new high-level query interfaces based on standard SQL with streaming extensions, both powered by Apache Calcite. Calcite's relational algebra allows query optimization and federation with data-at-rest in databases, memory, or HDFS.
A talk given by Julian Hyde at Hadoop Summit, San Jose, on 2016/06/29.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
Streaming is a paradigm for data processing that is rapidly growing in popularity, because it allows high throughput, low latency responses, and efficiently manages multitudes of IoT devices. Is it an alternative to database processing, or is it complementary? Julian Hyde argues for applying the database paradigm to streaming systems, using SQL as a high-level language for streaming. He presents streaming SQL, a super-set of standard SQL developed in collaboration with several Apache projects, and the use cases it can solve, such as combining data in flight with historic data at rest. He also shows how query optimization techniques can make streaming applications more efficient.
A talk given by Julian Hyde at 9th XLDB conference at SLAC, Menlo Park, on 2016/05/25.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk from given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
Why you care about relational algebra (even though you didn’t know it)Julian Hyde
A talk given by Julian Hyde at Enterprise Data World on in Washington, DC on April 2nd, 2015.
With data in different systems, in different formats, and accessed via different tools, we need a lingua franca for data. Not all tools speak SQL, and data cannot be moved into a single convenient location.
Relational algebra underpins SQL and many other DB languages. It is also perfect for optimizing, caching and mediating.
Apache Calcite (formerly Optiq) is a framework for building and optimizing expressions in relational algebra. We show how to write queries, optimize queries using rewrite rules, and write adapters for back-end systems. We also show to configure Calcite to materialize queries, so your interactive analytics are effectively running against a fast in-memory database.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
Best Practices for Migrating Legacy Data Warehouses into Amazon RedshiftAmazon Web Services
Migrating your data warehouse to Amazon Redshift can substantially improve query and data load performance, increase scalability, and save costs. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost effective to analyze data using your existing business intelligence tools. AWS Database Migration Service and AWS Schema Conversion Tool make it easier to migrate your schema and data from your Oracle data warehouse to Amazon Redshift, without disrupting the applications that rely on the data source.
This talk gives an overview of Apache Drill architecture and describes specific features that enable it to get the best performance over NoSQL databases and distributed file systems. It discusses secondary index planning and execution for NoSQL databases and partition pruning and Parquet filter pushdown for distributed file systems.
Remember the last time you tried to write a MapReduce job (obviously something non trivial than a word count)? It sure did the work, but has lot of pain points from getting an idea to implement it in terms of map reduce. Did you wonder how life will be much simple if you had to code like doing collection operations and hence being transparent* to its distributed nature? Did you want/hope for more performant/low latency jobs? Well, seems like you are in luck.
In this talk, we will be covering a different way to do MapReduce kind of operations without being just limited to map and reduce, yes, we will be talking about Apache Spark. We will compare and contrast Spark programming model with Map Reduce. We will see where it shines, and why to use it, how to use it. We’ll be covering aspects like testability, maintainability, conciseness of the code, and some features like iterative processing, optional in-memory caching and others. We will see how Spark, being just a cluster computing engine, abstracts the underlying distributed storage, and cluster management aspects, giving us a uniform interface to consume/process/query the data. We will explore the basic abstraction of RDD which gives us so many awesome features making Apache Spark a very good choice for your big data applications. We will see this through some non trivial code examples.
Session at the IndicThreads.com Confence held in Pune, India on 27-28 Feb 2015
http://www.indicthreads.com
http://pune15.indicthreads.com
I will begin with a brief overview of SQL. Then the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...EDB
What do you do, when you have to deal with poor database and query performance in PostgreSQL and there is no one around to help? Let us introduce you to important commands in PostgreSQL - EXPLAIN and EXPLAIN ANALYZE. Knowing how to use these 'tools' will help you identify query performance bottlenecks and opportunities. It provides a query plan detailing what approach the planner took to execute the statement provided.
Attend this webinar to learn:
- What are EXPLAIN and EXPLAIN ANALYZE in PostgreSQL?
- How do they help?
- Know planner tuning parameters
Attached here is a presentation that I made covering some bits and pieces of what I got to discover about Data Science and Machine Learning using R Programming Language.
In our Hadoop World 2012 talk "Performing Data Science with HBase", Aaron Kimball and Kiyan Ahmadizadeh demonstrated publicly for the first time our new WibiData Language (or "WDL" for short) -- a concise, powerful, and interactive tool for engineers and data scientists to explore data and write ad-hoc queries against datasets stored in WibiData.
Love Your Database. An interesting example of how to exploit Postgres' advanced types and of running code in the database. Take a step beyond stored procedures!
U-SQL Query Execution and Performance TuningMichael Rys
This 400 level presentation explains the U-SQL Query Execution in Azure Data Lake and provides several Performance Tuning tips: What tools are available and some best practices.
Building a semantic/metrics layer using CalciteJulian Hyde
A semantic layer, also known as a metrics layer, lies between business users and the database, and lets those users compose queries in the concepts that they understand. It also governs access to the data, manages data transformations, and can tune the database by defining materializations.
Like many new ideas, the semantic layer is a distillation and evolution of many old ideas, such as query languages, multidimensional OLAP, and query federation.
In this talk, we describe the features we are adding to Calcite to define business views, query measures, and optimize performance.
A talk given at Community over Code, the annual conference of the Apache Software Foundation, in Halifax, NS, on 9th October, 2023.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support metrics (which we call ‘measures’), filter context, and analytic expressions. With these concepts you can define data models (which we call Analytic Views) that contain metrics, use them in queries, and define new metrics in queries.
In this talk by the original developer of Apache Calcite, we describe the SQL syntax extensions for metrics,
and how to use them for cross-dimensional calculations such as period-over-period, percent-of-total,
non-additive and semi-additive measures. We describe how we got around fundamental limitations in SQL
semantics, and approaches for optimizing queries that use metrics.
A talk given by Julian Hyde at Data Council, Austin, TX, on March 29, 2023.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support measures, evaluation context, and multidimensional expressions. With these concepts you can define data models that contain measures, use them in queries, and define new measures in queries.
A talk given by Julian Hyde to Apache Calcite's virtual meetup, 2023-03-15.
Morel, a data-parallel programming languageJulian Hyde
What would the perfect data-parallel programming language look like? It would be as expressive as a general-purpose functional programming language, as powerful and concise as SQL, and run programs just as efficiently on a laptop or a thousand-node cluster.
We present Morel, a functional programming language with relational extensions, working towards that goal. Morel is implemented in the Apache Calcite community on top of Calcite’s relational algebra framework. In this talk, we describe Morel’s evolution, including how we are pushing Calcite’s capabilities with graph and recursive queries.
A talk given by Julian Hyde at ApacheCon, New Orleans, October 4th 2022.
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
The evolution of Apache Calcite and its CommunityJulian Hyde
Apache Calcite is an open source framework for building databases, and includes a SQL parser, relational algebra, and a highly extensible query optimizer.
It has achieved wide adoption, used in many commercial products, open source projects, and as a test bed for computer science research.
But there is a bootstrap problem: If software is written by a community of contributors, and each contributor acts in their own self-interest, how do you get the first working version of the product? The answer is in the story of how the technology evolved, and how the community evolved with it, and in this talk we tell that story.
The Hop project entered Apache Software Foundation as an Incubator project in 2020, and Julian Hyde, one of their mentors, gave this presentation to educate the initial committers on the Apache Way and what to expect during the Incubation process.
The talk was given by Julian Hyde on October 1st, 2020, with the original title "Apache Incubation - What's it all about?"
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.
A talk given by Eli Levine and Julian Hyde at Apache: Big Data, Miami, on May 17th, 2017.
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given Julian Hyde at Apache Big Data conference, Vancouver, on 2016/05/09.
Streaming is necessary to handle data rates and latency but SQL is unquestionably the lingua franca of data. Where do the two meet?
Apache Calcite is extending SQL to include streaming, and the Samza, Storm and Flink are projects are each building it into their engines. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at the first Kafka Summit, San Francisco, 2016/04/26.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Spatial Query on Vanilla Databases
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like
r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people
are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as
HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements
much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes.
Calcite is bringing spatial query to the masses!
5. Apache Calcite
Apache top-level project since 2015
Query planning framework used in many
projects and products
Also works standalone: embedded federated
query engine with SQL / JDBC front end
Apache community development model
https://calcite.apache.org
https://github.com/apache/calcite
6. SELECT d.name, COUNT(*) AS c
FROM Emps AS e
JOIN Depts AS d USING (deptno)
WHERE e.age < 40
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Relational algebra
Based on set theory, plus operators:
Project, Filter, Aggregate, Union, Join,
Sort
Requires: declarative language (SQL),
query planner
Original goal: data independence
Enables: query optimization, new
algorithms and data structures
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
7. SELECT d.name, COUNT(*) AS c
FROM (SELECT * FROM Emps
WHERE e.age < 40) AS e
JOIN Depts AS d USING (deptno)
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Algebraic rewrite
Optimize by applying rewrite rules that
preserve semantics
Hopefully the result is less expensive;
but it’s OK if it’s not (planner keeps
“before” and “after”)
Planner uses dynamic programming,
seeking the lowest total cost
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
9. A spatial query
Find all restaurants within 1.5 distance units of
my location (6, 7)
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
10. A spatial query
Find all restaurants within 1.5 distance units of
my location (6, 7)
Using OpenGIS SQL extensions:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
11. Simple implementation
Using ESRI’s geometry-api-java library,
almost all ST_ functions were easy to
implement in Calcite.
Slow – one row at a time.
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
package org.apache.calcite.runtime;
import com.esri.core.geometry.*;
/** Simple implementations of built-in geospatial functions. */
public class GeoFunctions {
/** Returns the distance between g1 and g2. */
public static double ST_Distance(Geom g1, Geom g2) {
return GeometryEngine.distance(g1.g(), g2.g(), g1.sr());
}
/** Constructs a 2D point from coordinates. */
public static Geom ST_MakePoint(double x, double y) {
final Geometry g = new Point(x, y);
return new SimpleGeom(g);
}
/** Geometry. It may or may not have a spatial reference
* associated with it. */
public interface Geom {
Geometry g();
SpatialReference sr();
Geom transform(int srid);
Geom wrap(Geometry g);
}
static class SimpleGeom implements Geom { … }
}
12. Traditional DB indexing
techniques don’t work
Sort
Hash
CREATE /* b-tree */ INDEX
I_Restaurants
ON Restaurants(x, y);
CREATE TABLE Restaurants(
restaurant VARCHAR(20),
x INTEGER,
y INTEGER)
PARTITION BY (MOD(x + 5279 * y, 1024));
•
•
•
•
A scan over a two-dimensional index only
has locality in one dimension
14. Spatial data structures and algorithms
The challenge: Reduce dimensionality while preserving locality
● Reduce dimensionality – We want to warp the information space so that
we can access on one composite attribute rather than several
● Preserve locality – If two items are close in 2D, we want them to be close
in the information space (and in the same cache line or disk block)
Two main approaches to spatial data structures:
● Data-oriented
● Space-oriented
26. Spatial query
Find all restaurants within 1.5 distance units of
where I am:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
27. Hilbert space-filling curve
● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together
28. •
•
•
•
Add restriction based on h, a restaurant’s distance
along the Hilbert curve
Must keep original restriction due to false positives
Using Hilbert index
restaurant x y h
Zachary’s pizza 3 1 5
King Yen 7 7 41
Filippo’s 7 4 52
Station burger 5 6 36
Zachary’s
pizza
Filippo’s
SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
OR r.h BETWEEN 46 AND 46)
AND ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
King
Yen
Station
burger
29. Telling the optimizer
1. Declare h as a generated column
2. Sort table by h
Planner can now convert spatial range
queries into a range scan
Does not require specialized spatial
index such as r-tree
Very efficient on a sorted table such as
HBase
CREATE TABLE Restaurants (
restaurant VARCHAR(20),
x DOUBLE,
y DOUBLE,
h DOUBLE GENERATED ALWAYS AS
ST_Hilbert(x, y) STORED)
SORT KEY (h);
restaurant x y h
Zachary’s pizza 3 1 5
Station burger 5 6 36
King Yen 7 7 41
Filippo’s 7 4 52
30. Algebraic rewrite
Scan [T]
Filter [ST_Distance(
ST_Point(T.X, T.Y),
ST_Point(x, y)) < d]
Scan [T]
Filter [(T.H BETWEEN h0 AND h1
OR T.H BETWEEN h2 AND h3)
AND ST_Distance(
ST_Point(T.X, T.Y),
ST_Point(x, y)) < d]
Constraint: Table T has a
column H such that:
H = Hilbert(X, Y)
FilterHilbertRule
x, y, d, hi
– constants
T – table
T.X, T.Y, T.H – columns
31. Variations on a theme
Several ways to say the same thing using OpenGIS functions:
● ST_Distance(ST_Point(X, Y), ST_Point(x, y)) < d
● ST_Distance(ST_Point(x, y), ST_Point(X, Y)) < d
● ST_DWithin(ST_Point(x, y), ST_Point(X, Y), d)
● ST_Contains(ST_Buffer(ST_Point(x, y), d), ST_Point(X, Y))
Other patterns can use Hilbert functions:
● ST_DWithin(ST_MakeLine(ST_Point(x1, y1), ST_Point(x2, y2)),
ST_Point(X, Y), d)
● ST_Contains(ST_PolyFromText('POLYGON((0 0,20 0,20 20,0 20,0 0))'),
ST_Point(X, Y), d)
32. More spatial queries
What state am I in? (1-point-to-1-polygon)
Which states does Yellowstone NP intersect?
(1-polygon-to-many-polygons)
Which US national park intersects with the most
states? (many-polygons-to-many-polygons,
followed by sort/limit)
33. More spatial queries
What state am I in? (point-to-polygon)
Which states does Yellowstone NP intersect?
(polygon-to-polygon)
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_MakePoint(6, 7))
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_GeomFromText('LINESTRING(...)'))
34. Tile index
Idaho
Montana
Nevada Utah Colorado
Wyoming
We cannot use space-filling curves, because
each region (state or park) is a set of points and
not known as planning time.
Divide regions into a (coarse) set of tiles. They
intersect only if some of their tiles intersect.
36. Aside: Materialized views
CREATE MATERIALIZED
VIEW EmpSummary AS
SELECT deptno, COUNT(*) AS c
FROM Emp
GROUP BY deptno;
Scan [Emps]
Scan
[EmpSummary]
Aggregate [deptno, count(*)]
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno c
10 2
20 1
30 1
A materialized view is a table
that is defined by a query
The planner knows about the
mapping and can
transparently rewrite queries
to use it
37. Building the tile index
Use the ST_MakeGrid function to decompose each
state into a series of tiles
Store the results in a table, sorted by tile id
A materialized view is a table that remembers how
it was computed, so the planner can rewrite queries
to use it
CREATE MATERIALIZED VIEW StateTiles AS
SELECT s.stateId, t.tileId
FROM States AS s,
LATERAL TABLE(ST_MakeGrid(s.geometry, 4, 4)) AS t
6 7 8
3 4 5
0 1 2
Idaho
Montana
Nevada Utah Colorado
Wyoming
38. Point-to-polygon query
What state am I in? (point-to-polygon)
1. Divide the plane into tiles, and pre-compute
the state-tile intersections
2. Use this ‘tile index’ to narrow list of states
SELECT s.*
FROM States AS s
WHERE s.stateId IN (SELECT stateId
FROM StateTiles AS t
WHERE t.tileId = 8)
AND ST_Intersects(s.geometry, ST_MakePoint(6, 7))
6 7 8
3 4 5
0 1 2
Idaho
Montana
Nevada Utah Colorado
Wyoming
39. Algebraic rewrite
Scan [S]
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)]
SemiJoin [S.stateId = T.stateId]
Constraint #1: There is a table “Tiles” defined by
SELECT s.stateId, t.tileId FROM States AS s,
LATERAL TABLE(ST_MakeGrid(s.geometry, x, y)) AS t
Constraint #2: stateId is primary key of S
TileSemiJoinRule
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)]
Scan [S] Filter [T.tileId = 8]
Scan [T]
40. Streaming + spatial
Example query: Every minute, emit the number of journeys that have intersected
each city. (Some journeys intersect multiple cities.)
(Efficient implementation is left as an exercise to the reader. Probably involves
splitting journeys into tiles, partitioning by tile hash-code, intersecting with cities
in those tiles, then rolling up cities.)
SELECT STREAM c.name, COUNT(*)
FROM Journeys AS j
CROSS JOIN Cities AS c
ON ST_Intersects(c.geometry, j.geometry)
GROUP BY c.name, FLOOR(j.rowtime TO HOUR)
41. Summary
Traditional DB techniques (sort, hash) don’t work for 2-dimensional data
Spatial presents tough design choices:
● Space-oriented vs data-oriented algorithms
● General-purpose vs specialized data structures
Relational algebra unifies traditional and spatial:
● Use general-purpose structures
● Compose techniques (transactions, analytics, spatial, streaming)
● Must use space-oriented algorithms, because their dimensionality-reducing
mapping is known at planning time
42. Thank you! Questions?
@ApacheCalcite | @julianhyde | https://calcite.apache.org
Resources & credits
● [CALCITE-1616] Data profiler
● [CALCITE-1870] Lattice suggester
● [CALCITE-1861] Spatial indexes
● [CALCITE-1968] OpenGIS
● [CALCITE-1991] Generated columns
● Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017)
● Talk: “SQL on everything, in memory” (Strata, 2014)
● Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial
Objects”
● Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes
efficiently”
● https://www.census.gov/geo/maps-data/maps/2000popdistribution.html
● https://www.nasa.gov/mission_pages/NPP/news/earth-at-night.html
46. Planning queries
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Table: splunk
47. Optimized query
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc