Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde

This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix integration.
3. What is Apache Phoenix?
• A relational database layer for Apache HBase
– Query engine
• Transforms SQL queries into native HBase API calls
• Pushes as much work as possible onto the cluster for parallel execution
– Metadata repository
• Typed access to data stored in HBase tables
– Transaction support
– Table Statistics
– A JDBC driver
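Applications reach all of this through standard JDBC. The following is a minimal sketch, assuming a Phoenix client on the classpath, a ZooKeeper quorum on localhost, and the Emps table defined in the next example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcExample {
  public static void main(String[] args) throws Exception {
    // "localhost" is the ZooKeeper quorum that locates the HBase cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement();
         // Phoenix compiles this SQL into native HBase scans, pushing the
         // WHERE clause to the region servers for parallel execution.
         ResultSet rs = stmt.executeQuery(
             "SELECT empId, name FROM Emps WHERE empId > 100")) {
      while (rs.next()) {
        System.out.println(rs.getInt("empId") + "\t" + rs.getString("name"));
      }
    }
  }
}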
4. Advanced Features
• Secondary indexes
• Strong SQL standard compliance
• Windowed aggregates
• Connectivity (e.g. remote JDBC driver, ODBC driver)
These features created architectural pain… so we decided to do it right!
5. Example 1: Optimizing Secondary Indexes
How we match secondary indexes in Phoenix 4.8:

CREATE TABLE Emps(empId INT PRIMARY KEY, name VARCHAR(100));
CREATE INDEX I_Emps_Name ON Emps(name);

Q1: SELECT * FROM Emps ORDER BY name → use I_Emps_Name
Q2: SELECT * FROM Emps WHERE empId > 100 → use Emps
Q3: SELECT * FROM Emps WHERE empId > 100 ORDER BY name → ?

What about both? We need to make a cost-based decision! Statistics can help.
6. Phoenix + Calcite
• Both are Apache projects
• Involves changes to both projects
• Work is being done on a branch of Phoenix, with changes to Calcite as needed
• Goals:
– Remove code! (Use Calcite’s SQL parser, validator)
– Improve planning (Faster planning, faster queries)
– Improve SQL compliance
– Some “free” SQL features (e.g. WITH, scalar subquery, FILTER)
– Close to full compatibility with current Phoenix SQL and APIs
• Status: beta, expected GA: late 2016
9. Phoenix + Calcite Architecture
[Architecture diagram] SQL enters through the Parser and becomes relational Algebra. Planning combines the Phoenix Schema, Logical + Phoenix Operators, Builtin + Phoenix Rules, Phoenix Statistics, and the Phoenix Cost model into a Query Plan. The Phoenix Runtime executes the plan against HBase Data, with optional JDBC and other data sources.
10. Cost-based Query Optimizer with Apache Calcite
• Base all query optimization decisions on cost
– Filter push down; range scan vs. skip scan
– Hash aggregate vs. stream aggregate vs. partial stream aggregate
– Sort optimized out; sort/limit push through; fwd/rev/unordered scan
– Hash join vs. merge join; join ordering
– Use of data table vs. index table
– All of the above (and many others) COMBINED
• Query optimizations are modeled as pluggable rules
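To make "pluggable rules" concrete, here is a minimal sketch using Calcite's classic RelOptRule API. The rule itself — dropping a filter whose predicate is literally TRUE — is invented for illustration; Calcite ships a large library of built-in rules that follow this same shape:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.logical.LogicalFilter;

/** Illustrative rule: a Filter whose condition is TRUE is a no-op. */
public class TrueFilterRemoveRule extends RelOptRule {
  public static final TrueFilterRemoveRule INSTANCE = new TrueFilterRemoveRule();

  private TrueFilterRemoveRule() {
    // The operand describes the sub-graph to match: any LogicalFilter.
    super(operand(LogicalFilter.class, any()));
  }

  @Override public void onMatch(RelOptRuleCall call) {
    LogicalFilter filter = call.rel(0);
    if (filter.getCondition().isAlwaysTrue()) {
      // Register the filter's input as an equivalent expression; the
      // planner keeps both alternatives and picks the cheaper one by cost.
      call.transformTo(filter.getInput());
    }
  }
}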
11. Calcite Algebra
SELECT products.name, COUNT(*)
FROM sales
JOIN products USING (productId)
WHERE sales.discount IS NOT NULL
GROUP BY products.name
ORDER BY COUNT(*) DESC
translate SQL to relational algebra:

sort
  aggregate
    filter
      join
        scan [sales]
        scan [products]
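The same algebra can be assembled programmatically with Calcite's RelBuilder. A sketch, assuming a FrameworkConfig whose default schema exposes sales and products tables with the columns used above:

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;

public class AlgebraExample {
  /** Builds sort(aggregate(filter(join(scan sales, scan products)))). */
  static RelNode buildPlan(FrameworkConfig config) {
    RelBuilder b = RelBuilder.create(config);
    return b
        .scan("sales")
        .scan("products")
        .join(JoinRelType.INNER, "productId")      // JOIN ... USING (productId)
        .filter(                                   // WHERE discount IS NOT NULL
            b.call(SqlStdOperatorTable.IS_NOT_NULL, b.field("discount")))
        .aggregate(                                // GROUP BY products.name
            b.groupKey("name"), b.countStar("c"))  // COUNT(*)
        .sort(b.desc(b.field("c")))                // ORDER BY COUNT(*) DESC
        .build();
  }
}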
12. Example 2: FilterIntoJoinRule
SELECT products.name, COUNT(*)
FROM sales
JOIN products USING (productId)
WHERE sales.discount IS NOT NULL
GROUP BY products.name
ORDER BY COUNT(*) DESC
translate SQL to relational algebra:

sort
  aggregate
    filter
      join
        scan [sales]
        scan [products]

FilterIntoJoinRule pushes the filter below the join, into the input it refers to:

sort
  aggregate
    join’
      filter’
        scan [sales]
      scan [products]
13. Example 3: Phoenix Joins
• Hash join vs. Sort merge join
– Hash join good for: either input is small
– Sort merge join good for: both inputs are big
– Hash join downside: potential OOM
– Sort merge join downside: extra sorting required sometimes
• Better to exploit the sortedness of join input
• Better to exploit the sortedness of join output
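A back-of-envelope comparison makes the trade-off concrete. The formulas below are illustrative assumptions, not Phoenix's or Calcite's actual cost model:

public class JoinCostSketch {
  // Hash join: build a hash table on the smaller input, stream the other.
  // Downside (not modeled here): the build side must fit in memory.
  static double hashJoinCost(double leftRows, double rightRows) {
    return Math.min(leftRows, rightRows) + leftRows + rightRows;
  }

  // Merge join: a single merge pass, plus an n*log(n) sort for any input
  // that is not already sorted on the join key.
  static double mergeJoinCost(double leftRows, double rightRows,
      boolean leftSorted, boolean rightSorted) {
    double cost = leftRows + rightRows;
    if (!leftSorted)  cost += leftRows * (Math.log(leftRows) / Math.log(2));
    if (!rightSorted) cost += rightRows * (Math.log(rightRows) / Math.log(2));
    return cost;
  }

  public static void main(String[] args) {
    // One small input: hash join wins.
    System.out.println(hashJoinCost(1_000_000, 100)
        < mergeJoinCost(1_000_000, 100, false, true));   // true
    // Both inputs big and already sorted: merge join wins.
    System.out.println(mergeJoinCost(1_000_000, 1_000_000, true, true)
        < hashJoinCost(1_000_000, 1_000_000));           // true
  }
}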
14. Example 3: Calcite Algebra
SELECT empid, e.name, d.name, location
FROM emps AS e
JOIN depts AS d USING (deptno)
ORDER BY d.deptno
translate SQL to relational algebra:

project
  sort
    join
      scan [emps]
      scan [depts]
15. Example 3: Plan Candidates
Candidate 1: hash-join (also what the standalone Phoenix compiler would generate):

project
  sort
    hash-join
      scan [emps]
      scan [depts]

Candidate 2: merge-join. The scan of depts is already sorted on [deptno] and only emps needs a sort, on [e.deptno]; the merge-join output is sorted on [deptno], so SortRemoveRule eliminates the top-level sort:

project
  merge-join
    sort
      scan [emps]
    scan [depts]

1. Very little difference in all other operators: project, scan, hash-join or merge-join
2. Candidate 1 would sort “emps join depts”, while candidate 2 would only sort “emps” → candidate 2 wins
16. Example 3: Improved Plan
Old vs. New

Old plan (hash join):
– scan ‘depts’; send ‘depts’ over to RS & build hash-cache
– scan ‘emps’; hash-join ‘depts’
– sort joined table on ‘e.deptno’

New plan (merge join):
– scan ‘emps’; sort by ‘deptno’
– scan ‘depts’
– merge-join ‘emps’ and ‘depts’

1. Exploited the sortedness of join input
2. Exploited the sortedness of join output
21. Calcite Planning Process
SQL → parse tree → [Sql-to-Rel Converter: SqlNode → RelNode + RexNode] → Planner → best RelNode graph → translate to runtime logical plan

Based on the “Volcano” & “Cascades” papers [G. Graefe].

1. Plan Graph
– Node for each node in the input plan
– Each node is a set of alternate sub-plans
– Sets further divided into subsets, based on traits like sortedness

2. Rules
– A rule specifies an operator sub-graph to match and logic to generate an equivalent, ‘better’ sub-graph
– New and original sub-graphs both remain in contention

3. Cost Model
– RelNodes have cost & cumulative cost

4. Metadata Providers
– Used to plug in schema and cost formulas
– Filter selectivity, join selectivity, NDV calculations

Rule match queue:
– Add rule matches to the queue
– Apply rule-match transformations to the plan graph
– Iterate for a fixed number of iterations or until the cost doesn’t change
– Match importance is based on the cost of the RelNode and its height
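The pipeline above maps directly onto Calcite's Planner API. A sketch, assuming a SchemaPlus that exposes the tables referenced by the query:

import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class PlanningPipeline {
  /** SQL -> parse tree -> validated tree -> RelNode graph. */
  static RelRoot sqlToRel(SchemaPlus schema, String sql) throws Exception {
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(schema)
        .build();
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parseTree = planner.parse(sql);          // SQL -> SqlNode
    SqlNode validated = planner.validate(parseTree); // resolve names & types
    return planner.rel(validated);                   // SqlNode -> RelNode + RexNode
  }
}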
22. Views and materialized views
• A view is a named relational expression, stored in the catalog, that is expanded while planning a query.
• A materialized view is an equivalence, stored in the catalog, between a table and a relational expression. The planner substitutes the table into queries where it will help, even if the queries do not reference the materialized view.
23. Query using a view
CREATE VIEW Managers AS
SELECT *
FROM Emps
WHERE EXISTS (
  SELECT *
  FROM Emps AS underling
  WHERE underling.manager = emp.id)

SELECT deptno, min(salary)
FROM Managers
WHERE age >= 50
GROUP BY deptno

Query plan, with the view scan to be expanded:

Aggregate [deptno, min(salary)]
  Filter [age >= 50]
    Scan [Managers]

The view’s definition as algebra:

Project [$0, $1, $2, $3]
  Join [$0, $5]
    Scan [Emps]
    Aggregate [manager]
      Scan [Emps]
24. After view expansion
(View definition and query as on the previous slide.)

After expansion, the filter can be pushed down:

Aggregate [deptno, min(salary)]
  Filter [age >= 50]
    Project [$0, $1, $2, $3]
      Join [$0, $5]
        Scan [Emps]
        Aggregate [manager]
          Scan [Emps]
25. Materialized view
CREATE MATERIALIZED VIEW EmpSummary AS
SELECT deptno,
  gender,
  COUNT(*) AS c,
  SUM(sal) AS s
FROM Emps
GROUP BY deptno, gender

The catalog stores the equivalence:

Scan [EmpSummary]
=
Aggregate [deptno, gender, COUNT(*), SUM(sal)]
  Scan [Emps]

Query to plan:

SELECT COUNT(*)
FROM Emps
WHERE deptno = 10
AND gender = ‘M’

Aggregate [COUNT(*)]
  Filter [deptno = 10 AND gender = ‘M’]
    Scan [Emps]
26. Materialized view, step 2: Rewrite query to match

(EmpSummary definition, equivalence and query as on the previous slide.)

The query’s aggregate is rewritten to match the materialized view’s, pulling the filter above it:

Project [c]
  Filter [deptno = 10 AND gender = ‘M’]
    Aggregate [deptno, gender, COUNT(*) AS c, SUM(sal) AS s]
      Scan [Emps]
27. Materialized view, step 3: Substitute table
(EmpSummary definition, equivalence and query as before.)

The planner substitutes the table for the matching sub-plan:

Project [c]
  Filter [deptno = 10 AND gender = ‘M’]
    Scan [EmpSummary]
29. Example 1, Revisited: Secondary Index
The optimizer internally creates a mapping (query, table) equivalent to:

CREATE MATERIALIZED VIEW I_Emp_Deptno AS
SELECT deptno, empno, name
FROM Emps
ORDER BY deptno

Scan [I_Emp_Deptno]
=
Sort [deptno, empno, name]
  Project [deptno, empno, name]
    Scan [Emps]

Original plan:

Sort [deptno]
  Project [deptno, name]
    Filter [deptno BETWEEN 100 and 150]
      Scan [Emps]

Rewritten plan:

Project [deptno, name]
  Filter [deptno BETWEEN 100 and 150]
    Scan [I_Emp_Deptno]

A very simple cost based on row-count (slide annotations: 1,000; 1,000; 200; 1,600 vs. 1,000; 1,000; 200) picks the rewritten plan: reading the index, which is already sorted on deptno, eliminates the sort.
30. Beyond Phoenix 4.8 with Apache Calcite
• Get the missing SQL support
– WITH, UNNEST, scalar subquery, etc.
• Materialized views
– To allow other forms of indices (maybe defined as external), e.g., a filter view, a join view, or an aggregate view
• Interop with other Calcite adapters
– Already used by Drill, Hive, Kylin, Samza, etc.
– Supports any JDBC source (see the sketch after this list)
– Initial version of Drill-Phoenix integration already working
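As a sketch of the “any JDBC source” point, Calcite's JDBC adapter can mount an external database as a schema next to others; the PostgreSQL URL, credentials, and the hr.emps table below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class JdbcInteropExample {
  public static void main(String[] args) throws Exception {
    Connection connection = DriverManager.getConnection("jdbc:calcite:");
    CalciteConnection calcite = connection.unwrap(CalciteConnection.class);
    SchemaPlus root = calcite.getRootSchema();
    // Mount an external JDBC database as the "hr" schema.
    DataSource ds = JdbcSchema.dataSource(
        "jdbc:postgresql://localhost/hr", "org.postgresql.Driver", "user", "pass");
    root.add("hr", JdbcSchema.create(root, "hr", ds, null, null));
    // Queries over hr.* are planned by Calcite; supported operators are
    // pushed down to the source as SQL.
    try (Statement stmt = connection.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM hr.emps")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}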
31. Drillix: Interoperability with Apache Drill
SELECT deptno, sum(salary) FROM emps GROUP BY deptno

Drill Aggregate [deptno, sum(salary)]        ← Stage 3: Final aggregation
  Drill Shuffle [deptno]                     ← Stage 2: Shuffle partial results
    Phoenix Aggregate [deptno, sum(salary)]  ← Stage 1: Local partial aggregation
      Phoenix TableScan [emps]
        (Phoenix tables on HBase)