A semantic layer, also known as a metrics layer, lies between business users and the database, and lets those users compose queries in the concepts that they understand. It also governs access to the data, manages data transformations, and can tune the database by defining materializations.
Like many new ideas, the semantic layer is a distillation and evolution of many old ideas, such as query languages, multidimensional OLAP, and query federation.
In this talk, we describe the features we are adding to Calcite to define business views, query measures, and optimize performance.
A talk given at Community over Code, the annual conference of the Apache Software Foundation, in Halifax, NS, on 9th October, 2023.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support measures, evaluation context, and multidimensional expressions. With these concepts you can define data models that contain measures, use them in queries, and define new measures in queries.
A talk given by Julian Hyde to Apache Calcite's virtual meetup, 2023-03-15.
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over, today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to analyze instantly data from IoT, sensors, transportation systems, Logs, REST endpoints, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trio = fast analytics at scale
We present our solution for building an AI Architecture that provides engineering teams the ability to leverage data to drive insight and help our customers solve their problems. We started with siloed data, entities that were described differently by each product, different formats, complicated security and access schemes, data spread over numerous locations and systems.
Batch Determination Based Delivery ATP and Auto Delivery Quantity AdjustmentVijay Pisipaty
This solution provides an ability, in SAP Delivery Creation, for Batch Determination Based Delivery ATP and Delivery Quantity adjustment to sum of batch split quantities.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support measures, evaluation context, and multidimensional expressions. With these concepts you can define data models that contain measures, use them in queries, and define new measures in queries.
A talk given by Julian Hyde to Apache Calcite's virtual meetup, 2023-03-15.
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over, today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to analyze instantly data from IoT, sensors, transportation systems, Logs, REST endpoints, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trio = fast analytics at scale
We present our solution for building an AI Architecture that provides engineering teams the ability to leverage data to drive insight and help our customers solve their problems. We started with siloed data, entities that were described differently by each product, different formats, complicated security and access schemes, data spread over numerous locations and systems.
Batch Determination Based Delivery ATP and Auto Delivery Quantity AdjustmentVijay Pisipaty
This solution provides an ability, in SAP Delivery Creation, for Batch Determination Based Delivery ATP and Delivery Quantity adjustment to sum of batch split quantities.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application.
https://fosdem.org/2017/schedule/event/hpc_bigdata_calcite/
When working with BigData & IoT systems we often feel the need for a Common Query Language. The platform specific languages are often harder to integrate with and require longer adoption time.
To fill this gap many NoSql (Not-only-Sql) vendors are building SQL layers for their platforms. It is worth exploring the driving forces behind this trend, how it fits in your BigData stacks and how we can adopt it in our favorite tools. However building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your big data system.
Calcite has been used to empower many Big-Data platforms such as Hive, Spark, Drill Phoenix to name some.
I will walk you through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
As more and more people are moving to PostgreSQL from Oracle, a pattern of mistakes is emerging. They can be caused by the tools being used or just not understanding how PostgreSQL is different than Oracle. In this talk we will discuss the top mistakes people generally make when moving to PostgreSQL from Oracle and what the correct course of action.
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>.
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenario. This is ideal for a variety of write-once and read-many datasets at Bytedance.
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
OData presentation organized in ITC Hub Pancevo, Serbia, 10. Feb. 2018. https://www.meetup.com/Web-Development-Pancevo/events/247493392/. OData is enhancement of classic REST API concept that adds querying capabilities.
Frances Jones: "Tableau vs PowerBI"
THE SHOWDOWN! Two of the leading tools for data visualization come head-to-head in this presentation from Frances Jones. Frances is a leading expert and master in displaying quantitative information and has had many years of expertise using both of these tools to wow her clients. Come to hear her perspective on the strengths and weaknesses of both tools, as well as plenty of handy tips that will help you improve your viz skills.
About the Speaker:
Frances is a vivacious lover of data.
Big data consultant by day, data communicator & human behavioral enthusiast by night! She has had many years of experience across many visualization tools and platforms and is constantly pushing the boundaries on what is possible in this space.
Deep Dive into the New Features of Apache Spark 3.0Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other major initiatives that are coming in the future.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.
To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your NoSql system.
We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application.
https://fosdem.org/2017/schedule/event/hpc_bigdata_calcite/
When working with BigData & IoT systems we often feel the need for a Common Query Language. The platform specific languages are often harder to integrate with and require longer adoption time.
To fill this gap many NoSql (Not-only-Sql) vendors are building SQL layers for their platforms. It is worth exploring the driving forces behind this trend, how it fits in your BigData stacks and how we can adopt it in our favorite tools. However building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your big data system.
Calcite has been used to empower many Big-Data platforms such as Hive, Spark, Drill Phoenix to name some.
I will walk you through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
As more and more people are moving to PostgreSQL from Oracle, a pattern of mistakes is emerging. They can be caused by the tools being used or just not understanding how PostgreSQL is different than Oracle. In this talk we will discuss the top mistakes people generally make when moving to PostgreSQL from Oracle and what the correct course of action.
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>.
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating Shuffle in Join or group-by-aggregate scenario. This is ideal for a variety of write-once and read-many datasets at Bytedance.
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
OData presentation organized in ITC Hub Pancevo, Serbia, 10. Feb. 2018. https://www.meetup.com/Web-Development-Pancevo/events/247493392/. OData is enhancement of classic REST API concept that adds querying capabilities.
Frances Jones: "Tableau vs PowerBI"
THE SHOWDOWN! Two of the leading tools for data visualization come head-to-head in this presentation from Frances Jones. Frances is a leading expert and master in displaying quantitative information and has had many years of expertise using both of these tools to wow her clients. Come to hear her perspective on the strengths and weaknesses of both tools, as well as plenty of handy tips that will help you improve your viz skills.
About the Speaker:
Frances is a vivacious lover of data.
Big data consultant by day, data communicator & human behavioral enthusiast by night! She has had many years of experience across many visualization tools and platforms and is constantly pushing the boundaries on what is possible in this space.
Deep Dive into the New Features of Apache Spark 3.0Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other major initiatives that are coming in the future.
Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.
To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your NoSql system.
We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support metrics (which we call ‘measures’), filter context, and analytic expressions. With these concepts you can define data models (which we call Analytic Views) that contain metrics, use them in queries, and define new metrics in queries.
In this talk by the original developer of Apache Calcite, we describe the SQL syntax extensions for metrics,
and how to use them for cross-dimensional calculations such as period-over-period, percent-of-total,
non-additive and semi-additive measures. We describe how we got around fundamental limitations in SQL
semantics, and approaches for optimizing queries that use metrics.
A talk given by Julian Hyde at Data Council, Austin, TX, on March 29, 2023.
Company segmentation - an approach with RCasper Crause
We classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors and have similar attributes).
Morel, a data-parallel programming languageJulian Hyde
What would the perfect data-parallel programming language look like? It would be as expressive as a general-purpose functional programming language, as powerful and concise as SQL, and run programs just as efficiently on a laptop or a thousand-node cluster.
We present Morel, a functional programming language with relational extensions, working towards that goal. Morel is implemented in the Apache Calcite community on top of Calcite’s relational algebra framework. In this talk, we describe Morel’s evolution, including how we are pushing Calcite’s capabilities with graph and recursive queries.
A talk given by Julian Hyde at ApacheCon, New Orleans, October 4th 2022.
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
The evolution of Apache Calcite and its CommunityJulian Hyde
Apache Calcite is an open source framework for building databases, and includes a SQL parser, relational algebra, and a highly extensible query optimizer.
It has achieved wide adoption, used in many commercial products, open source projects, and as a test bed for computer science research.
But there is a bootstrap problem: If software is written by a community of contributors, and each contributor acts in their own self-interest, how do you get the first working version of the product? The answer is in the story of how the technology evolved, and how the community evolved with it, and in this talk we tell that story.
The Hop project entered Apache Software Foundation as an Incubator project in 2020, and Julian Hyde, one of their mentors, gave this presentation to educate the initial committers on the Apache Way and what to expect during the Incubation process.
The talk was given by Julian Hyde on October 1st, 2020, with the original title "Apache Incubation - What's it all about?"
Efficient spatial queries on vanilla databasesJulian Hyde
A talk given by Julian Hyde at the Apache Calcite online meetup, 2021/01/20.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.
A talk given by Eli Levine and Julian Hyde at Apache: Big Data, Miami, on May 17th, 2017.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Streaming is necessary to handle IoT data rates and latency but SQL is unquestionably the lingua franca of data. Apache Samza and Apache Storm have new high-level query interfaces based on standard SQL with streaming extensions, both powered by Apache Calcite. Calcite's relational algebra allows query optimization and federation with data-at-rest in databases, memory, or HDFS.
A talk given by Julian Hyde at Hadoop Summit, San Jose, on 2016/06/29.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaYara Milbes
Discover the transformative power of the WhatsApp API in our latest SlideShare presentation, "Top 7 Unique WhatsApp API Benefits." In today's fast-paced digital era, effective communication is crucial for both personal and professional success. Whether you're a small business looking to enhance customer interactions or an individual seeking seamless communication with loved ones, the WhatsApp API offers robust capabilities that can significantly elevate your experience.
In this presentation, we delve into the top 7 distinctive benefits of the WhatsApp API, provided by the leading WhatsApp API service provider in Saudi Arabia. Learn how to streamline customer support, automate notifications, leverage rich media messaging, run scalable marketing campaigns, integrate secure payments, synchronize with CRM systems, and ensure enhanced security and privacy.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Building a semantic/metrics layer using Calcite
1. Building a semantic/metrics
layer using Apache Calcite
Julian Hyde (Calcite, Google)
Community over Code • Halifax, Nova Scotia • 2023-10-09
2. Abstract
A semantic layer, also known as a metrics layer, lies between business users and the
database, and lets those users compose queries in the concepts that they understand.
It also governs access to the data, manages data transformations, and can tune the
database by defining materializations.
Like many new ideas, the semantic layer is a distillation and evolution of many old ideas, such as query
languages, multidimensional OLAP, and query federation.
In this talk, we describe the features we are adding to Calcite to define business views, query measures, and
optimize performance.
Julian Hyde is the original developer of Apache Calcite, an open source framework for building data
management systems, and Morel, a functional query language. Previously he created Mondrian, an analytics
engine, and SQLstream, an engine for continuous queries. He is a staff engineer at Google, where he works on
Looker and BigQuery.
5. Data system = Model + Query + Engine
Query
language
Data
model
Engine
6. Agenda
1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
7. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
8. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
9. SQL vs BI
BI tools implement their own languages on top of SQL. Why not SQL?
Possible reasons:
● Semantic Model
● Control presentation / visualization
● Governance
● Pre-join tables
● Define reusable calculations
● Ask complex questions in a concise way
10. Processing BI in SQL
Why we should do it
● Move processing, not data
● Cloud SQL scale
● Remove data lag
● SQL is open
Why it’s hard
● Different paradigm
● More complex data model
● Can’t break SQL
15. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
16. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
17. Some multidimensional queries
● Give the total sales for each product in each quarter of 1995. (Note that quarter is a function of date).
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
● For each product give its market share in its category today minus its market share in its category in
October 1994.
● Select top 5 suppliers for each product category for last year, based on total sales.
● For each product category, select total sales this month of the product that had highest sales in that
category last month.
● Select suppliers that currently sell the highest selling product of last month.
● Select suppliers for which the total sale of every product increased in each of last 5 years.
● Select suppliers for which the total sale of every product category increased in each of last 5 years.
From [Agrawal1997]. Assumes a database with dimensions {supplier, date, product} and measure {sales}.)
18. Some multidimensional queries
● Give the total sales for each product in each quarter of 1995. (Note that quarter is a function of date).
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
● For each product give its market share in its category today minus its market share in its category in
October 1994.
● Select top 5 suppliers for each product category for last year, based on total sales.
● For each product category, select total sales this month of the product that had highest sales in that
category last month.
● Select suppliers that currently sell the highest selling product of last month.
● Select suppliers for which the total sale of every product increased in each of last 5 years
● Select suppliers for which the total sale of every product category increased in each of last 5 years.
From [Agrawal1997]. Assumes a database with dimensions {supplier, date, product} and measure {sales}.)
19. Query:
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
SQL MDX
SELECT p.prodId,
s95.sales,
(s95.sales - s94.sales) / s95.sales
FROM (
SELECT p.prodId, SUM(s.sales) AS sales
FROM Sales AS s
JOIN Suppliers AS u USING (suppId)
JOIN Products AS p USING (prodId)
WHERE u.name = ‘ACE’
AND FLOOR(s.date TO MONTH) = ‘1995-01-01’
GROUP BY p.prodId) AS s95
LEFT JOIN (
SELECT p.prodId, SUM(s.sales) AS sales
FROM Sales AS s
JOIN Suppliers AS u USING (suppId)
JOIN Products AS p USING (prodId)
WHERE u.name = ‘ACE’
AND FLOOR(s.date TO MONTH) = ‘1994-01-01’
GROUP BY p.prodId) AS s94
USING (prodId)
WITH MEMBER [Measures].[Sales Last Year] =
([Measures].[Sales],
ParallelPeriod([Date], 1, [Date].[Year]))
MEMBER [Measures].[Sales Growth] =
([Measures].[Sales]
- [Measures].[Sales Last Year])
/ [Measures].[Sales Last Year]
SELECT [Measures].[Sales Growth] ON COLUMNS,
[Product].Members ON ROWS
FROM [Sales]
WHERE [Supplier].[ACE]
20. Query:
● For supplier “Ace” and for each product, give the fractional increase in the sales in January 1995 relative to
the sales in January 1994.
SQL SQL with measures
SELECT p.prodId,
s95.sales,
(s95.sales - s94.sales) / s95.sales
FROM (
SELECT p.prodId, SUM(s.sales) AS sales
FROM Sales AS s
JOIN Suppliers AS u USING (suppId)
JOIN Products AS p USING (prodId)
WHERE u.name = ‘ACE’
AND FLOOR(s.date TO MONTH) = ‘1995-01-01’
GROUP BY p.prodId) AS s95
LEFT JOIN (
SELECT p.prodId, SUM(s.sales) AS sales
FROM Sales AS s
JOIN Suppliers AS u USING (suppId)
JOIN Products AS p USING (prodId)
WHERE u.name = ‘ACE’
AND FLOOR(s.date TO MONTH) = ‘1994-01-01’
GROUP BY p.prodId) AS s94
USING (prodId)
SELECT p.prodId,
SUM(s.sales) AS MEASURE sumSales,
sumSales AT (SET FLOOR(s.date TO MONTH)
= ‘1994-01-01’)
AS MEASURE sumSalesLastYear
FROM Sales AS s
JOIN Suppliers AS u USING (suppId)
JOIN Products AS p USING (prodId))
WHERE u.name = ‘ACE’
AND FLOOR(s.date TO MONTH) = ‘1995-01-01’
GROUP BY p.prodId
21. Self-joins, correlated subqueries, window aggregates, measures
Window aggregate functions were introduced to save on
self-joins.
Some DBs rewrite scalar subqueries and self-joins to
window aggregates [Zuzarte2003].
Window aggregates are more concise, easier to optimize,
and often more efficient.
However, window aggregates can only see data that is from
the same table, and is allowed by the WHERE clause.
Measures overcome that limitation.
SELECT *
FROM Employees AS e
WHERE sal > (
SELECT AVG(sal)
FROM Employees
WHERE deptno = e.deptno)
SELECT *
FROM Employees AS e
WHERE sal > AVG(sal)
OVER (PARTITION BY deptno)
22. A measure is… ?
… a column with an aggregate function. SUM(sales)
23. A measure is… ?
… a column with an aggregate function. SUM(sales)
… a column that, when used as an
expression, knows how to aggregate itself.
(SUM(sales) - SUM(cost))
/ SUM(sales)
24. A measure is… ?
… a column with an aggregate function. SUM(sales)
… a column that, when used as an
expression, knows how to aggregate itself.
(SUM(sales) - SUM(cost))
/ SUM(sales)
… a column that, when used as expression,
can evaluate itself in any context.
(SELECT SUM(forecastSales)
FROM SalesForecast AS s
WHERE predicate(s))
ExchService$ClosingRate(
‘USD’, ‘EUR’, sales.date)
25. A measure is…
… a column with an aggregate function. SUM(sales)
… a column that, when used as an
expression, knows how to aggregate itself.
(SUM(sales) - SUM(cost))
/ SUM(sales)
… a column that, when used as expression,
can evaluate itself in any context.
Its value depends on, and only on, the
predicate placed on its dimensions.
(SELECT SUM(forecastSales)
FROM SalesForecast AS s
WHERE predicate(s))
ExchService$ClosingRate(
‘USD’, ‘EUR’, sales.date)
26. SELECT MOD(deptno, 2) = 0 AS evenDeptno, avgSal2
FROM
WHERE deptno < 30
SELECT deptno, AVG(avgSal) AS avgSal2
FROM
GROUP BY deptno
Table model
Tables are SQL’s fundamental
model.
The model is closed – queries
consume and produce tables.
Tables are opaque – you can’t
deduce the type, structure or
private data of a table.
SELECT deptno, job,
AVG(sal) AS avgSal
FROM Employees
GROUP BY deptno, job
Employees2
Employees3
27. SELECT MOD(deptno, 2) = 0 AS evenDeptno, avgSal2
FROM
WHERE deptno < 30
SELECT deptno, AVG(avgSal) AS avgSal2
FROM
GROUP BY deptno
Table model
Tables are SQL’s fundamental
model.
The model is closed – queries
consume and produce tables.
Tables are opaque – you can’t
deduce the type, structure or
private data of a table.
SELECT deptno, job,
AVG(sal) AS avgSal
FROM Employees
GROUP BY deptno, job
28. SELECT e.deptno, e.job, d.dname, e.avgSal / e.deptAvgSal
FROM
AS e
JOIN Departments AS d USING (deptno)
WHERE d.dname <> ‘MARKETING’
GROUP BY deptno, job
We propose to allow any table and
query to have measure columns.
The model is closed – queries
consume and produce
tables-with-measures.
Tables-with-measures are
semi-opaque – you can’t deduce the
type, structure or private data, but
you can evaluate the measure in any
context that can be expressed as a
predicate on the measure’s
dimensions.
SELECT *,
avgSal AS MEASURE avgSal,
avgSal AT (CLEAR deptno) AS MEASURE deptAvgSal
FROM
Table model with measures
SELECT *,
AVG(sal) AS MEASURE avgSal
FROM Employees
AnalyticEmployees
AnalyticEmployees2
29. SELECT e.deptno, e.job, d.dname, e.avgSal / e.deptAvgSal
FROM
AS e
JOIN Departments AS d USING (deptno)
WHERE d.dname <> ‘MARKETING’
GROUP BY deptno, job
We propose to allow any table and
query to have measure columns.
The model is closed – queries
consume and produce
tables-with-measures.
Tables-with-measures are
semi-opaque – you can’t deduce the
type, structure or private data, but
you can evaluate the measure in any
context that can be expressed as a
predicate on the measure’s
dimensions.
SELECT *,
avgSal AS MEASURE avgSal,
avgSal AT (ALL deptno) AS MEASURE deptAvgSal
FROM
Table model with measures
SELECT *,
AVG(sal) AS MEASURE avgSal
FROM Employees
30. Syntax
expression AS MEASURE – defines a measure in the SELECT clause
AGGREGATE(measure) – evaluates a measure in a GROUP BY query
expression AT (contextModifier…) – evaluates expression in a modified context
contextModifier ::=
ALL
| ALL dimension [, dimension…]
| ALL EXCEPT dimension [, dimension…]
| SET dimension = [CURRENT] expression
| VISIBLE
aggFunction(aggFunction(expression) PER dimension) – multi-level aggregation
31. Plan of attack
1. Add measures to the table model, and allow queries to use them
◆ Measures are defined only via the Table API
2. Define measures using SQL expressions (AS MEASURE)
◆ You can still define them using the Table API
3. Context-sensitive expressions (AT)
32. Semantics
0. We have a measure M, value type V,
in a table T.
CREATE VIEW AnalyticEmployees AS
SELECT *, AVG(sal) AS MEASURE avgSal
FROM Employees
1. System defines a row type R with the
non-measure columns.
CREATE TYPE R AS
ROW (deptno: INTEGER, job: VARCHAR)
2. System defines an auxiliary function
for M. (Function is typically a scalar
subquery that references the measure’s
underlying table.)
CREATE FUNCTION computeAvgSal(
rowPredicate: FUNCTION<R, BOOLEAN>) =
(SELECT AVG(e.sal)
FROM Employees AS e
WHERE APPLY(rowPredicate, e))
33. Semantics (continued)
3. We have a query that uses M. SELECT deptno,
avgSal
/ avgSal AT (ALL deptno)
FROM AnalyticEmployees AS e
GROUP BY deptno
4. Substitute measure references with
calls to the auxiliary function with the
appropriate predicate
SELECT deptno,
computeAvgSal(r ->🠚(r.deptno = e.deptno))
/ computeAvgSal(r 🠚 TRUE))
FROM AnalyticEmployees AS e
GROUP BY deptno
5. Planner inlines computeAvgSal and
scalar subqueries
SELECT deptno, AVG(sal) / MIN(avgSal)
FROM (
SELECT deptno, sal,
AVG(sal) OVER () AS avgSal
FROM Employees)
GROUP BY deptno
34. Calculating at the right grain
Example Formula Grain
Computing the revenue from
units and unit price
units * pricePerUnit AS revenue Row
Sum of revenue (additive) SUM(revenue)
AS MEASURE sumRevenue
Top
Profit margin (non-additive) (SUM(revenue) - SUM(cost))
/ SUM(revenue)
AS MEASURE profitMargin
Top
Inventory (semi-additive) SUM(LAST_VALUE(unitsInStock)
PER inventoryDate)
AS MEASURE sumInventory
Intermediate
Daily average (weighted
average)
AVG(sumRevenue PER orderDate)
AS MEASURE dailyAvgRevenue
Intermediate
35. Subtotals & visible
SELECT deptno, job,
SUM(sal), sumSal
FROM (
SELECT *,
SUM(sal) AS MEASURE sumSal
FROM Employees)
WHERE job <> ‘ANALYST’
GROUP BY ROLLUP(deptno, job)
ORDER BY 1,2
deptno job SUM(sal) sumSal
10 CLERK 1,300 1,300
10 MANAGER 2,450 2,450
10 PRESIDENT 5,000 5,000
10 8,750 8,750
20 CLERK 1,900 1,900
20 MANAGER 2,975 2,975
20 4,875 10,875
30 CLERK 950 950
30 MANAGER 2,850 2,850
30 SALES 5,600 5,600
30 9,400 9,400
20,750 29,025
Measures by default sum ALL rows;
Aggregate functions sum only VISIBLE rows
36. Visible
Expression Example Which rows?
Aggregate function SUM(sal) Visible only
Measure sumSal All
AGGREGATE applied to measure AGGREGATE(sumSal) Visible only
Measure with VISIBLE sumSal AT (VISIBLE) Visible only
Measure with ALL sumSal AT (ALL) All
37. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
38. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
39. Forecasting
A forecast is simply a measure whose value at
some point in the future is determined, in some
manner, by a calculation on past data.
SELECT year(order_date), product, revenue,
forecast_revenue
FROM Orders
WHERE year(order_date) BETWEEN 2018 AND 2022
GROUP BY 1, 2
40. Forecasting: implementation
Problems
1. Predictive model under the forecast
(such as ARIMA or linear regression) is
probably too expensive to re-compute for
every query
2. We want to evaluate forecast for regions
for which there is not (yet) any data
Solutions
1. Amortize the cost of running the model
using (some kind of) materialized view
2. Add a SQL EXTEND operation to
implicitly generate data
SELECT year(order_date), product, revenue,
forecast_revenue
FROM Orders EXTEND (order_date)
WHERE year(order_date) BETWEEN 2021 AND 2025
GROUP BY 1, 2
41. Clustering
A clustering algorithm assigns data points to
regions of N-dimensional space called clusters such
that points that are in the same cluster are, by some
measure, close to each other and distant from points
in other clusters.
SELECT id, firstName, lastName, firstPurchaseDate,
latitude, longitude, revenue, region
FROM Customers;
CREATE VIEW Customers AS
SELECT *,
KMEANS(3, ROW(latitude, longitude)) AS MEASURE
region,
(SELECT SUM(revenue)
FROM Orders AS o
WHERE o.customerId = c.id) AS MEASURE revenue
FROM BaseCustomers AS c;
region is a measure (based on the
centroid of a cluster)
42. Clustering: fixing the baseline
The measure is a little too dynamic. Fix the baseline, so that cluster centroids don’t
change from one query to the next:
SELECT id, firstName, lastName, firstPurchaseDate,
latitude, longitude,
region AT (ALL
SET YEAR(firstPurchaseDate) = 2020)
FROM Customers;
43. Clustering: amortizing the cost
To amortize the cost of the algorithm, create a materialized view:
CREATE MATERIALIZED VIEW CustomersMV AS
SELECT *,
region AT (ALL
SET YEAR(firstPurchaseDate) = 2020) AS region2020
FROM Customers;
44. Classification
Classification predicts the value of a variable given the values of other variables and a
model trained on similar data.
For example, does a particular household own a dog?
Whether they have a dog may depend on household income, education level, location
of the household, purchasing history of the household.
SELECT last_name, zipcode, probability_that_household_has_dog,
expected_dog_count
FROM Customers
WHERE state = ‘AZ’
45. Classification: training & running
Pseudo-function CLASSIFY:
SELECT last_name, zipcode,
CLASSIFY(firstPurchaseDate = ‘2023-05-01’,
has_dog,
ROW (zipcode, state, income_level, education_level))
AS probability_that_household_has_dog,
expected_dog_count
FROM Customers
GROUP BY state
FUNCTION classify(isTraining, actualValue, features)
We assume that has_dog has
the correct value for customers
who purchased on 2023-05-01
A SQL view can both train the algorithm (given the correct
result) and execute it (generating the result from features):
46. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
47. 1. Relational model vs
dimensional model
2. Adding measures to SQL
3. Machine-learning patterns
4. Semantic layer
49. Natural language query
Example query:
“Show me the top 5 products in each state where revenue declined since last year”
“Revenue” is a measure.
“Declined since last year” asks whether
revenue - revenue AT (SET year = CURRENT year - 1)
is negative.
“Products in each state” establishes the filter context.
51. Extended semantic model
“Show me regions where customers ordered low-inventory products last year”
Data model is a graph that
connects business views:
● Business views – tables,
possibly based on joins, with
measures, and display hints
● Domains – shared attributes
● Entities – shared dimensions
● Metrics – shared measures
● Ontology/synonyms
Do we need a new query language?
orders
product
customer
warehouse
shipments
inventory
geography
52.
53. Summary
Measures in SQL allow…
● concise queries without self-joins
● top-down evaluation
● reusable calculations
● natural-language query
…and don’t break SQL
A semantic model is table with measures, accessed via analytic SQL..
A extended semantic model links such tables into a knowledge graph.
54. Resources
Papers
● “Modeling multidimensional databases”
(Agrawal, Gupta, and Sarawagi, 1997)
● “WinMagic: Subquery Elimination Using
Window Aggregation” (Zuzarte, Pirahash,
Ma, Cheng, Liu, and Wong, 2003)
● “Analyza: Exploring Data with
Conversation” (Dhamdhere, McCurley,
Nahmias, Sundararajan, Yan, 2017)
Issues
● [CALCITE-4488] WITHIN DISTINCT
clause for aggregate functions
(experimental)
● [CALCITE-4496] Measure columns
("SELECT ... AS MEASURE")
● [CALCITE-5105] Add MEASURE type and
AGGREGATE aggregate function
● [CALCITE-5155] Custom time frames
● [CALCITE-xxxx] PER operator
● [CALCITE-5692] Add AT operator, for
context-sensitive expressions
● [CALCITE-5951] PRECEDES function, for
period-to-date calculations