SlideShare a Scribd company logo
Measures in SQL
Julian Hyde (Google)
2024-05-22
SF Distributed Systems Meetup
San Francisco, California
Measures in SQL
ABSTRACT
SQL has attained widespread adoption, but Business Intelligence tools still use their
own higher level languages based upon a multidimensional paradigm. Composable
calculations are what is missing from SQL, and we propose a new kind of column,
called a measure, that attaches a calculation to a table. Like regular tables, tables
with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional
languages but retains SQL semantics. Measure invocations can be expanded in place
to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive
expressions (a way to evaluate multidimensional expressions that is consistent with
existing SQL semantics), a concept called evaluation context, and several operations
for setting and modifying the evaluation context.
SIGMOD, June 9–15, 2024, Santiago, Chile
Julian Hyde
Google Inc.
San Francisco, CA, USA
julianhyde@google.com
John Fremlin
Google Inc.
New York, NY, USA
fremlin@google.com
1. Problem
Tables are broken!
Tables are unable to provide reusable calculations.
Problem: Calculate profit margin of orders
SELECT prodName,
(SUM(revenue) - SUM(cost))
/ SUM(revenue) AS profitMargin
FROM Orders
WHERE prodName = ‘Happy’;
profitMargin
============
0.47
prodName custName orderDate revenue cost
Happy Alice 2023/11/28 6 4
Acme Bob 2023/11/27 5 2
Happy Alice 2024/11/28 7 4
Whizz Celia 2023/11/25 3 1
Happy Bob 2022/11/27 4 1
SELECT prodName,
(SUM(revenue) - SUM(cost))
/ SUM(revenue) AS profitMargin
FROM Orders
WHERE prodName = ‘Happy’;
profitMargin
============
0.47
Attempted solution: Create a view
SELECT AVG(profitMargin) AS profitMargin
FROM SummarizedOrders
WHERE prodName = ‘Happy’;
profitMargin
============
0.50
CREATE VIEW SummarizedOrders AS
SELECT prodName, orderDate,
(SUM(revenue) - SUM(cost))
/ SUM(revenue) AS profitMargin
FROM Orders
GROUP BY prodName, orderDate;
prodName custName orderDate revenue cost
Happy Alice 2023/11/28 6 4
Acme Bob 2023/11/27 5 2
Happy Alice 2024/11/28 7 4
Whizz Celia 2023/11/25 3 1
Happy Bob 2022/11/27 4 1
SELECT prodName,
(SUM(revenue) - SUM(cost))
/ SUM(revenue) AS profitMargin
FROM Orders
WHERE prodName = ‘Happy’;
profitMargin
============
0.47
2. Theory
1. Allow tables to have measures DESCRIBE EnhancedOrders;
column type
============ =====
prodName STRING
custName STRING
orderDate DATE
revenue INTEGER
cost INTEGER
profitMargin DOUBLE MEASURE
2. Operators for evaluating measures SELECT prodName, profitMargin
FROM EnhancedOrders
GROUP BY prodName;
prodName profitMargin
======== ============
Acme 0.60
Happy 0.47
Whizz 0.67
3. Syntax to define measures in a query SELECT *,
(SUM(revenue) - SUM(cost)) / SUM(revenue)
AS MEASURE profitMargin
FROM Orders
GROUP BY prodName;
Extend the relational model with measures
SELECT prodName,
profitMargin
FROM EnhancedOrders
GROUP BY prodName;
Definitions
A context-sensitive expression (CSE) is an expression
whose value is determined by an evaluation context.
An evaluation context is a predicate whose terms are
one or more columns from the same table.
● This set of columns is the dimensionality of the
CSE.
A measure is a special kind of column that becomes a
CSE when used in a query.
● A measure’s dimensionality is the set of
non-measure columns in its table.
● The data type of a measure that returns a value of
type t is t MEASURE, e.g. INTEGER MEASURE.
prodName profitMargin
======== ============
Acme 0.60
Happy 0.50
Whizz 0.67
SELECT (SUM(revenue) - SUM(cost))
/ SUM(revenue) AS profitMargin
FROM Orders
WHERE prodName = ‘Acme’;
profitMargin
============
0.60
profitMargin is a
measure (and a
CSE)
Dimensionality is
{prodName, custName,
orderDate, revenue,
cost}
Evaluation context
for this cell is
prodName = ‘Acme’
SELECT (SUM(revenue) - SUM(cost))
/ SUM(revenue) AS m
FROM Orders
WHERE prodName = ‘Acme’;
m
====
0.60
SELECT (SUM(revenue) - SUM(cost))
/ SUM(revenue) AS m
FROM Orders
WHERE prodName = ‘Happy’;
m
====
0.50
SELECT (SUM(revenue) - SUM(cost))
/ SUM(revenue) AS m
FROM Orders
WHERE prodName = ‘Whizz’
AND custName = ‘Bob’;
m
====
NULL
SELECT prodName,
profitMargin,
profitMargin
AT (SET prodName = ‘Happy’)
AS happyMargin,
profitMargin
AT (SET custName = ‘Bob’)
AS bobMargin
FROM EnhancedOrders
GROUP BY prodName;
AT operator
The context transformation operator AT modifies the
evaluation context.
Syntax:
expression AT (contextModifier…)
contextModifier ::=
WHERE predicate
| ALL
| ALL dimension
| SET dimension = [CURRENT] expression
| VISIBLE
prodName profitMargin happyMargin bobMargin
======== ============ =========== =========
Acme 0.60 0.50 0.60
Happy 0.50 0.50 0.75
Whizz 0.67 0.50 NULL
Evaluation context for
this cell is
prodName = ‘Whizz’
AND custName = ‘Bob’
3. Consequences
Grain-locking
What is the average age of the customer
who would ordered each product?
When we use an aggregate function in a
join query, it will ‘double count’ if the
join duplicates rows.
This is generally not we want for
measures – except if we want a
weighted average – but is difficult to
avoid in SQL.
Measures are locked to the grain of the
table that defined them.
WITH EnhancedCustomers AS (
SELECT *,
AVG(custAge) AS MEASURE avgAge
FROM Customers)
SELECT o.prodName,
AVG(c.custAge) AS weightedAvgAge,
c.avgAge AS avgAge
FROM Orders AS o
JOIN EnhancedCustomers AS c USING (custName)
GROUP BY o.prodName;
prodName weightedAvgAge avgAge
======== ============== ======
Acme 41 41
Happy 29 32
Whizz 17 17
prodName custName orderDate revenue cost
Happy Alice 2023/11/28 6 4
Acme Bob 2023/11/27 5 2
Happy Alice 2024/11/28 7 4
Whizz Celia 2023/11/25 3 1
Happy Bob 2022/11/27 4 1
custName custAge
Alice 23
Bob 41
Celia 17
Alice (age 23)
has two orders;
Bob (age 41) has
one order.
SELECT *,
SUM(cost) AS MEASURE sumCost,
SUM(revenue) AS MEASURE sumRevenue
FROM Orders
Composition & closure
Just as tables are closed under queries, so
tables-with-measures are closed under
queries-with-measures
Measures can reference measures
Complex analytical calculations without
touching the FROM clause
Evaluation contexts can be nested
SELECT *,
SUM(cost) AS MEASURE sumCost,
SUM(revenue) AS MEASURE sumRevenue,
(sumRevenue - sumCost) / sumRevenue
AS MEASURE profitMargin
FROM Orders
SELECT *,
SUM(cost) AS MEASURE sumCost,
SUM(revenue) AS MEASURE sumRevenue,
(sumRevenue - sumCost) / sumRevenue
AS MEASURE profitMargin,
sumRevenue
- sumRevenue AT (SET YEAR(orderDate)
= CURRENT YEAR(orderDate) - 1)
AS MEASURE revenueGrowthYoY
FROM Orders
SELECT *,
SUM(cost) AS MEASURE sumCost,
SUM(revenue) AS MEASURE sumRevenue,
(sumRevenue - sumCost) / sumRevenue
AS MEASURE profitMargin,
sumRevenue
- sumRevenue AT (SET YEAR(orderDate)
= CURRENT YEAR(orderDate) - 1)
AS MEASURE revenueGrowthYoY,
ARRAY_AGG(productId
ORDER BY sumRevenue DESC LIMIT 5)
AT (ALL productId)
AS MEASURE top5Products
FROM Orders;
SELECT *,
SUM(cost) AS MEASURE sumCost,
SUM(revenue) AS MEASURE sumRevenue,
(sumRevenue - sumCost) / sumRevenue
AS MEASURE profitMargin,
sumRevenue
- sumRevenue AT (SET YEAR(orderDate)
= CURRENT YEAR(orderDate) - 1)
AS MEASURE revenueGrowthYoY,
ARRAY_AGG(productId
ORDER BY sumRevenue DESC LIMIT 5)
AT (ALL productId)
AS MEASURE top5Products,
ARRAY_AGG(customerId
ORDER BY sumRevenue DESC LIMIT 3)
AT (ALL customerId
SET productId MEMBER OF top5Products
AT (SET YEAR(orderDate)
= CURRENT YEAR(orderDate) - 1))
AS MEASURE top3CustomersOfTop5Products
FROM Orders;
Relational algebra (bottom-up) Multidimensional (top-down)
Products
Customers
⨝
⨝
Σ
⨝
σ
Orders
Products
Customers
⨝
⨝
Σ
σ
Orders
π
(CustState: ‘CA’,
Date: ‘1994’,
Product: all)
(CustState: ‘CA’,
Date: ‘1995’,
Product: all)
Customer
Product
Date
Bottom-up vs Top-down query
Represent a Business Intelligence model as a SQL view
Orders Products
Customers
CREATE VIEW OrdersCube AS
SELECT *
FROM (
SELECT o.orderDate AS `order.date`,
o.revenue AS `order.revenue`,
SUM(o.revenue) AS MEASURE `order.sum_revenue`
FROM Orders) AS o
LEFT JOIN (
SELECT c.custName AS `customer.name`,
c.state AS `customer.state`,
c.custAge AS `customer.age`,
AVG(c.custAge) AS MEASURE `customer.avg_age`
FROM Customers) AS c
ON o.custName = c.custName
LEFT JOIN (
SELECT p.prodName AS `product.name`,
p.color AS `product.color`,
AVG(p.weight) AS MEASURE `product.avg_weight`
FROM Products) AS p
ON o.prodName = p.prodName;
SELECT `customer.state`, `product.avg_weight`
FROM OrdersCube
GROUP BY `customer.state`;
● SQL planner handles view expansion
● Grain locking makes it safe to use a
star schema
● Users can define new models simply
by writing queries
Implementing measures & CSEs as SQL rewrites
simple
complex
Complexity Query Expanded query
Simple measure can
be inlined
SELECT prodName, avgRevenue
FROM OrdersCube
GROUP BY prodName
SELECT prodName, AVG(revenue)
FROM orders
GROUP BY prodName
Join requires
grain-locking
SELECT prodName, avgAge
FROM OrdersCube
GROUP BY prodName
SELECT o.prodName, AVG(c.custAge PER
c.custName) FROM orders JOIN customers
GROUP BY prodName
→ (something with GROUPING SETS)
Period-over- period SELECT prodName, avgAge -
avgAge AT (SET year =
CURRENT year - 1)
FROM OrdersCube
GROUP BY prodName
(something with window aggregates)
Scalar subquery can
accomplish
anything
SELECT prodName, prodColor
avgAge AT (ALL custState
SET year = CURRENT year - 1)
FROM OrdersCube
GROUP BY prodName, prodColor
SELECT prodName, prodColor,
(SELECT … FROM orders
WHERE <evaluation context>)
FROM orders
GROUP BY prodName, prodColor
Summary
Top-down evaluation makes queries concise
Measures make calculations reusable
Measures don’t break SQL
Thank you!
Any questions?
@julianhyde
@JohnFremlin
@ApacheCalcite
https://calcite.apache.org
https://issues.apache.org/jira/browse/CALCITE-4496
http://www.hydromatic.net/hyde2024measures.pdf

More Related Content

Similar to Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)

Sql wksht-3
Sql wksht-3Sql wksht-3
Sql wksht-3
Mukesh Tekwani
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
AtScale
 
Sql basics 2
Sql basics 2Sql basics 2
Sql basics 2
rchakra
 
James Colby Maddox Business Intellignece and Computer Science Portfolio
James Colby Maddox Business Intellignece and Computer Science PortfolioJames Colby Maddox Business Intellignece and Computer Science Portfolio
James Colby Maddox Business Intellignece and Computer Science Portfolio
colbydaman
 
Part3 Explain the Explain Plan
Part3 Explain the Explain PlanPart3 Explain the Explain Plan
Part3 Explain the Explain Plan
Maria Colgan
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
Julian Hyde
 
Project report aditi paul1
Project report aditi paul1Project report aditi paul1
Project report aditi paul1
guest9529cb
 
Meet the CBO in Version 11g
Meet the CBO in Version 11gMeet the CBO in Version 11g
Meet the CBO in Version 11g
Sage Computing Services
 
Oracle: Functions
Oracle: FunctionsOracle: Functions
Oracle: Functions
DataminingTools Inc
 
Oracle: Functions
Oracle: FunctionsOracle: Functions
Oracle: Functions
oracle content
 
Business Intelligence Portfolio
Business Intelligence PortfolioBusiness Intelligence Portfolio
Business Intelligence Portfolio
Bob Litsinger
 
Intro to SQL for Beginners
Intro to SQL for BeginnersIntro to SQL for Beginners
Intro to SQL for Beginners
Product School
 
SSAS and MDX
SSAS and MDXSSAS and MDX
SSAS and MDX
carmenfaber
 
Data Modeling in Looker
Data Modeling in LookerData Modeling in Looker
Data Modeling in Looker
Looker
 
U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)
Michael Rys
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 
Simplifying SQL with CTE's and windowing functions
Simplifying SQL with CTE's and windowing functionsSimplifying SQL with CTE's and windowing functions
Simplifying SQL with CTE's and windowing functions
Clayton Groom
 
Use Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
Use Oracle 9i Summary Advisor To Better Manage Your Data WarehouseUse Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
Use Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
info_sunrise24
 
Advance Sql Server Store procedure Presentation
Advance Sql Server Store procedure PresentationAdvance Sql Server Store procedure Presentation
Advance Sql Server Store procedure Presentation
Amin Uddin
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
KalyankumarVenkat1
 

Similar to Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22) (20)

Sql wksht-3
Sql wksht-3Sql wksht-3
Sql wksht-3
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
Sql basics 2
Sql basics 2Sql basics 2
Sql basics 2
 
James Colby Maddox Business Intellignece and Computer Science Portfolio
James Colby Maddox Business Intellignece and Computer Science PortfolioJames Colby Maddox Business Intellignece and Computer Science Portfolio
James Colby Maddox Business Intellignece and Computer Science Portfolio
 
Part3 Explain the Explain Plan
Part3 Explain the Explain PlanPart3 Explain the Explain Plan
Part3 Explain the Explain Plan
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Project report aditi paul1
Project report aditi paul1Project report aditi paul1
Project report aditi paul1
 
Meet the CBO in Version 11g
Meet the CBO in Version 11gMeet the CBO in Version 11g
Meet the CBO in Version 11g
 
Oracle: Functions
Oracle: FunctionsOracle: Functions
Oracle: Functions
 
Oracle: Functions
Oracle: FunctionsOracle: Functions
Oracle: Functions
 
Business Intelligence Portfolio
Business Intelligence PortfolioBusiness Intelligence Portfolio
Business Intelligence Portfolio
 
Intro to SQL for Beginners
Intro to SQL for BeginnersIntro to SQL for Beginners
Intro to SQL for Beginners
 
SSAS and MDX
SSAS and MDXSSAS and MDX
SSAS and MDX
 
Data Modeling in Looker
Data Modeling in LookerData Modeling in Looker
Data Modeling in Looker
 
U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
Simplifying SQL with CTE's and windowing functions
Simplifying SQL with CTE's and windowing functionsSimplifying SQL with CTE's and windowing functions
Simplifying SQL with CTE's and windowing functions
 
Use Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
Use Oracle 9i Summary Advisor To Better Manage Your Data WarehouseUse Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
Use Oracle 9i Summary Advisor To Better Manage Your Data Warehouse
 
Advance Sql Server Store procedure Presentation
Advance Sql Server Store procedure PresentationAdvance Sql Server Store procedure Presentation
Advance Sql Server Store procedure Presentation
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
 

More from Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
Julian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
Julian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
Julian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
Julian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Julian Hyde
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
Julian Hyde
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
Julian Hyde
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
Julian Hyde
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
Julian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
Julian Hyde
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
Julian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Julian Hyde
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
Julian Hyde
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 

More from Julian Hyde (20)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Recently uploaded

Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 

Recently uploaded (20)

Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)

  • 1. Measures in SQL Julian Hyde (Google) 2024-05-22 SF Distributed Systems Meetup San Francisco, California Measures in SQL ABSTRACT SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries. SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL. To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context. SIGMOD, June 9–15, 2024, Santiago, Chile Julian Hyde Google Inc. San Francisco, CA, USA julianhyde@google.com John Fremlin Google Inc. New York, NY, USA fremlin@google.com
  • 3. Tables are broken! Tables are unable to provide reusable calculations.
  • 4. Problem: Calculate profit margin of orders SELECT prodName, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profitMargin FROM Orders WHERE prodName = ‘Happy’; profitMargin ============ 0.47 prodName custName orderDate revenue cost Happy Alice 2023/11/28 6 4 Acme Bob 2023/11/27 5 2 Happy Alice 2024/11/28 7 4 Whizz Celia 2023/11/25 3 1 Happy Bob 2022/11/27 4 1 SELECT prodName, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profitMargin FROM Orders WHERE prodName = ‘Happy’; profitMargin ============ 0.47
  • 5. Attempted solution: Create a view SELECT AVG(profitMargin) AS profitMargin FROM SummarizedOrders WHERE prodName = ‘Happy’; profitMargin ============ 0.50 CREATE VIEW SummarizedOrders AS SELECT prodName, orderDate, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profitMargin FROM Orders GROUP BY prodName, orderDate; prodName custName orderDate revenue cost Happy Alice 2023/11/28 6 4 Acme Bob 2023/11/27 5 2 Happy Alice 2024/11/28 7 4 Whizz Celia 2023/11/25 3 1 Happy Bob 2022/11/27 4 1 SELECT prodName, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profitMargin FROM Orders WHERE prodName = ‘Happy’; profitMargin ============ 0.47
  • 7. 1. Allow tables to have measures DESCRIBE EnhancedOrders; column type ============ ===== prodName STRING custName STRING orderDate DATE revenue INTEGER cost INTEGER profitMargin DOUBLE MEASURE 2. Operators for evaluating measures SELECT prodName, profitMargin FROM EnhancedOrders GROUP BY prodName; prodName profitMargin ======== ============ Acme 0.60 Happy 0.47 Whizz 0.67 3. Syntax to define measures in a query SELECT *, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS MEASURE profitMargin FROM Orders GROUP BY prodName; Extend the relational model with measures
  • 8. SELECT prodName, profitMargin FROM EnhancedOrders GROUP BY prodName; Definitions A context-sensitive expression (CSE) is an expression whose value is determined by an evaluation context. An evaluation context is a predicate whose terms are one or more columns from the same table. ● This set of columns is the dimensionality of the CSE. A measure is a special kind of column that becomes a CSE when used in a query. ● A measure’s dimensionality is the set of non-measure columns in its table. ● The data type of a measure that returns a value of type t is t MEASURE, e.g. INTEGER MEASURE. prodName profitMargin ======== ============ Acme 0.60 Happy 0.50 Whizz 0.67 SELECT (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profitMargin FROM Orders WHERE prodName = ‘Acme’; profitMargin ============ 0.60 profitMargin is a measure (and a CSE) Dimensionality is {prodName, custName, orderDate, revenue, cost} Evaluation context for this cell is prodName = ‘Acme’
  • 9. SELECT (SUM(revenue) - SUM(cost)) / SUM(revenue) AS m FROM Orders WHERE prodName = ‘Acme’; m ==== 0.60 SELECT (SUM(revenue) - SUM(cost)) / SUM(revenue) AS m FROM Orders WHERE prodName = ‘Happy’; m ==== 0.50 SELECT (SUM(revenue) - SUM(cost)) / SUM(revenue) AS m FROM Orders WHERE prodName = ‘Whizz’ AND custName = ‘Bob’; m ==== NULL SELECT prodName, profitMargin, profitMargin AT (SET prodName = ‘Happy’) AS happyMargin, profitMargin AT (SET custName = ‘Bob’) AS bobMargin FROM EnhancedOrders GROUP BY prodName; AT operator The context transformation operator AT modifies the evaluation context. Syntax: expression AT (contextModifier…) contextModifier ::= WHERE predicate | ALL | ALL dimension | SET dimension = [CURRENT] expression | VISIBLE prodName profitMargin happyMargin bobMargin ======== ============ =========== ========= Acme 0.60 0.50 0.60 Happy 0.50 0.50 0.75 Whizz 0.67 0.50 NULL Evaluation context for this cell is prodName = ‘Whizz’ AND custName = ‘Bob’
  • 11. Grain-locking What is the average age of the customer who would ordered each product? When we use an aggregate function in a join query, it will ‘double count’ if the join duplicates rows. This is generally not we want for measures – except if we want a weighted average – but is difficult to avoid in SQL. Measures are locked to the grain of the table that defined them. WITH EnhancedCustomers AS ( SELECT *, AVG(custAge) AS MEASURE avgAge FROM Customers) SELECT o.prodName, AVG(c.custAge) AS weightedAvgAge, c.avgAge AS avgAge FROM Orders AS o JOIN EnhancedCustomers AS c USING (custName) GROUP BY o.prodName; prodName weightedAvgAge avgAge ======== ============== ====== Acme 41 41 Happy 29 32 Whizz 17 17 prodName custName orderDate revenue cost Happy Alice 2023/11/28 6 4 Acme Bob 2023/11/27 5 2 Happy Alice 2024/11/28 7 4 Whizz Celia 2023/11/25 3 1 Happy Bob 2022/11/27 4 1 custName custAge Alice 23 Bob 41 Celia 17 Alice (age 23) has two orders; Bob (age 41) has one order.
  • 12. SELECT *, SUM(cost) AS MEASURE sumCost, SUM(revenue) AS MEASURE sumRevenue FROM Orders Composition & closure Just as tables are closed under queries, so tables-with-measures are closed under queries-with-measures Measures can reference measures Complex analytical calculations without touching the FROM clause Evaluation contexts can be nested SELECT *, SUM(cost) AS MEASURE sumCost, SUM(revenue) AS MEASURE sumRevenue, (sumRevenue - sumCost) / sumRevenue AS MEASURE profitMargin FROM Orders SELECT *, SUM(cost) AS MEASURE sumCost, SUM(revenue) AS MEASURE sumRevenue, (sumRevenue - sumCost) / sumRevenue AS MEASURE profitMargin, sumRevenue - sumRevenue AT (SET YEAR(orderDate) = CURRENT YEAR(orderDate) - 1) AS MEASURE revenueGrowthYoY FROM Orders SELECT *, SUM(cost) AS MEASURE sumCost, SUM(revenue) AS MEASURE sumRevenue, (sumRevenue - sumCost) / sumRevenue AS MEASURE profitMargin, sumRevenue - sumRevenue AT (SET YEAR(orderDate) = CURRENT YEAR(orderDate) - 1) AS MEASURE revenueGrowthYoY, ARRAY_AGG(productId ORDER BY sumRevenue DESC LIMIT 5) AT (ALL productId) AS MEASURE top5Products FROM Orders; SELECT *, SUM(cost) AS MEASURE sumCost, SUM(revenue) AS MEASURE sumRevenue, (sumRevenue - sumCost) / sumRevenue AS MEASURE profitMargin, sumRevenue - sumRevenue AT (SET YEAR(orderDate) = CURRENT YEAR(orderDate) - 1) AS MEASURE revenueGrowthYoY, ARRAY_AGG(productId ORDER BY sumRevenue DESC LIMIT 5) AT (ALL productId) AS MEASURE top5Products, ARRAY_AGG(customerId ORDER BY sumRevenue DESC LIMIT 3) AT (ALL customerId SET productId MEMBER OF top5Products AT (SET YEAR(orderDate) = CURRENT YEAR(orderDate) - 1)) AS MEASURE top3CustomersOfTop5Products FROM Orders;
  • 13. Relational algebra (bottom-up) Multidimensional (top-down) Products Customers ⨝ ⨝ Σ ⨝ σ Orders Products Customers ⨝ ⨝ Σ σ Orders π (CustState: ‘CA’, Date: ‘1994’, Product: all) (CustState: ‘CA’, Date: ‘1995’, Product: all) Customer Product Date Bottom-up vs Top-down query
  • 14. Represent a Business Intelligence model as a SQL view Orders Products Customers CREATE VIEW OrdersCube AS SELECT * FROM ( SELECT o.orderDate AS `order.date`, o.revenue AS `order.revenue`, SUM(o.revenue) AS MEASURE `order.sum_revenue` FROM Orders) AS o LEFT JOIN ( SELECT c.custName AS `customer.name`, c.state AS `customer.state`, c.custAge AS `customer.age`, AVG(c.custAge) AS MEASURE `customer.avg_age` FROM Customers) AS c ON o.custName = c.custName LEFT JOIN ( SELECT p.prodName AS `product.name`, p.color AS `product.color`, AVG(p.weight) AS MEASURE `product.avg_weight` FROM Products) AS p ON o.prodName = p.prodName; SELECT `customer.state`, `product.avg_weight` FROM OrdersCube GROUP BY `customer.state`; ● SQL planner handles view expansion ● Grain locking makes it safe to use a star schema ● Users can define new models simply by writing queries
  • 15. Implementing measures & CSEs as SQL rewrites simple complex Complexity Query Expanded query Simple measure can be inlined SELECT prodName, avgRevenue FROM OrdersCube GROUP BY prodName SELECT prodName, AVG(revenue) FROM orders GROUP BY prodName Join requires grain-locking SELECT prodName, avgAge FROM OrdersCube GROUP BY prodName SELECT o.prodName, AVG(c.custAge PER c.custName) FROM orders JOIN customers GROUP BY prodName → (something with GROUPING SETS) Period-over- period SELECT prodName, avgAge - avgAge AT (SET year = CURRENT year - 1) FROM OrdersCube GROUP BY prodName (something with window aggregates) Scalar subquery can accomplish anything SELECT prodName, prodColor avgAge AT (ALL custState SET year = CURRENT year - 1) FROM OrdersCube GROUP BY prodName, prodColor SELECT prodName, prodColor, (SELECT … FROM orders WHERE <evaluation context>) FROM orders GROUP BY prodName, prodColor
  • 16. Summary Top-down evaluation makes queries concise Measures make calculations reusable Measures don’t break SQL