NoSQL no more: SQL on Druid with Apache Calcite

NoSQL no more
SQL on Druid with Apache Calcite
Gian Merlino
gian@imply.io

Who am I?
Gian Merlino
Committer & PMC member on Druid
Committer on Apache Calcite
Cofounder at Imply

Agenda
● What is Druid?
● What is NoSQL?
● What is Apache Calcite?
● From NoSQL to SQL
● Do try this at home!

Druid: open source, high-performance,
column-oriented, distributed data store

What is Druid?
● “high performance”: low query latency, high ingest rates
● “column-oriented”: best possible scan rates
● “distributed”: deployed in clusters, typically 10s–100s of nodes
● “data store”: the cluster stores a copy of your data

The Problem
● OLAP slice-and-dice for big data
● Interactive exploration
● Look under the hood of reports and dashboards
● And we want our data fresh, too

Challenges
● Scale: big data is tough to process quickly
● Complexity: too much fine grain to precompute
● High dimensionality: 10s or 100s of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams

Motivation
● Sub-second responses allow dialogue with data
● Rapid iteration on questions
● Remove barriers to understanding

Powered by Druid
Source: http://druid.io/druid-powered.html

Powered by Druid
From Yahoo:
“The performance is great ... some of the tables that we have
internally in Druid have billions and billions of events in them,
and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

Druid Key Features
● Low latency ingestion from Kafka
● Bulk load from Hadoop
● Can pre-aggregate data during ingestion
● “Schema light”
● Ad-hoc queries
● Exact and approximate algorithms
● Can keep a lot of history (years are ok)

Druid
Druid makes interactive data exploration fast and
flexible, and powers analytic applications.

What is NoSQL?
“There's no strong definition of the concept out there, no
trademarks, no standard group, not even a manifesto.”
Source: https://martinfowler.com/bliki/NosqlDefinition.html

What is NoSQL?
Early examples:
Voldemort, Cassandra, Dynomite,
HBase, Hypertable, CouchDB, MongoDB

What is NoSQL?
What are they?
● Document stores
● Key/value stores
● Graph databases
● Timeseries databases

What is NoSQL?
● Not using the relational model (nor the SQL language)
● Open source
● Designed to run on large clusters
● Based on the needs of 21st century web properties
● No schema, allowing fields to be added to any record without
controls

Categorizing Druid
● Open source

Categorizing Druid
Is avoiding the SQL language and
relational model really a good thing?

The Relational Model
● The relational model is based around relations
● SQL calls them tables and those tables have columns
● SQL queries describe relational operations
○ Scan
○ Project
○ Filter
○ Aggregate
○ Union
○ Join
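As a rough illustration, the six operators listed above can be sketched in Python over lists of dicts. The function names and the in-memory rows are assumptions made for this sketch (the rows borrow from the “sales” and “products” example tables); nothing here is Druid or Calcite code.

```python
from collections import defaultdict

# Hypothetical sample relations, modeled as lists of dicts.
sales = [
    {"product_id": 212, "user_id": 1, "revenue": 180.00},
    {"product_id": 998, "user_id": 2, "revenue": 24.95},
]
products = [
    {"id": 212, "name": "Office chair"},
    {"id": 998, "name": "Coffee mug, 2-pack"},
]

def scan(table):
    """Scan: read every row of a relation."""
    return list(table)

def project(rows, columns):
    """Project: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]

def filter_rows(rows, predicate):
    """Filter: keep rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]

def aggregate(rows, key, column):
    """Aggregate: SUM(column) grouped by key."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[column]
    return [{key: k, column: v} for k, v in totals.items()]

def union(left, right):
    """Union: concatenate two relations (UNION ALL semantics)."""
    return left + right

def join(left, right, left_key, right_key):
    """Join: inner equi-join using a hash index on the right side."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]

# Revenue by product name: scan, join, project, aggregate.
joined = join(scan(sales), scan(products), "product_id", "id")
by_name = aggregate(project(joined, ["name", "revenue"]), "name", "revenue")
```

A SQL query is, in effect, a declarative description of a composition like the last two lines; the database decides how to execute it.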

The Relational Model
Table: “sales”
timestamp   product_id  user_id  revenue
2030-01-01  212         1        180.00
2030-01-01  998         2        24.95
Table: “products”
id   name
212  Office chair
998  Coffee mug, 2-pack
Table: “users”
id  country  city      user_gender  user_age
1   US       New York  F            34
2   FR       Paris     M            28

Druid and the Relational Model
Datasource: “sales”
timestamp   product             country  city      gender  age  revenue
2030-01-01  Office chair        US       New York  F       34   180.00
2030-01-01  Coffee mug, 2-pack  FR       Paris     M       28   24.95
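The flattening shown on this slide can be sketched as a small denormalization step, here in Python. This is only an illustration of the idea: the `flatten` helper and the dict-based tables are assumptions for the sketch, and in practice the flattening would happen in whatever pipeline feeds Druid.

```python
# Normalized inputs, mirroring the "sales", "products", and "users" tables.
sales = [
    {"timestamp": "2030-01-01", "product_id": 212, "user_id": 1, "revenue": 180.00},
    {"timestamp": "2030-01-01", "product_id": 998, "user_id": 2, "revenue": 24.95},
]
products = {212: "Office chair", 998: "Coffee mug, 2-pack"}
users = {
    1: {"country": "US", "city": "New York", "gender": "F", "age": 34},
    2: {"country": "FR", "city": "Paris", "gender": "M", "age": 28},
}

def flatten(sale):
    """Resolve both foreign keys so the row needs no join at query time."""
    row = {
        "timestamp": sale["timestamp"],
        "product": products[sale["product_id"]],
    }
    row.update(users[sale["user_id"]])
    row["revenue"] = sale["revenue"]
    return row

# One wide row per sale, matching the flat "sales" datasource layout.
flat_sales = [flatten(s) for s in sales]
```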

Datasource: “sales”
timestamp   product_id  country  city      gender  age  revenue
2030-01-01  212         US       New York  F       34   180.00
2030-01-01  998         FR       Paris     M       28   24.95
Lookup: “products”
id   name
212  Office chair

● Datasources are like tables
○ Druid “lookups” apply to a common join use case
○ Big, flat tables are common in SQL databases anyway, when
analytical performance is critical
● Benefits of offering SQL
○ Developers and analysts know it
○ Integration with 3rd party apps

Apache Calcite
● SQL parser
● Query optimizer
● Query interpreter
● JDBC server (Avatica)

Apache Calcite
● Widely used
○ Druid
○ Hive
○ Storm
○ Samza
○ Drill
○ Phoenix
○ Flink

Apache Calcite
SQL
-> SqlNode (parse tree)
-> RelNode (relational operator tree)
-> RelNode (optimized in target calling convention)

SQL query
SELECT dim1, COUNT(*)
FROM druid.foo
WHERE dim1 IN ('abc', 'def', 'ghi')
GROUP BY dim1

SQL parse tree
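SELECT dim1, COUNT(*)
FROM druid.foo
WHERE dim1 IN ('abc', 'def', 'ghi')
GROUP BY dim1
[Figure: the query annotated with parse tree node types: “Select” keyword, identifier, operator, literal, “Where” keyword, “Group by” keyword]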

Relational operators
SELECT dim1, COUNT(*)
FROM druid.foo
WHERE dim1 IN ('abc', 'def', 'ghi')
GROUP BY dim1
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])
  LogicalProject(dim1=[$2])
    LogicalFilter(condition=[OR(=($2, 'abc'), =($2, 'def'), =($2, 'ghi'))])
      LogicalTableScan(table=[[druid, foo]])

Query planner
● Planner rules
○ Match certain relational operator patterns
○ Can transform one set of operators into another
○ New set must have same behavior, but may have a different cost
● HepPlanner (heuristic)
○ Applies all matching rules
● VolcanoPlanner (cost based)
○ Applies rules while searching for low cost plans
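To make the idea of a planner rule concrete, here is a toy sketch in Python. It is not Calcite's rule API (which is Java); the `Eq`, `Or`, and `In` classes and both function names are hypothetical. The rule matches one pattern, an OR of equality tests on a single column, and rewrites it to an equivalent IN expression, and the driver keeps applying matching rules until none fires, loosely in the spirit of a heuristic planner.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Eq:
    column: str
    value: str

@dataclass(frozen=True)
class Or:
    operands: Tuple[object, ...]

@dataclass(frozen=True)
class In:
    column: str
    values: Tuple[str, ...]

def or_of_equals_to_in(expr) -> Optional[In]:
    """Rule: OR(col = a, col = b, ...) on one column becomes col IN (a, b, ...)."""
    if not isinstance(expr, Or) or not expr.operands:
        return None
    if not all(isinstance(op, Eq) for op in expr.operands):
        return None
    columns = {op.column for op in expr.operands}
    if len(columns) != 1:
        return None  # pattern does not match: more than one column involved
    return In(columns.pop(), tuple(op.value for op in expr.operands))

def apply_rules(expr, rules):
    """Apply any matching rule repeatedly until the expression stops changing."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(expr)
            if rewritten is not None and rewritten != expr:
                expr, changed = rewritten, True
    return expr

condition = Or((Eq("dim1", "abc"), Eq("dim1", "def"), Eq("dim1", "ghi")))
optimized = apply_rules(condition, [or_of_equals_to_in])
```

The rewritten expression has the same behavior as the original but may be cheaper to evaluate, which is the contract every rule has to honor. A cost-based planner would additionally compare the costs of the alternatives instead of accepting every rewrite.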

Using Calcite
Calcite can be embedded or it can
be used directly by end-users.
Druid SQL embeds Calcite.

Native vs SQL
Native (JSON):
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "dimension": "countryName",
  "metric": {
    "type": "numeric",
    "metric": "added"
  },
  "intervals": "2018-03-01/2018-03-06",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "channel",
        "value": "#en.wikipedia",
        "extractionFn": null
      },
      {
        "type": "not",
        "field": {
          "type": "selector",
          "dimension": "countryName",
          "value": ""
        }
      }
    ]
  },
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "added",
      "fieldName": "added"
    }
  ],
  "threshold": 5
}
SQL:
SELECT
  countryName,
  SUM(added)
FROM wikipedia
WHERE
  channel = '#en.wikipedia'
  AND countryName IS NOT NULL
  AND __time BETWEEN '2018-03-01' AND '2018-03-06'
GROUP BY countryName
ORDER BY SUM(added) DESC
LIMIT 5
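For comparison, sending the SQL version to Druid takes very little client code. The sketch below builds, but does not send, an HTTP request in Python. It assumes Druid SQL is enabled and that a Broker is reachable at `http://localhost:8082` with the SQL endpoint at `/druid/v2/sql/`; both the address and the helper name `build_sql_request` are assumptions for illustration, so check them against your own deployment.

```python
import json
import urllib.request

SQL = """
SELECT
  countryName,
  SUM(added)
FROM wikipedia
WHERE
  channel = '#en.wikipedia'
  AND countryName IS NOT NULL
  AND __time BETWEEN '2018-03-01' AND '2018-03-06'
GROUP BY countryName
ORDER BY SUM(added) DESC
LIMIT 5
"""

def build_sql_request(broker_url, sql):
    """Wrap a SQL string in the JSON envelope expected by the SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql/",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_sql_request("http://localhost:8082", SQL)

# To run it against a live Broker (not executed in this sketch):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```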

SQL to Native translation
PartialDruidQuery (Druid’s query execution pipeline):
Scan -> Filter -> Project -> Aggregate -> Filter -> Project -> Sort

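SELECT dim1, COUNT(*)
FROM druid.foo
WHERE dim1 IN ('abc', 'def', 'ghi')
GROUP BY dim1
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])
  LogicalProject(dim1=[$2])
    LogicalFilter(condition=[OR(=($2, 'abc'), =($2, 'def'), =($2, 'ghi'))])
      LogicalTableScan(table=[[druid, foo]])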

PartialDruidQuery
  Scan(table=[[druid, foo]])
  Filter(condition=[OR(=($2, 'abc'), =($2, 'def'), =($2, 'ghi'))])
  Project(dim1=[$2])
  Aggregate(group=[{0}], EXPR$1=[COUNT()])
  Filter
  Project
  Sort
LogicalFilter(condition=[OR(=($2, 'abc'), =($2, 'def'), =($2, 'ghi'))])
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])

PartialDruidQuery
  Scan(table=[[druid, foo]])
  Filter(condition=[OR(=($2, 'abc'), =($2, 'def'), =($2, 'ghi'))])
  Project(dim1=[$2])
  Aggregate(group=[{0}], EXPR$1=[COUNT()])
  Filter
  Project
  Sort
toDruidQuery()
{
  "queryType" : "groupBy",
  "dataSource" : "foo",
  "filter" : {
    "type" : "in",
    "dimension" : "dim1",
    "values" : [ "abc", "def", "ghi" ]
  },
  "dimensions" : [ "dim1" ],
  "aggregations" : [ {
    "type" : "count",
    "name" : "a0"
  } ]
}

● Calcite implements:
○ SQL parser
○ Basic set of rules for reordering and combining operators
○ Rule-based optimizer frameworks
● Druid implements:
○ Construct Calcite catalog from Druid datasources
○ Cost functions guide reordering and combining operators
○ Rules to push operators one-by-one into a PartialDruidQuery
○ Convert PartialDruidQuery to DruidQuery
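The last two steps above, pushing operators one by one and then converting the result, can be sketched with a toy Python class. The real PartialDruidQuery is a Java class inside Druid with more stages and far more logic; the class name `PartialQuerySketch`, its `push` and `to_native` methods, and the simplified four-stage pipeline are all assumptions for this illustration, and the conversion only handles the grouped COUNT shape from the running example.

```python
class PartialQuerySketch:
    """Toy stand-in for PartialDruidQuery: collects stages in pipeline order."""

    STAGES = ("scan", "filter", "project", "aggregate")

    def __init__(self):
        self.stages = {}

    def push(self, stage, value):
        """Accept a stage only if it comes later in the pipeline than any stage so far."""
        position = self.STAGES.index(stage)  # raises ValueError for unknown stages
        latest = max((self.STAGES.index(s) for s in self.stages), default=-1)
        if position <= latest:
            raise ValueError(f"cannot push {stage!r} after {self.STAGES[latest]!r}")
        self.stages[stage] = value
        return self

    def to_native(self):
        """Convert the collected stages into a native-style groupBy query dict."""
        query = {"queryType": "groupBy", "dataSource": self.stages["scan"]}
        if "filter" in self.stages:
            dimension, values = self.stages["filter"]
            query["filter"] = {"type": "in", "dimension": dimension, "values": list(values)}
        query["dimensions"] = list(self.stages["project"])
        query["aggregations"] = list(self.stages["aggregate"])
        return query

native = (
    PartialQuerySketch()
    .push("scan", "foo")
    .push("filter", ("dim1", ["abc", "def", "ghi"]))
    .push("project", ["dim1"])
    .push("aggregate", [{"type": "count", "name": "a0"}])
    .to_native()
)
```

Rejecting out-of-order pushes is the point of the design: an operator tree that does not fit the fixed pipeline shape cannot be expressed as a single native query, so the planner has to find another arrangement.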

Minimal performance overhead.
Can even be faster due to transferring less data to the client!

Challenges: Writing good queries
SQL makes it surprisingly easy to write inefficient queries.
Databases strive to optimize as best as they can.
But the “EXPLAIN” tool is still essential.

Challenges: Schema-lightness
● Druid is schema-light (columns and their types are flexible)
● SQL model has tables and columns with specific types
● Druid native queries use type coercions at query time (e.g. user
specifies: treat column XYZ as “string”)
● Druid SQL populates catalog with latest metadata

Challenges: Lookups
Think back to lookups.
Datasource: “sales”
timestamp   product_id  country  city      gender  age  revenue
2030-01-01  212         US       New York  F       34   180.00
2030-01-01  998         FR       Paris     M       28   24.95
Lookup: “products”
id   name
212  Office chair

Challenges: Lookups
SQL experts may think of this as a JOIN.
SELECT
  products.name,
  SUM(sales.revenue)
FROM sales JOIN products ON sales.product_id = products.id
GROUP BY products.name

Challenges: Lookups
Druid SQL does not support JOINs, but provides a “LOOKUP”
function instead.
SELECT
  LOOKUP(product_id, 'products') AS product_name,
  SUM(sales.revenue)
FROM sales
GROUP BY product_name
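To see why a lookup can stand in for this kind of JOIN, the sketch below runs the JOIN query in SQLite (from the Python standard library) and compares it with a lookup-style computation that maps each key through an in-memory dictionary before grouping. SQLite is only a stand-in for a SQL engine here, not how Druid evaluates LOOKUP, and the third sales row is a made-up value added so the grouping has something to sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INTEGER, revenue REAL);
CREATE TABLE products (id INTEGER, name TEXT);
INSERT INTO sales VALUES (212, 180.00), (998, 24.95), (212, 90.00);
INSERT INTO products VALUES (212, 'Office chair'), (998, 'Coffee mug, 2-pack');
""")

# JOIN formulation, as a SQL expert would write it.
join_result = conn.execute("""
SELECT products.name, SUM(sales.revenue)
FROM sales JOIN products ON sales.product_id = products.id
GROUP BY products.name
ORDER BY products.name
""").fetchall()

# Lookup formulation: a key -> value map applied to each row, then grouped.
products_lookup = dict(conn.execute("SELECT id, name FROM products"))
totals = {}
for product_id, revenue in conn.execute("SELECT product_id, revenue FROM sales"):
    name = products_lookup.get(product_id)
    totals[name] = totals.get(name, 0.0) + revenue
lookup_result = sorted(totals.items())
```

Both formulations give the same totals when every key has exactly one match. They diverge for keys missing from the lookup: the inner JOIN drops those rows, while the map-based version groups them under a null name.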

Future work
● Druid features not supported in Druid SQL
○ Multi-value dimensions
○ Spatial filters
○ Theta sketches (approx. set intersection, differences)
● JOIN related
○ Allow users to write lookups as a SQL JOIN
○ Allow JOINs between two Druid datasources
● Others: SQL window functions, SQL UNION, GROUPING SETS

Download
Druid community site: http://druid.io/
Imply distribution: https://imply.io/get-started

Contribute
http://druid.io/community
https://github.com/druid-io/druid

Contribute
Druid has recently begun migration to the Apache Incubator.
Apache Druid is coming soon!
