A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
1. A Smarter Pig: Building a SQL
interface to Pig using Apache Calcite
Eli Levine & Julian Hyde
Apache: Big Data, Miami
2017/05/17
2. Julian Hyde @julianhyde
Original developer of Calcite
PMC member of Calcite, Drill, Eagle, Kylin
ASF member
About us
Eli Levine @teleturn
PMC member of Phoenix
ASF member
4. Apache Pig
Apache top-level project
Platform for Analyzing Large Datasets
➢ Uses Pig Latin language
○ Relational operators (join, filter)
○ Functional operators (mapreduce)
➢ Runs as MapReduce (also Tez)
➢ ETL
➢ Extensible
5. Outline
Batch compute on Force.com Platform (Eli Levine)
Apache Calcite deep dive (Julian Hyde)
Building Pig adapter for Calcite (Eli Levine)
Q&A
6. Salesforce Platform
Object-relational data model in the cloud
Contains standard objects that users can customize or add their own
SQL-like query language SOQL
- Real-time
- Batch compute
Federated data store: Oracle, HBase, external
User queries span data sources (federated joins)
SELECT DEPT.NAME
FROM EMPLOYEE
WHERE FIRST_NAME = ‘Eli’
7. Salesforce Platform - Batch Compute
Called Async SOQL
- REST API
- Users supply SOQL and info about where to deposit results
SOQL -> Pig Latin script
Pig loaders move data/computation to HDFS for federated query execution
Own SOQL parsing, no Calcite
9. Apache Calcite for Next-Gen Optimizer
- Strong relational algebra foundation
- Support for different physical engines
- Pluggable cost model
- Optimization rules
- Federation-aware
11. Calcite design
Design goals:
Not-just-SQL front end
Federation
Extensibility
Caching / hybrid storage
Materialized views
Design points:
tables on
disk
in-memory
materializations
select x, sum(y)
from t
group by x
12. Planning queries
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Table: splunk
13. Optimized query
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
15. Adapter
Implement SchemaFactory interface
Connect to a data source using
parameters
Extract schema - return a list of tables
Push down processing to the data source:
A set of planner rules
Calling convention (optional)
"schemas": [
{
"name": "HR",
"type": "custom",
"factory":
"org.apache.calcite.adapter.file.FileSchemaFactory",
"operand": {
"directory": "hr-csv"
}
}
]
$ ls -l hr-csv
-rw-r--r-- 1 jhyde staff 62 Mar 29 12:57 DEPTS.csv
-rw-r--r-- 1 jhyde staff 262 Mar 29 12:57 EMPS.csv.gz
$ ./sqlline -u jdbc:calcite:model=hr.json -n scott -p tiger
sqlline> select count(*) as c from emp;
'C'
'5'
1 row selected (0.135 seconds)
16. Calcite Pig Adapter
EMPLOYEE = LOAD 'EMPLOYEE' ... ;
EMPLOYEE = GROUP EMPLOYEE BY (DEPT_ID);
EMPLOYEE = FOREACH EMPLOYEE GENERATE COUNT(EMPLOYEE.DEPT_ID) as DEPT_ID__COUNT_,
group as DEPT_ID;
EMPLOYEE = FILTER EMPLOYEE BY (DEPT_ID__COUNT_ > 10);
SELECT DEPT_ID FROM EMPLOYEE GROUP BY DEPT_ID HAVING COUNT(DEPT_ID) > 10
17. Building the Pig Adapter
1. Implement Pig-specific RelNodes. e.g. PigFilter
2. RelNode factories
3. Write RelOptRules for converting abstract RelNodes to Pig RelNodes
4. Schema implementation
5. Unit tests run local Pig
18. Lessons Learned
Calcite is very flexible (both good and bad)
“Recipe list” would be useful
Lots of examples if you delve into existing adapters - e.g. Druid and Cassandra
Lots available out of the box
Dynamic code generation using Janino -- cryptic errors
RelBuilder was really useful (if you are building non-SQL engine)
ETL flow: directed asycl graph of data transformations
SFDC: enterprise cloud software: CRM, Service, etc. + Platform. Extensibility/customization important.
Platform allows extend and manipulate data for CRM, service, etc. and build fully custom business applications: query, reporting, CRUD, permissions/security, etc.
Customers/ISVs define custom data model and operate on it.
SOQL maps nicely to SQL. Similar to OQL/HQL. Implicit join notation. EMPLOYEE and DEPT could be backed by different data stores.
Current impl translates SOQL into Pig.
Pig is essentially the physical execution plan.
Pig works well for large queries
Large investment into Pig @ SFDC. Operational maturity.
Have started looking at modern engines: Spark, Flink, etc. More efficient, especially for smaller data sets that can be cached into mem and reused between steps
Additionally, want more flexible optimization capabilities, based on runtime info (e.g. stats)
Strong rel algebra foundation: writing query optimizers is hard. Another unseen benefit of Calcite is that it helps write better optimizer code: separation of concerns, clear responsibilities of components, extensible for adding custom rules, operators, etc.
Next gen query planner will allow choice of physical exec engines (Pig for M/R, Phoenix for simple, small queries, Spark) without having to rewrite everything.
Define optimization rules
Cost models based on runtime info (stats, other signals) available
Two rules have fired, filter-join-transpose-rule followed by splunk-filter-rule, and the query has become more efficient, because many fewer rows are coming out of Splunk. To truly optimize the query we would need to fire many more rules, and apply a cost model to choose which of the many equivalent plans is the best.
Transforms SQL queries into Pig scripts
This enables Calcite to build a tree of rel operators that know how to be executed in Pig
Can construct rel expression tree using Calcite’s RelBuilder instead of doing SQL parsing. Useful if don’t need parsing.
Calcite creates abstract expression trees. RelOptRule converters transform them into concrete ones (e.g. Cassandra, Pig)
Schema impl is needed to transform your data model into rel tables for Calcite
Verify script and data correctness
Calcite is a library with numerous components. It’s both good and bad. Good that this functionality is available. Challenge is finding what components actually needed to achieve specific goals.
What would help is something of a recipe list:
“To instruct Calcite how to run a physical plan and retrieve results need to implement EnumerableConverter”
Looking into existing adapters (e.g. Cassandra) useful.
Besides major components, like optimization engine, SQL parser, etc. Lots of reusable parts, especially optimizer rules.
Pig has nested collections (e.g. after GROUP). Easy to flatten for Pig adapter. Pig allows collections to be reused multiple times in same script. Flow does not map to (mostly) linear SQL.