Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13. Framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture; and also introducing the Optiq extensible query optimizer.
Why is data independence (still) so important? Optiq and Apache Drill.
1. Why is data independence
(still) so important?
Julian Hyde @julianhyde
http://github.com/julianhyde/optiq
http://github.com/julianhyde/optiq-splunk
Apache Drill Meetup
2012/9/13
2. Data independence
This is my opinion about data management systems in general. I don't
claim that it is the right answer for Apache Drill.
I claim that a logical/physical separation can make a data management
system more widely applicable, therefore more widely adopted,
therefore better.
What “data independence” means in today's “big data” world.
3. About me
Julian Hyde
Database hacker (Oracle, Broadbase, SQLstream, LucidDB)
Open source hacker (Mondrian, olap4j, LucidDB, Optiq)
@julianhyde
http://github.com/julianhyde
6. “Big Data”
Right data, right time
Diverse data sources / Performance / Suitable format
Volume / Velocity / Variety
Volume – solved :)
Velocity – not one of Drill's goals (?)
Variety – ?
7. Variety
Variety of source formats (csv, avro, json, weblogs)
Variety of storage structures (indexes, projections, sort order, materialized views) – now or in future
Variety of query languages (DrQL, SQL)
Combine with other data (join, union)
Embed within other systems, e.g. Hive
Source for other systems, e.g. Drill | Cascading > Teradata
Tools generate SQL
8. Use case: Optiq* at Splunk
SQL interface on NoSQL system
“Smart” JDBC driver – pushes processing down to Splunk
* Truth in advertising: I am the author of Optiq.
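The "smart" part can be pictured as string translation: when the planner decides that a filter and an aggregation can run inside Splunk, the driver hands Splunk a search pipeline instead of fetching raw rows and processing them itself. A minimal sketch; the helper below is invented for illustration and is not part of Optiq, and the exact search string the real adapter emits may differ.

```java
// Illustrative only: the kind of search pipeline a "smart" JDBC driver
// might hand to Splunk after pushing a filter and an aggregation down.
// toSplunkSearch is a hypothetical helper, not an Optiq API.
public class SplunkPushDown {
  /** Builds a Splunk search pipeline for a filtered, grouped query. */
  static String toSplunkSearch(String field, String value, String groupKey) {
    return "search " + field + "=" + value + " | stats count by " + groupKey;
  }

  public static void main(String[] args) {
    // WHERE action = 'purchase' ... GROUP BY product_name becomes:
    System.out.println(toSplunkSearch("action", "purchase", "product_name"));
    // search action=purchase | stats count by product_name
  }
}
```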
9. Expression tree
SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC
[Diagram: a scan of Table: splunk (in Splunk) and a scan of Table: products (in MySQL) feed a join (Key: product_id), then a filter (Condition: action = 'purchase'), then a group (Key: product_name, Agg: count), then a sort (Key: c DESC).]
10. Expression tree (optimized)
SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC
[Diagram: the filter (Condition: action = 'purchase') has been pushed below the join, so the scan of Table: splunk and the filter both run inside Splunk; the join (Key: product_id) with the MySQL scan of Table: products, the group (Key: product_name, Agg: count), and the sort (Key: c DESC) run above.]
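The rewrite between these two slides can be sketched as a planner rule over a toy expression tree: if a filter sits on top of a join and its condition only touches the Splunk input, push the filter below the join. The Node class and pushFilter rule below are simplified stand-ins invented for illustration, not Optiq's real relational-expression or rule classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of filter push-down: rewrite filter(join(l, r)) to
// join(filter(l), r) when the condition uses only the left input.
public class PushFilterPastJoin {
  static class Node {
    final String op;            // "filter", "join", "scan:..."
    final List<Node> inputs;
    Node(String op, Node... inputs) {
      this.op = op;
      this.inputs = new ArrayList<>(Arrays.asList(inputs));
    }
    @Override public String toString() {
      if (inputs.isEmpty()) return op;
      StringBuilder sb = new StringBuilder(op).append("(");
      for (int i = 0; i < inputs.size(); i++) {
        if (i > 0) sb.append(", ");
        sb.append(inputs.get(i));
      }
      return sb.append(")").toString();
    }
  }

  /** The rule: fires only when the shape matches and the condition
   *  references only the left (Splunk) input. */
  static Node pushFilter(Node root, boolean conditionUsesOnlyLeft) {
    if (root.op.equals("filter")
        && root.inputs.get(0).op.equals("join")
        && conditionUsesOnlyLeft) {
      Node join = root.inputs.get(0);
      Node left = join.inputs.get(0), right = join.inputs.get(1);
      return new Node("join", new Node("filter", left), right);
    }
    return root;  // no match: leave the tree unchanged
  }

  public static void main(String[] args) {
    Node before = new Node("filter",
        new Node("join", new Node("scan:splunk"), new Node("scan:products")));
    System.out.println(pushFilter(before, true));
    // join(filter(scan:splunk), scan:products)
  }
}
```

In a real optimizer the "condition uses only the left input" check is computed from the filter's expression, and rules like this fire repeatedly until the plan stops improving.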
11. Conventional DBMS architecture
[Diagram: JDBC client → JDBC server → SQL parser / validator → Query optimizer → Data-flow operators → Data. Metadata feeds the parser/validator and the optimizer.]
12. Drill architecture
[Diagram: DrQL client → DrQL parser / validator → ? → Data-flow operators → Data. Metadata feeds the parser/validator; the "?" marks the optimizer layer, which is still an open question.]
13. Optiq architecture
[Diagram: JDBC client → JDBC server (optional) → SQL parser / validator (optional) → Query optimizer (core) → Pluggable ops and 3rd-party ops → 3rd-party data. A Metadata SPI and pluggable planner rules plug into the optimizer.]
15. Conclusions
Clear logical / physical separation allows a data
management system to handle a wider variety of data,
query languages, and packaging.
Also provides a clear interface between the sub-teams
working on query language and operators.
A query optimizer allows new operators, and alternative
algorithms and data structures, to be easily added to
the system.
17. Writing an adapter
Driver – if you want a vanity URL like “jdbc:drill:”
Schema – describes what tables exist
Table – describes the columns, and how to get the data
Operators (optional) – non-relational operators, if any
Rules (optional, but recommended) – improve efficiency by changing the question
Parser (optional) – additional source languages
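The division of labor between Schema and Table can be modeled in a few lines of Java. These interfaces are simplified stand-ins invented for this sketch, not Optiq's actual SPI: a Schema names the tables, and each Table describes its columns and knows how to produce rows.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of an adapter SPI: not Optiq's real interfaces.
public class AdapterSketch {
  interface Table {
    List<String> columns();
    Iterable<Object[]> rows();
  }

  interface Schema {
    Map<String, Table> tables();
  }

  /** A schema over in-memory data, standing in for e.g. a CSV adapter. */
  static Schema memorySchema() {
    Table products = new Table() {
      public List<String> columns() {
        return Arrays.asList("product_id", "product_name");
      }
      public Iterable<Object[]> rows() {
        return Arrays.asList(
            new Object[] {1, "widget"},
            new Object[] {2, "gadget"});
      }
    };
    Map<String, Table> tables = new HashMap<>();
    tables.put("products", products);
    return () -> tables;   // Schema is a single-method interface
  }

  public static void main(String[] args) {
    Table t = memorySchema().tables().get("products");
    System.out.println(t.columns());  // [product_id, product_name]
  }
}
```

A real adapter would add the optional pieces listed above on top of this: operators for non-relational work, and rules so the optimizer can push work into the underlying system.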
Editor's Notes
The obligatory “big data” definition slide. What is “big data”? It's not really about “big”. We need to access data from different parts of the organization, when we need it (which often means we don't have time to copy it), and the performance needs to be reasonable. If the data is large, it is often larger than the disks one can fit on one machine. It helps if we can process the data in place, leveraging the CPU and memory of the machines where the data is stored. We'd rather not copy it from one system to another. It needs to be flexible, to deal with diverse systems and formats. That often means that open source is involved. Some systems (e.g. reporting tools) can't easily be changed to accommodate new formats. So it helps if the data can be presented in standard formats, e.g. SQL.
It's much more efficient if we push filters and aggregations down to Splunk. But the user writing SQL shouldn't have to worry about that. This is not about processing data; it is about processing expressions, reformulating the question. The question is the parse tree of a query, and the parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees), built up of the basic relational operators: think of the SQL SELECT, WHERE, JOIN, GROUP BY, and ORDER BY clauses.
A conventional database has an ODBC/JDBC driver, a SQL parser, data sources, an expression tree, expression transformation rules, and an optimizer. For NoSQL databases, the language may not be SQL, and the optimizer may be less sophisticated, but the picture is basically the same. For frameworks, such as Hadoop, there is no planner; you end up writing code (e.g. MapReduce jobs).
In Optiq, the query optimizer (we modestly call it the planner) is central. The JDBC driver/server and SQL parser are optional; skip them if you have another language. Plug-ins provide metadata (the schema), planner rules, and runtime operators. There are built-in relational operators and rules, and there are built-in operators implemented in Java. But to access data, you need to provide at least one operator.